Large Language Models power modern AI products. Interviews test your understanding of transformer internals, generation strategies, prompt engineering, and production challenges like hallucination and latency.
27 Interview Questions
What is speculative decoding and how does it speed up inference?
Model Answer
Speculative decoding uses a small "draft" model to generate several candidate tokens quickly, then has the larger target model verify them in parallel. Because the target model can score all candidate positions in a single forward pass, whereas ordinary generation needs one pass per token, you get significant speedups when the draft model's predictions are usually accepted. With the standard rejection-sampling acceptance rule, the output distribution matches that of the target model alone, so quality is preserved. Google Research and DeepMind reported roughly 2-3x speedups. The technique is now common in production inference stacks.
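The draft-then-verify loop can be sketched as follows. This is a minimal greedy variant, not the rejection-sampling rule used with sampled outputs, and `draft_model` and `target_model` are hypothetical toy functions standing in for real networks.

```python
from typing import Callable, List

# Hypothetical toy "models": each maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]

def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int = 4) -> List[int]:
    """One round of greedy speculative decoding; returns the newly accepted tokens."""
    # 1) The draft model proposes k tokens autoregressively (cheap).
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2) The target model checks every proposed position. A real system does
    #    this in one batched forward pass; the loop here only mimics the result.
    accepted: List[int] = []
    for i in range(k):
        expected = target(prefix + proposed[:i])
        if proposed[i] != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(proposed[i])

    # 3) All k drafts accepted: the same target pass also yields one bonus token.
    accepted.append(target(prefix + proposed))
    return accepted

def target_model(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10          # toy rule: count upward mod 10

def draft_model(tokens: List[int]) -> int:
    nxt = (tokens[-1] + 1) % 10
    return 0 if nxt == 7 else nxt         # disagrees whenever the target says 7

new_tokens = speculative_step([3], draft_model, target_model, k=4)
```

Each round emits between 1 and k+1 tokens for a single target pass, and the result is always what the target model alone would have produced greedily.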
What is Constitutional AI and how does Anthropic use it?
Model Answer
Constitutional AI (CAI) is a training method where the model is guided by a set of written principles (a "constitution") rather than by human feedback alone. The supervised phase: 1) Generate responses, 2) Have the AI critique its own responses against the constitution, 3) Revise based on the critiques, 4) Fine-tune on the revised responses. A reinforcement learning phase follows (RLAIF): the AI, not a human, judges which of two responses better follows the constitution, and those preferences train the reward model used for RL. This reduces reliance on human labelers for identifying harmful content. Anthropic uses CAI to train Claude, aiming for a model that is helpful, harmless, and honest without needing human feedback on every harmful output.
What is chain-of-thought prompting and when should you use it?
Model Answer
Chain-of-thought (CoT) prompting encourages the model to show its reasoning steps before giving a final answer. Example: "Q: Roger has 5 tennis balls... Let's think step by step: Roger starts with 5 balls, buys 2 cans of 3 balls each, so 2×3=6 new balls, 5+6=11 total. A: 11." CoT significantly improves performance on math, reasoning, and logical tasks. Use when: the task requires multi-step reasoning, arithmetic, or logical deduction. Zero-shot CoT: just append "Let's think step by step" to your prompt.
What is the difference between top-k and top-p sampling?
Model Answer
Top-k keeps the k highest-probability tokens and samples from those (k is a fixed integer). Top-p (nucleus) keeps the smallest set of tokens whose probabilities sum to at least p (p in (0, 1]); the set grows when the model is uncertain and shrinks when it is confident, so it adapts to the entropy of the distribution. In both cases the kept probabilities are renormalized before sampling. Setups often combine both, for example top-k=50 as a hard ceiling and top-p=0.95 to trim the long tail. Pure greedy decoding (k=1 or temperature=0) picks the highest-probability token at every step but tends to produce repetitive, often degraded text on creative tasks.
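A minimal sketch of combining the two filters, assuming the next-token distribution is given as a plain dict of probabilities. The function names are illustrative, and applying top-k before top-p (with renormalization in between) is one common ordering, not the only one.

```python
import random
from typing import Dict

def top_k_top_p_filter(probs: Dict[str, float], k: int = 50,
                       p: float = 0.95) -> Dict[str, float]:
    """Keep the k most likely tokens, then the smallest nucleus reaching mass p."""
    # Top-k: hard ceiling on the number of candidates.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    mass = sum(prob for _, prob in ranked)

    # Top-p: walk down the ranking until cumulative (renormalized) mass >= p.
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob / mass
        if cumulative >= p:
            break

    # Renormalize the survivors so they form a valid distribution.
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

def sample(probs: Dict[str, float], rng: random.Random) -> str:
    """Draw one token from the filtered distribution."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

next_token_probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered = top_k_top_p_filter(next_token_probs, k=3, p=0.7)
token = sample(filtered, random.Random(0))
```

With a peaked distribution the nucleus may contain a single token; with a flat one it can grow up to the top-k ceiling.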
What are the key differences between encoder-only, decoder-only, and encoder-decoder architectures?
Model Answer
Encoder-only (BERT): bidirectional attention, sees all tokens at once, good for classification, NER, QA. Cannot generate text autoregressively. Decoder-only (GPT): causal/unidirectional attention, generates tokens left-to-right, best for generation tasks. Encoder-decoder (T5, BART): encoder processes input with bidirectional attention, decoder generates output with cross-attention to encoder. Best for seq2seq tasks like translation, summarization. Most modern LLMs are decoder-only as generation is the dominant use case.
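The difference between bidirectional and causal attention comes down to the attention mask. A small sketch, using a hypothetical helper `attention_mask` that returns a boolean matrix rather than any particular library's tensor format:

```python
from typing import List

def attention_mask(seq_len: int, causal: bool) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[(j <= i) or not causal for j in range(seq_len)]
            for i in range(seq_len)]

encoder_mask = attention_mask(3, causal=False)  # every token sees every token
decoder_mask = attention_mask(3, causal=True)   # token i sees only tokens 0..i
```

In an encoder-decoder model the encoder uses the first kind, while the decoder's self-attention uses the second and its cross-attention sees the full encoder output.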
What is context window and how do long-context models work?
Model Answer
Context window is the maximum number of tokens a model can process at once (input + output). GPT-4 Turbo: 128k tokens (~96k words). Techniques for long context: Sliding window attention (Mistral, Longformer), where each token attends only to nearby tokens (Longformer adds a few global tokens). RoPE with a large base frequency plus frequency scaling (Llama 3.1 uses rope_theta=500000 with RoPE scaling for 128k context). ALiBi linear biases. Challenges: quadratic attention complexity (Flash Attention cuts the memory cost and wall-clock time, though compute stays quadratic), the "lost in the middle" problem, and the difficulty of evaluating long contexts. In practice: RAG often beats long context for retrieval tasks because it's cheaper and more precise.
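To see why sliding window attention helps with the quadratic cost, here is a small sketch of a causal sliding-window mask. The helper names are made up for illustration, and global tokens (as in Longformer) are left out.

```python
from typing import List

def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """Causal mask where position i attends only to the last `window` positions."""
    return [[0 <= i - j < window for j in range(seq_len)]
            for i in range(seq_len)]

def attended_pairs(mask: List[List[bool]]) -> int:
    """Number of (query, key) pairs that must be computed."""
    return sum(sum(row) for row in mask)

local = sliding_window_mask(seq_len=4, window=2)
full_causal = sliding_window_mask(seq_len=4, window=4)
```

The number of attended pairs grows roughly as seq_len × window instead of seq_len², which is where the savings come from on long inputs.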
Why is there a [BOS] token in most decoder-only models?
Model Answer
The beginning-of-sequence token marks the start of a sequence and gives the model a fixed conditioning state. Many decoder-only models are trained with [BOS] prepended to every sequence, so omitting it at inference time shifts the input distribution and can degrade quality. Some tokenizers (e.g., Llama's) add it automatically; others (e.g., the Hugging Face GPT-2 tokenizer) do not by default, so check before adding it yourself, since a doubled [BOS] is also off-distribution. The [BOS] position is also what the first real token attends to before any other context exists, providing a stable "I am starting" representation.
Explain prompt caching and when to use it.
Model Answer
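Prompt caching stores the model's internal state (KV-cache) for a stable prompt prefix so subsequent requests with the same prefix skip recomputation. Anthropic exposes it through explicit cache_control markers in the request body; OpenAI applies it automatically to sufficiently long prompts. Best uses: long system prompts shared across users, RAG with a stable context document, agents replaying tool definitions. Wins: providers report large latency reductions on the cached prefix (figures of 50-90% are commonly quoted) and discounted pricing for cached input tokens (about 10% of the normal input price for cache reads on Anthropic). Caveats: minimum prefix length, short TTL (~5 min by default on Anthropic), and only an EXACT prefix match hits the cache: a single changed token invalidates everything from that point on.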
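The exact-prefix behaviour can be illustrated with a toy cache. This is a conceptual sketch only: `PrefixCache` is a hypothetical class, the "state" is a placeholder string rather than a real KV-cache, and TTL and minimum-length rules are omitted.

```python
from typing import Dict, List, Tuple

class PrefixCache:
    """Toy stand-in for a KV-cache keyed by an exact token prefix."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, ...], str] = {}
        self.tokens_computed = 0

    def _prefill(self, tokens: List[str]) -> str:
        # Placeholder for the expensive forward pass over `tokens`.
        self.tokens_computed += len(tokens)
        return f"state[{len(tokens)}]"

    def run(self, prefix: List[str], suffix: List[str]) -> int:
        """Process prefix + suffix; return how many tokens had to be computed."""
        before = self.tokens_computed
        key = tuple(prefix)
        if key not in self._store:        # miss: pay for the whole prefix
            self._store[key] = self._prefill(prefix)
        self._prefill(suffix)             # the new suffix is always computed
        return self.tokens_computed - before

cache = PrefixCache()
system_prompt = ["You", "are", "a", "helpful", "assistant"]
first = cache.run(system_prompt, ["Hi"])           # cold: prefix + suffix
second = cache.run(system_prompt, ["Bye", "now"])  # warm: suffix only
```

Changing any token in the prefix produces a different key, so the request falls back to a full prefill.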
How does RLHF work and what are its limitations?
Model Answer
RLHF (Reinforcement Learning from Human Feedback): 1) Supervised fine-tuning on demonstrations, 2) Train a reward model on human preference comparisons, 3) Use PPO to optimize the policy (LLM) to maximize the reward model's score, with a KL penalty that keeps the policy close to the fine-tuned starting model. Limitations: reward hacking (the model finds ways to maximize reward that don't reflect actual quality), the reward model can be fooled, human annotators have biases and inconsistencies, the process is expensive and slow to scale, and tuning PPO together with the KL penalty adds complexity.
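Two pieces of this pipeline reduce to short formulas. The sketch below assumes the usual pairwise (Bradley-Terry style) reward-model loss and a per-sample KL-shaped reward; the function names and the default `beta` are illustrative, not taken from any specific implementation.

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Reward-model loss for one comparison: -log sigmoid(r_chosen - r_rejected)."""
    return math.log1p(math.exp(-(r_chosen - r_rejected)))

def kl_shaped_reward(reward: float, logp_policy: float,
                     logp_reference: float, beta: float = 0.1) -> float:
    """Reward seen by PPO: reward-model score minus a penalty for drifting
    away from the reference (SFT) model on this sample."""
    return reward - beta * (logp_policy - logp_reference)

loss_good = preference_loss(r_chosen=2.0, r_rejected=0.0)  # ranking is right
loss_bad = preference_loss(r_chosen=0.0, r_rejected=2.0)   # ranking is wrong
shaped = kl_shaped_reward(reward=1.0, logp_policy=-1.0,
                          logp_reference=-2.0, beta=0.5)
```

The loss is small when the reward model ranks the human-preferred response higher, and the KL term lowers the reward when the policy assigns much more probability to its output than the reference model does.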
What is the role of rope_theta in modern long-context models?
Model Answer
rope_theta is the base frequency of Rotary Position Embeddings (RoPE). Higher theta = slower rotation per position = embeddings remain "different enough" at large positions. Llama 2 used 10,000 (good up to 4K). Llama 3 raised it to 500,000, and Llama 3.1 keeps that value and adds RoPE frequency scaling to reach 128K. You can scale theta after training (NTK-aware scaling, dynamic NTK) or interpolate positions (PI) to stretch context further at inference. Trade-off: stretching context this way can slightly hurt performance in the original short-context range.
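The effect of theta on rotation speed can be checked numerically. A minimal sketch assuming the standard RoPE frequency schedule theta^(-2i/d); the helper names and the head dimension of 128 are illustrative choices.

```python
import math
from typing import List

def rope_angles(position: int, head_dim: int, theta: float = 10000.0) -> List[float]:
    """Rotation angle (radians) applied to each dimension pair at `position`."""
    return [position * theta ** (-2.0 * i / head_dim)
            for i in range(head_dim // 2)]

def longest_wavelength(head_dim: int, theta: float) -> float:
    """Positions needed for the slowest-rotating pair to complete one full turn."""
    slowest_freq = theta ** (-2.0 * (head_dim // 2 - 1) / head_dim)
    return 2.0 * math.pi / slowest_freq

short_ctx = longest_wavelength(head_dim=128, theta=10_000.0)
long_ctx = longest_wavelength(head_dim=128, theta=500_000.0)
```

A larger theta stretches the longest wavelength, so the slow dimensions do not wrap around within a much longer sequence.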
Explain the concept of scaling laws in LLMs.
Model Answer
Scaling laws (Kaplan et al., Chinchilla) describe predictable relationships between model size, training data, compute budget, and model performance. Key finding: loss falls as a power law in each of these. The Chinchilla paper showed that for a given compute budget, model size and training tokens should be scaled up in roughly equal proportion, which works out to about 20 tokens per parameter (e.g., the 70B Chinchilla model trained on 1.4T tokens outperforms the 280B Gopher model trained on 300B tokens at a similar compute budget). This guides decisions on how to size training runs.
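A back-of-the-envelope sizing calculation, assuming two widely used approximations: training compute C ≈ 6·N·D FLOPs and a compute-optimal ratio of about 20 tokens per parameter. Both are rules of thumb, and the function name is hypothetical.

```python
import math
from typing import Tuple

def chinchilla_optimal(compute_flops: float,
                       tokens_per_param: float = 20.0) -> Tuple[float, float]:
    """Return (parameters, training tokens) for a given compute budget.

    Solves C = 6 * N * D with D = tokens_per_param * N.
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budget matching a 70B model trained on 1.4T tokens under C = 6 * N * D.
budget = 6.0 * 70e9 * 1.4e12
params, tokens = chinchilla_optimal(budget)
```

Doubling the compute budget under this rule scales both parameters and tokens by about √2 rather than putting everything into a larger model.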
What is a token in the context of LLMs?
Model Answer
A token is the basic unit of text that LLMs process. It can be a word, part of a word, or punctuation depending on the tokenizer. For example, "ChatGPT" might be split into ["Chat", "G", "PT"]. The most common tokenization algorithm is BPE (Byte Pair Encoding). With GPT-4's tokenizer, one token averages roughly 4 characters of English text. Token counts determine how much fits in the model's context window and how API usage is priced.
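To make BPE concrete, here is a sketch of a single training step on a tiny corpus: count adjacent symbol pairs, then merge the most frequent one. It is a simplified character-level illustration with made-up helper names; real tokenizers work on bytes, weight pairs by word frequency, and repeat the step many thousands of times.

```python
from collections import Counter
from typing import List, Tuple

def most_frequent_pair(words: List[List[str]]) -> Tuple[str, str]:
    """Find the adjacent symbol pair that occurs most often across all words."""
    pairs: Counter = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    # Ties are broken by first occurrence (dict insertion order).
    return max(pairs, key=pairs.get)

def merge_pair(symbols: List[str], pair: Tuple[str, str]) -> List[str]:
    """Replace every occurrence of `pair` in one word with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

corpus = [list("low"), list("lower"), list("lowest")]
best = most_frequent_pair(corpus)
corpus = [merge_pair(word, best) for word in corpus]
```

Each merge adds one entry to the vocabulary, which is how frequent substrings end up as single tokens while rare words stay split into pieces.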
What is the difference between GPT-3 and GPT-4 architecturally?
Model Answer
GPT-4 is a multimodal model that accepts image and text inputs and has a larger context window (8k and 32k at launch, up to 128k tokens with Turbo). OpenAI has not published its architecture; it is widely rumored, but unconfirmed, to be a Mixture of Experts (MoE) model with ~8 experts. GPT-3 is a dense transformer with 175B parameters, text-only, with a 2k (2,048-token) context window. GPT-4 has significantly improved reasoning, fewer hallucinations, and passes professional exams at high percentiles.
What is temperature in LLM sampling and how does it affect output?
Model Answer
Temperature controls randomness in sampling. Low temperature (0.1-0.3): more deterministic, favors the highest-probability tokens, outputs are focused but can be repetitive. High temperature (0.8-1.2): more random, samples lower-probability tokens more often, more creative and diverse. Temperature scales the logits before softmax: logits/temperature. Temperature=1.0 leaves the distribution unchanged. Temperature=0 cannot be used as a divisor, so implementations treat it as greedy decoding (always pick the highest-probability token). Use low temperature for factual tasks, high temperature for creative writing.
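The logits/temperature scaling can be sketched in a few lines. The function name is illustrative, and treating temperature ≤ 0 as greedy is a common convention assumed here rather than a universal rule.

```python
import math
from typing import List

def softmax_with_temperature(logits: List[float],
                             temperature: float = 1.0) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        # Greedy convention: all probability mass on the argmax token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=0.2)  # mass piles on token 0
base = softmax_with_temperature(logits, temperature=1.0)   # unchanged softmax
flat = softmax_with_temperature(logits, temperature=2.0)   # closer to uniform
```

Lower temperatures exaggerate the gaps between logits, while higher ones shrink them, which is why the top token's probability falls as the temperature rises.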