10 LLM Interview Questions That Catch People Off Guard
Surface-level LLM knowledge gets exposed fast. These 10 questions separate candidates who read papers from those who only watched YouTube.
1. Why does temperature=0 still sometimes give non-deterministic output?
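Floating-point addition is not associative, and GPU kernels do not guarantee the order in which partial sums are reduced. That order can change with batch size, kernel selection, or hardware, so the same prompt can produce logits that differ in the last few bits. When the top two logits are nearly tied, that is enough to flip the argmax.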
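A minimal sketch of the effect on CPU floats, no GPU required. The two "runs" and their logits are hypothetical values chosen for illustration, not output from a real model.

```python
# Floating-point addition is not associative: the grouping changes the rounding.
a, b, c = 1e16, -1e16, 1.0
sum_left = (a + b) + c    # 0.0 + 1.0
sum_right = a + (b + c)   # the 1.0 is rounded away when added to -1e16
print(sum_left, sum_right)  # 1.0 0.0

small_left = (0.1 + 0.2) + 0.3
small_right = 0.1 + (0.2 + 0.3)
print(small_left == small_right)  # False


def argmax(xs):
    """Index of the largest value; the first index wins on an exact tie."""
    return max(range(len(xs)), key=lambda i: xs[i])


# Hypothetical logits for two tokens. The second logit is the same sum,
# accumulated in a different order on each run.
logits_run_1 = [0.6, small_left]
logits_run_2 = [0.6, small_right]
print(argmax(logits_run_1), argmax(logits_run_2))  # 1 0
```

Greedy decoding picks a different token on each run even though nothing about the inputs changed.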
2. What’s the actual VRAM cost of a 70B model at inference?
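Base weights at bf16: 70B parameters × 2 bytes = 140 GB. KV-cache for one 8K-context request: roughly 2.7 GB for a Llama-2-70B-style model with GQA (80 layers, 8 KV heads, head dimension 128), and about 8x that with full MHA. Total per replica: roughly 145-150 GB once activations and runtime overhead are included. That fits on 2x A100 80G with little headroom. A single H100 80G needs quantization, and INT8 (about 70 GB of weights) leaves only around 10 GB for cache and overhead, so it is a tight fit.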
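The arithmetic is worth being able to reproduce on a whiteboard. The sketch below assumes Llama-2-70B-like shapes (80 layers, 64 query heads, 8 KV heads, head dimension 128); the cache figure changes a lot with the number of KV heads and the cache precision, so treat the output as an estimate for that configuration only. The function names are made up for this example.

```python
def weights_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GB (1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_value: int = 2, batch: int = 1) -> float:
    """KV-cache size in GB; the leading 2 accounts for storing both K and V."""
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch
    return values * bytes_per_value / 1e9


# Assumed shapes: 80 layers, head_dim 128, 8 KV heads (GQA) vs 64 (MHA).
bf16_weights = weights_gb(70e9, 2)
int8_weights = weights_gb(70e9, 1)
cache_gqa = kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=8192)
cache_mha = kv_cache_gb(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=8192)

print(f"bf16 weights: {bf16_weights:.0f} GB")       # 140 GB
print(f"int8 weights: {int8_weights:.0f} GB")       # 70 GB
print(f"8K KV-cache with GQA: {cache_gqa:.1f} GB")  # about 2.7 GB
print(f"8K KV-cache with MHA: {cache_mha:.1f} GB")  # about 21.5 GB
```

Activations, framework overhead, and extra concurrent requests come on top of these numbers.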
3. Why is GQA used instead of MHA in modern models?
Multi-Head Attention keeps a separate K and V for every head, and those tensors dominate KV-cache memory at inference. GQA shares each K/V head across a group of query heads (e.g., 32 query heads share 8 KV heads). In that example KV-cache memory drops 4x, and the reported quality loss is small.
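A small sketch of the head-sharing pattern from the 32/8 example. It assumes contiguous grouping (query heads 0-3 use KV head 0, and so on), which is a common convention but an assumption here; `kv_head_for` is a hypothetical helper, not a library function.

```python
def kv_head_for(query_head: int, n_query_heads: int, n_kv_heads: int) -> int:
    """Map a query head to the KV head it shares, assuming contiguous groups."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be a multiple of n_kv_heads")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


n_query_heads, n_kv_heads = 32, 8
mapping = [kv_head_for(h, n_query_heads, n_kv_heads) for h in range(n_query_heads)]
print(mapping[:8])  # [0, 0, 0, 0, 1, 1, 1, 1]

# Only the KV heads are cached, so the cache shrinks by the group size.
cache_reduction = n_query_heads / n_kv_heads
print(cache_reduction)  # 4.0
```

Setting `n_kv_heads` equal to `n_query_heads` recovers MHA, and setting it to 1 gives multi-query attention.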
4. What does rope_theta control in Llama 3?
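The base of the rotary-embedding frequencies. Llama 3 sets it to 500,000 (vs 10,000 in Llama 2), which lengthens the wavelengths of the rotary dimensions and makes long contexts easier to support. Llama 3.1 keeps that value and adds RoPE frequency scaling plus long-context training to reach 128K. RoPE has no learned position embeddings, so there is nothing to retrain from scratch; the model only has to adapt to the new frequencies.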
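To see what the base changes, here is a sketch of the standard RoPE frequency schedule, inv_freq_i = theta^(-2i/d). The head dimension of 128 is an assumption, and the function names are illustrative.

```python
import math


def rope_inv_freq(head_dim: int, theta: float) -> list[float]:
    """Per-pair rotation frequencies: theta^(-2i/d) for i in [0, d/2)."""
    return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim: int, theta: float) -> float:
    """Number of positions the slowest-rotating pair needs for one full turn."""
    return 2 * math.pi / min(rope_inv_freq(head_dim, theta))


wavelength_10k = longest_wavelength(128, 10_000)
wavelength_500k = longest_wavelength(128, 500_000)
print(f"theta=10,000:  {wavelength_10k:,.0f} positions")
print(f"theta=500,000: {wavelength_500k:,.0f} positions")
```

With the smaller base, the slowest pair completes a full rotation in well under 128K positions; with the larger base it does not come close. That is only one way to look at it, and a longer wavelength alone does not guarantee good long-context quality.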
5. When does Flash Attention NOT help?
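Short sequences, roughly under ~512 tokens. The attention score matrix is then small enough that reading and writing it is not the bottleneck, so you don't hit the memory wall the technique was designed to break, and fixed kernel overhead can eat whatever gain remains. The exact crossover depends on hardware, head dimension, and batch size.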
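The memory wall in question is the N×N score matrix that standard attention materializes. A rough sketch of its size per layer, assuming 32 heads, batch size 1, and 2-byte values (all assumptions for illustration):

```python
def score_matrix_bytes(seq_len: int, n_heads: int,
                       batch: int = 1, bytes_per_value: int = 2) -> int:
    """Bytes needed to materialize the full attention score matrix for one layer."""
    return batch * n_heads * seq_len * seq_len * bytes_per_value


for n in (512, 2048, 8192, 32768):
    mib = score_matrix_bytes(n, n_heads=32) / 2**20
    print(f"N={n:>6}: {mib:>8.0f} MiB")
# N=   512:       16 MiB
# N=  2048:      256 MiB
# N=  8192:     4096 MiB
# N= 32768:    65536 MiB
```

At 512 tokens the matrix is small enough that avoiding it buys little; the quadratic growth is what makes tiling pay off at longer lengths.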
6. Why is BPE language-dependent in practice?
The vocab is learned from frequencies in the training corpus. English-trained tokenizers split Hindi/Tamil/Chinese into many more tokens, making non-English inference more expensive and slower.
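A rough way to see where the gap starts, assuming a byte-level BPE tokenizer: before any merges are applied, the sequence length is the UTF-8 byte count, and merges only shorten it for patterns that were frequent in the training corpus. This sketch uses only the standard library and is not a real tokenizer; actual token counts depend on the learned vocabulary.

```python
def utf8_len(text: str) -> int:
    """Sequence length a byte-level tokenizer starts from, before any merges."""
    return len(text.encode("utf-8"))


english = "hello"
hindi = "\u0928\u092e\u0938\u094d\u0924\u0947"  # "namaste" written in Devanagari

print(len(english), utf8_len(english))  # 5 5
print(len(hindi), utf8_len(hindi))      # 6 18
```

Devanagari code points take three bytes each, so the Hindi word starts more than three times longer than a similar-length English word. With few Hindi merges in the vocabulary, most of that length survives tokenization.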
7. Why does the FFN have 4x expansion?
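Largely a convention from the original Transformer (d_model = 512, d_ff = 2048) that has held up empirically as a trade-off between capacity and compute. Smaller expansion costs capacity; larger expansion shows diminishing returns. It is not universal: gated FFNs such as SwiGLU in Llama use three weight matrices instead of two, so they shrink the expansion to roughly 8/3x to keep the parameter count about the same.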
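The parameter budget behind the ratio is simple arithmetic. The sketch below compares a standard two-matrix FFN at 4x against a gated three-matrix FFN (the SwiGLU style) at 8/3x, ignoring biases; d_model = 4096 is an assumed example size, and real models often round d_ff to a hardware-friendly multiple.

```python
def ffn_params_standard(d_model: int, expansion: float = 4.0) -> int:
    """Up-projection plus down-projection: two d_model x d_ff matrices."""
    d_ff = int(expansion * d_model)
    return 2 * d_model * d_ff


def ffn_params_gated(d_model: int, expansion: float = 8 / 3) -> int:
    """Gate, up, and down projections: three d_model x d_ff matrices."""
    d_ff = int(expansion * d_model)
    return 3 * d_model * d_ff


standard = ffn_params_standard(4096)
gated = ffn_params_gated(4096)
print(f"standard 4x: {standard:,}")  # 134,217,728
print(f"gated 8/3x:  {gated:,}")     # 134,209,536
```

Both come out at about 8 × d_model² parameters, which is why a gated FFN with a smaller ratio is usually compared against a 4x baseline.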
8. What’s the difference between top-k and top-p sampling?
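top-k: keep the k highest-probability tokens, renormalize, and sample from those. top-p (nucleus): keep the smallest set of top tokens whose cumulative probability reaches p, renormalize, and sample from those. top-p adapts to confidence: when the model is confident, the set is small; when the distribution is flat, it grows.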
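A minimal pure-Python sketch of both filters over a toy distribution. The token probabilities are made-up values; real implementations operate on logit tensors, but the logic is the same.

```python
import random


def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest top-ranked set whose cumulative mass reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


def sample(probs: dict[str, float]) -> str:
    """Draw one token from a (renormalized) distribution."""
    return random.choices(list(probs), weights=list(probs.values()))[0]


confident = {"Paris": 0.92, "Lyon": 0.04, "Nice": 0.02, "Rome": 0.02}
uncertain = {"red": 0.30, "blue": 0.25, "green": 0.25, "gray": 0.20}

print(len(top_k_filter(confident, 2)), len(top_k_filter(uncertain, 2)))      # 2 2
print(len(top_p_filter(confident, 0.9)), len(top_p_filter(uncertain, 0.9)))  # 1 4
print(sample(top_p_filter(confident, 0.9)))  # Paris
```

top-k keeps two candidates in both cases, while top-p keeps one when the model is confident and all four when it is not.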
9. Why does use_cache=True break gradient checkpointing?
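Gradient checkpointing discards intermediate activations and recomputes them by re-running the forward pass during backward. A KV-cache carries state across forward calls to speed up generation, which buys nothing in a training step and conflicts with that recomputation. Hugging Face Transformers treats the two as incompatible: with checkpointing enabled during training, it warns and sets use_cache=False. Use use_cache=False for training and turn it back on for inference.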
10. What’s the role of the [BOS] token?
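Marks the sequence start and gives the model a stable starting state. In decoder-only models, omitting it can degrade quality because the model was trained to condition on it. Include it whenever the model was trained with one. Many tokenizers add it automatically, but not all, so check the encoded IDs, and make sure a chat template does not add a second copy.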