Fine-tuning adapts pre-trained models for specific tasks. Interviews focus on parameter-efficient methods (LoRA, QLoRA), RLHF, DPO, data preparation, and evaluation.
14 Interview Questions
How do you prepare a dataset for instruction fine-tuning?
Model Answer
Format: convert each example to an instruction-following format (system prompt + user instruction + assistant response). Common formats: ChatML (<|im_start|>system ...<|im_end|>), Llama chat format, Alpaca format. Data quality checklist: remove duplicates, filter short/low-quality responses, ensure diverse topics, balance difficulty levels. Data sources: Alpaca (generated with OpenAI's text-davinci-003), OpenAssistant (human conversations), FLAN (academic tasks), ShareGPT (user-shared ChatGPT conversations). Minimum size, as a rule of thumb: about 1K examples for domain fine-tuning, 10K+ for general instruction following. Always evaluate on a held-out set.
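The checklist above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production pipeline: the field names (`instruction`, `response`), the 20-character minimum, and the every-tenth-example hold-out rule are assumptions made for the example, and real pipelines usually add near-duplicate detection and quality scoring.

```python
import hashlib

def to_chatml(system: str, user: str, assistant: str) -> str:
    """Render one example in ChatML-style markup."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n{assistant}<|im_end|>\n"
    )

def clean(examples, min_response_chars=20):
    """Drop very short responses and exact duplicates (after case/whitespace normalization)."""
    seen = set()
    kept = []
    for ex in examples:
        response = ex["response"].strip()
        if len(response) < min_response_chars:
            continue  # filter short, low-information responses
        normalized = " ".join((ex["instruction"] + " " + response).lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in seen:
            continue  # duplicate of an example we already kept
        seen.add(key)
        kept.append({"instruction": ex["instruction"].strip(), "response": response})
    return kept

def split_holdout(examples, holdout_every=10):
    """Deterministic split: every Nth example goes to the held-out evaluation set."""
    train = [ex for i, ex in enumerate(examples) if i % holdout_every != 0]
    held_out = [ex for i, ex in enumerate(examples) if i % holdout_every == 0]
    return train, held_out

raw = [
    {"instruction": "Explain LoRA in one sentence.",
     "response": "LoRA trains small low-rank matrices added to frozen weights."},
    {"instruction": "Explain LoRA in one sentence. ",
     "response": "LoRA trains small low-rank  matrices added to frozen weights."},
    {"instruction": "What is RLHF?", "response": "ok"},
]
cleaned = clean(raw)
rendered = to_chatml("You are a helpful assistant.",
                     cleaned[0]["instruction"], cleaned[0]["response"])
```

Here the second record is dropped as a duplicate and the third as too short, leaving one example to render.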
What is catastrophic forgetting and how do you mitigate it in fine-tuning?
Model Answer
Catastrophic forgetting: when fine-tuning on new data, the model loses previously learned knowledge or capabilities. Mitigation strategies: Elastic Weight Consolidation (EWC), which penalizes changes to weights that matter for earlier tasks; replay, which mixes original-style training data with the new data; LoRA, which updates only low-rank matrices and leaves the base weights frozen, reducing (though not eliminating) forgetting; a low learning rate (1e-5 to 1e-4); few training epochs (1-3); and evaluation on general benchmarks alongside domain-specific ones to detect forgetting. For instruction fine-tuning: mix general instruction data (Alpaca, FLAN) with the domain-specific data.
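The replay strategy can be illustrated with a small data-mixing helper. This is a sketch under stated assumptions: the function name `mix_with_replay` and the 20% replay fraction are hypothetical choices for the example, and the right ratio has to be tuned against a general-capability evaluation.

```python
import random

def mix_with_replay(domain_examples, general_examples, replay_fraction=0.2, seed=0):
    """Return a shuffled training set where about `replay_fraction` of items are general replay data."""
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_replay / (n_domain + n_replay) = replay_fraction for n_replay.
    n_replay = round(len(domain_examples) * replay_fraction / (1.0 - replay_fraction))
    n_replay = min(n_replay, len(general_examples))  # cannot sample more than we have
    mixed = list(domain_examples) + rng.sample(list(general_examples), n_replay)
    rng.shuffle(mixed)
    return mixed

domain = [f"legal-{i}" for i in range(80)]
general = [f"general-{i}" for i in range(1000)]
train_set = mix_with_replay(domain, general, replay_fraction=0.2)
```

With 80 domain examples and a 20% target, the helper adds 20 general examples for a training set of 100.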
What is DPO (Direct Preference Optimization) and how does it compare to RLHF?
Model Answer
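DPO trains directly on preference data without a separate reward model or an RL loop. It reformulates the RLHF objective as a classification problem: given (chosen, rejected) response pairs, increase the likelihood of the chosen response relative to the rejected one, with a frozen reference model providing an implicit KL constraint. Advantages over RLHF: no reward model training, no PPO (which is often unstable and sensitive to hyperparameters), simpler pipeline, computationally cheaper. The objective: maximize log σ(β × [log(π(chosen)/π_ref(chosen)) - log(π(rejected)/π_ref(rejected))]), where π is the policy being trained, π_ref is the frozen reference model, and β controls how far the policy may drift from the reference. Used by Zephyr, Tülu 2, and Llama 3's post-training, among many open models (Llama 2 Chat, by contrast, used PPO-based RLHF with rejection sampling). PPO-based RLHF can still have an edge on some benchmarks, but DPO is often the practical choice.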
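The objective can be written out for a single preference pair. The sketch below assumes the inputs are summed token log-probabilities of each full response under the policy and the reference model; the function name and the default β = 0.1 are illustrative choices, not values taken from the text.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Negative DPO objective for one (chosen, rejected) pair; lower is better."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp        # log(pi/pi_ref) for chosen
    rejected_logratio = policy_rejected_logp - ref_rejected_logp  # log(pi/pi_ref) for rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed in a numerically stable way
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: margin is 0, so the loss is log(2).
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen response: loss drops below log(2).
loss_after_update = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

Minimizing this loss is the same as maximizing the log-sigmoid objective above. Because only log-ratios against the reference enter the loss, a policy that has not moved from the reference starts at log(2) regardless of the absolute log-probabilities.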
What is GRPO (Group Relative Policy Optimization) and how does it enable LLMs to reason?
Model Answer
GRPO, introduced in DeepSeekMath and used to train DeepSeek-R1, is a variant of PPO that eliminates the need for a separate value function model. For a given prompt, sample G responses, compute a reward for each, and normalize each reward by the group mean and standard deviation to obtain its advantage. The group mean acts as the baseline for variance reduction. Combined with a "thinking" format (the model generates reasoning before its answer), GRPO trains models to reason step by step on math and coding problems. DeepSeek reports that R1 performs comparably to OpenAI's o1 on math and coding benchmarks. Advantage over PPO: no value model, so a simpler pipeline, less memory, and often more stable training. Advantage over DPO: works with verifiable rewards (math correctness), not just pairwise preferences.
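The group-relative baseline is simple to compute. The sketch below shows only the advantage calculation for one prompt, not the full clipped policy-gradient update; the use of the population standard deviation and the `eps` constant are assumptions for the example, and implementations differ on both.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and standard deviation of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an implementation choice
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one math prompt: two correct (reward 1), two wrong (reward 0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct responses get a positive advantage and incorrect ones a negative advantage, with no learned critic involved. If every response in the group receives the same reward, all advantages are zero and the prompt contributes no gradient signal.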
Explain QLoRA and how it enables fine-tuning large models on consumer hardware.
Model Answer
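QLoRA (Quantized LoRA) combines: 1) 4-bit NormalFloat (NF4) quantization of the frozen base model (NF4 is information-theoretically optimal for normally distributed weights), 2) double quantization (quantizing the quantization constants themselves), 3) paged optimizers (optimizer states are paged to CPU memory during GPU memory spikes to avoid out-of-memory errors). LoRA adapters are trained in 16-bit while the base model stays in 4-bit; base weights are dequantized to 16-bit on the fly during the forward and backward passes. Result: a 65B model fine-tunes on a single 48GB GPU (vs more than 780GB for full 16-bit fine-tuning). Key hyperparameters: r (rank), alpha (scaling), and target modules (often q_proj and v_proj, although the QLoRA authors report that adapting all linear layers is needed to match full fine-tuning quality).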
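A short calculation shows why double quantization matters and why a 65B base model fits on a 48GB card. It is a sketch that assumes the block sizes reported for QLoRA (64 weights per block, 256 first-level constants per second-level block); the function names are hypothetical, and the estimate covers base weights only, not adapters, activations, or optimizer state.

```python
def quant_overhead_bits(block_size=64, double_quant=True, second_block_size=256):
    """Bits per parameter spent on storing quantization constants."""
    if not double_quant:
        return 32 / block_size  # one FP32 constant per block of weights
    # 8-bit constant per block, plus one FP32 constant per group of first-level constants
    return 8 / block_size + 32 / (block_size * second_block_size)

def base_weight_gb(n_params, bits_per_weight=4, **quant_kwargs):
    """Approximate storage for the quantized base weights, in GB (1e9 bytes)."""
    bits = bits_per_weight + quant_overhead_bits(**quant_kwargs)
    return n_params * bits / 8 / 1e9

plain_overhead = quant_overhead_bits(double_quant=False)   # 0.5 bits per parameter
double_overhead = quant_overhead_bits(double_quant=True)   # about 0.127 bits per parameter
weights_gb = base_weight_gb(65e9)                          # about 33.5 GB for a 65B model
```

Under these assumptions the quantized base weights take roughly 33.5GB, which leaves headroom on a 48GB GPU for the 16-bit adapters, their optimizer state, and activations.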
What is the difference between full fine-tuning and PEFT methods?
Model Answer
Full fine-tuning: update all model parameters. This requires massive GPU memory (e.g., a 7B model needs ~14GB in bfloat16 just to store weights; gradients and Adam optimizer states bring mixed-precision training to roughly 16 bytes per parameter, about 112GB before activations) but typically achieves the best performance. PEFT (Parameter-Efficient Fine-Tuning): update only a small fraction of parameters. Methods: LoRA (inject trainable low-rank matrices, typically ~0.1-1% of params), Prefix Tuning (prepend trainable vectors to the attention keys and values in each layer), Adapters (add small feed-forward layers inside transformer blocks), Prompt Tuning (optimize continuous prompt vectors at the input). PEFT keeps the base model weights frozen, which allows serving multiple fine-tuned versions from one base model.
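The memory and parameter-count gap can be checked with back-of-the-envelope arithmetic. The sketch assumes a common mixed-precision Adam setup (2 bytes for bf16 weights, 2 for gradients, 4 for an FP32 master copy, and 4 + 4 for the two Adam moments) and a hypothetical 7B-class shape of 32 layers with 4096-wide attention projections; exact figures vary by model and training framework.

```python
def full_finetune_gb(n_params, bytes_per_param=16):
    """Weights + gradients + FP32 master weights + Adam moments, excluding activations."""
    return n_params * bytes_per_param / 1e9

def lora_trainable_params(d_in, d_out, rank, n_layers, matrices_per_layer):
    """Each adapted d_in x d_out matrix adds A (rank x d_in) and B (d_out x rank)."""
    return n_layers * matrices_per_layer * rank * (d_in + d_out)

n_params = 7e9
weights_only_gb = n_params * 2 / 1e9   # bf16 weights alone: 14 GB
full_gb = full_finetune_gb(n_params)   # full fine-tuning state: 112 GB

# LoRA with rank 8 on two projection matrices (e.g. q_proj and v_proj) per layer.
lora_params = lora_trainable_params(4096, 4096, rank=8, n_layers=32, matrices_per_layer=2)
lora_fraction = lora_params / n_params
```

With these assumed shapes LoRA trains about 4.2 million parameters, roughly 0.06% of the model, so gradients and optimizer states are only needed for that small set.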
When is fine-tuning the wrong choice, and what should you do instead?
Model Answer
Fine-tuning is wrong when: (1) you need fresh or changing knowledge (use RAG instead), (2) you have fewer than ~500 examples (few-shot prompting will often do better), (3) you only need format compliance (JSON mode or structured outputs work better), (4) you can't collect a clean eval set (you'll be flying blind). Fine-tuning is right when: enforcing tone/style at scale, distilling a large model into a small one, adapting to a domain language (legal, medical) where prompting plateaus, reducing inference cost by enabling a smaller base model. Rule of thumb: try RAG + prompting first, and fine-tune only when you've hit a quality ceiling those can't fix.
What is GRPO and how did DeepSeek use it for reasoning?
Model Answer
Group Relative Policy Optimization is a PPO variant that drops the value (critic) network. For each prompt, the policy generates G responses, you compute the reward of each, and use the group mean as the baseline instead of a learned value function. Advantages over PPO: substantially less GPU memory (the critic is typically about as large as the policy), often more stable training, and no value-function bias. DeepSeek-R1 paired GRPO with verifiable, rule-based rewards (math correctness, code passing tests, plus a format check), so no learned reward model is needed. In DeepSeek-R1-Zero the model learned to "think before answering" without any human-labeled reasoning traces: longer reasoning emerged from rewarding correct final answers. The released R1 model added a small set of curated cold-start reasoning examples before RL.
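A verifiable reward of this kind can be a plain function rather than a learned model. The sketch below is hypothetical: the `<think>`/`<answer>` tag layout, the 0.1 format bonus, and exact string matching on the answer are assumptions chosen for illustration, not DeepSeek's actual reward code.

```python
import re

# Completion must be exactly: <think>...</think> followed by <answer>...</answer>
THINK_ANSWER = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)

def verifiable_reward(completion, expected_answer, format_bonus=0.1):
    """Rule-based reward: small bonus for following the format, 1.0 more for a correct answer."""
    match = THINK_ANSWER.match(completion.strip())
    if match is None:
        return 0.0  # wrong format: no reward, and the answer is not parsed
    answer = match.group(2).strip()
    correct = 1.0 if answer == expected_answer.strip() else 0.0
    return format_bonus + correct

good = verifiable_reward("<think>2 + 2 is 4</think><answer>4</answer>", "4")
wrong = verifiable_reward("<think>2 + 2 is 5</think><answer>5</answer>", "4")
unformatted = verifiable_reward("The answer is 4", "4")
```

Rewards like these feed directly into the group-mean baseline: within one prompt's group, responses that are both well-formed and correct score above the mean and are reinforced.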