Fine-Tuning Interview Questions — LoRA, RLHF, DPO | AmanAI Lab

mid

How do you prepare a dataset for instruction fine-tuning?

Model Answer

Format: convert every example to an instruction-following chat format: system prompt + user instruction + assistant response. Common formats: ChatML (<|im_start|>system ...<|im_end|>), Llama chat format, Alpaca format; use the template the base model's tokenizer expects. Data quality checklist: remove duplicates, filter short/low-quality responses, ensure diverse topics, balance difficulty levels. Data sources: Alpaca (generated with OpenAI's text-davinci-003; a GPT-4-generated variant also exists), OpenAssistant (human-written conversations), FLAN (academic tasks), ShareGPT (user-shared ChatGPT conversations). Rough minimum size: ~1K examples for domain fine-tuning, 10K+ for general instruction following. Always evaluate on a held-out set.
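A minimal sketch of the formatting and cleaning steps, using only the Python standard library. The field names ("instruction", "response"), the 20-character length threshold, and the helper names are illustrative assumptions, not part of any specific dataset or library. Real pipelines usually add near-duplicate detection and use the tokenizer's own chat template.

```python
import random
import re


def to_chatml(system: str, user: str, assistant: str) -> str:
    """Render one training example in ChatML-style markup."""
    turns = [("system", system), ("user", user), ("assistant", assistant)]
    return "\n".join(
        f"<|im_start|>{role}\n{text.strip()}<|im_end|>" for role, text in turns
    )


def _dedup_key(example: dict) -> str:
    """Lowercased, whitespace-normalized instruction used for exact-duplicate checks."""
    return re.sub(r"\s+", " ", example["instruction"].strip().lower())


def clean(examples: list, min_response_chars: int = 20) -> list:
    """Drop very short responses and repeated instructions (first occurrence wins)."""
    seen = set()
    kept = []
    for ex in examples:
        if len(ex["response"].strip()) < min_response_chars:
            continue  # hypothetical quality filter: response too short
        key = _dedup_key(ex)
        if key in seen:
            continue  # duplicate instruction
        seen.add(key)
        kept.append(ex)
    return kept


def split_holdout(examples: list, holdout_fraction: float = 0.1, seed: int = 0):
    """Shuffle deterministically and return (train, held_out)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]
```

Splitting after deduplication matters: if duplicates survive, the same example can land in both the training set and the held-out set and inflate the eval score.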

mid

What is catastrophic forgetting and how do you mitigate it in fine-tuning?

Model Answer

Catastrophic forgetting: fine-tuning on new data overwrites previously learned knowledge, so performance on earlier tasks drops. Mitigation strategies: Elastic Weight Consolidation (EWC), which penalizes changes to weights that were important for earlier tasks; replay, which mixes original (or similar general-purpose) training data with the new data; LoRA, which trains only low-rank matrices and keeps the base weights frozen, reducing (but not eliminating) forgetting; a low learning rate (roughly 1e-5 to 1e-4, with the lower end typical for full fine-tuning); few training epochs (1-3); and evaluation on general benchmarks alongside the domain-specific eval, so that forgetting is actually detected. For instruction fine-tuning: include a mix of general instruction data (Alpaca, FLAN) with domain-specific data.
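Two of the strategies above can be sketched in a few lines of plain Python. The EWC penalty is the usual quadratic form (lambda/2) * sum_i F_i * (theta_i - theta*_i)^2, where F_i is a per-weight importance estimate (diagonal Fisher information). The replay helper and its 20% default are illustrative assumptions, not a recommended ratio.

```python
import random


def ewc_penalty(params, anchor_params, fisher, lam=1.0):
    """EWC regularizer: (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2.

    params        -- current weights (flat list of floats)
    anchor_params -- weights before fine-tuning (theta_star)
    fisher        -- per-weight importance estimates (diagonal Fisher)
    """
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )


def replay_mix(domain, general, general_fraction=0.2, seed=0):
    """Return a shuffled training set in which about `general_fraction`
    of the examples are general-purpose replay examples (hypothetical helper)."""
    rng = random.Random(seed)
    n_general = round(len(domain) * general_fraction / (1 - general_fraction))
    n_general = min(n_general, len(general))
    mixed = list(domain) + rng.sample(list(general), n_general)
    rng.shuffle(mixed)
    return mixed
```

In the penalty, a weight with high importance and a large drift dominates the sum, which is what pulls the optimizer back toward the pre-fine-tuning value for that weight only.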

senior

What is DPO (Direct Preference Optimization) and how does it compare to RLHF?

Model Answer

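DPO trains directly on preference data without a separate reward model or an RL loop. It reformulates RLHF as a classification problem: given pairs of (chosen, rejected) responses, increase the likelihood of the chosen response relative to the rejected one, with an implicit KL constraint that keeps the policy close to a frozen reference model. Advantages over RLHF: no reward model training, no PPO (which is sensitive to hyperparameters and can be unstable), a simpler pipeline, and lower compute cost. The objective: maximize log σ(β × [log(π(y_w|x) / π_ref(y_w|x)) - log(π(y_l|x) / π_ref(y_l|x))]), where y_w is the chosen response, y_l is the rejected response, π is the policy being trained, π_ref is the reference model (usually the SFT checkpoint), and β controls how far the policy may drift from the reference. Used by Zephyr, Tulu 2, Llama 3's post-training, and many other open models (Llama 2-Chat, by contrast, used RLHF with rejection sampling and PPO). Online RLHF still has an edge on some benchmarks, but DPO is often the practical choice.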
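The DPO objective for a single preference pair can be written out directly. This is a sketch in plain Python that takes summed sequence log-probabilities as inputs; the function and argument names are illustrative, and a real implementation would compute these log-probabilities from the policy and reference models in batches.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Negative DPO objective for one (chosen, rejected) pair.

    Each argument is log p(response | prompt) summed over response tokens,
    under either the trainable policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)
```

When the policy equals the reference, both log-ratios are zero and the loss is log 2. The loss falls below that only when the policy raises the chosen response relative to the rejected one by more than the reference does.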

senior

What is GRPO (Group Relative Policy Optimization) and how does it enable LLMs to reason?

Model Answer

GRPO, introduced in DeepSeekMath and later used to train DeepSeek-R1, is a variant of PPO that eliminates the need for a separate value function model. For a given prompt, sample G responses, compute a reward for each, and normalize each reward by the group mean and standard deviation to get its advantage. This group comparison acts as a baseline for variance reduction. Combined with a "thinking" format (the model generates reasoning before the answer), GRPO trains models to reason step-by-step on math and coding problems. DeepSeek-R1 reports performance comparable to OpenAI's o1 on math and coding benchmarks using this recipe. Advantage over PPO: no value model, so the pipeline is simpler, uses less memory, and tends to train more stably. Advantage over DPO: it is online and works with verifiable rewards (math correctness), not just offline preference pairs.
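The group-relative baseline is small enough to show in full. The sketch below computes one advantage per sampled response. Using the population standard deviation and a small epsilon for all-equal groups are assumptions made here for a runnable example.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group: (r - mean) / (std + eps).

    `rewards` holds the scalar rewards of the G responses sampled for ONE prompt.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]
```

If every response in the group gets the same reward, all advantages are zero and that prompt contributes no gradient. This is one reason prompts that are always solved or never solved add little to training.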

senior

Explain QLoRA and how it enables fine-tuning large models on consumer hardware.

Model Answer

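QLoRA (Quantized LoRA) combines: 1) 4-bit NormalFloat (NF4) quantization of the frozen base model (NF4 is designed to be information-theoretically optimal for normally distributed weights), 2) double quantization (quantizing the quantization constants themselves), 3) paged optimizers (optimizer states are paged out to CPU RAM during GPU memory spikes instead of causing an out-of-memory error). The base weights are stored in 4-bit and dequantized to 16-bit for compute, while the LoRA adapters are trained in 16-bit. Result: a 65B model fine-tunes on a single 48GB GPU (vs. more than 780GB for full 16-bit fine-tuning), and a 33B model fits on a 24GB consumer GPU. Key hyperparameters: r (rank), alpha (scaling), target modules (often q_proj and v_proj by default, although the QLoRA authors found that adapting all linear layers was needed to match full fine-tuning).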
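A back-of-the-envelope calculation shows why double quantization matters. It assumes the block sizes described in the QLoRA paper (one 32-bit constant per 64 weights, and 8-bit second-level constants in blocks of 256). The function names are hypothetical, and the result covers base weights only, not adapters, activations, or optimizer state.

```python
def quant_constant_bits_per_param(block_size=64, constant_bits=32,
                                  double_quant=False,
                                  dq_constant_bits=8, dq_block_size=256):
    """Overhead (bits per weight) of storing the quantization constants."""
    if not double_quant:
        return constant_bits / block_size
    # First-level constants stored in 8 bits, plus one 32-bit
    # second-level constant per dq_block_size first-level constants.
    return dq_constant_bits / block_size + constant_bits / (block_size * dq_block_size)


def base_weight_gb(n_params, weight_bits=4, **quant_kwargs):
    """Approximate storage for the frozen base weights, in GB (1e9 bytes)."""
    bits = weight_bits + quant_constant_bits_per_param(**quant_kwargs)
    return n_params * bits / 8 / 1e9


if __name__ == "__main__":
    n = 65e9
    print(f"16-bit weights:          {n * 16 / 8 / 1e9:.1f} GB")
    print(f"NF4, single quant:       {base_weight_gb(n):.1f} GB")
    print(f"NF4, double quant:       {base_weight_gb(n, double_quant=True):.1f} GB")
```

Under these assumptions the constants cost 0.5 bits per weight without double quantization and about 0.127 bits with it, which saves roughly 3GB on a 65B model.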

mid

What is the difference between full fine-tuning and PEFT methods?

Model Answer

Full fine-tuning: update all model parameters. This requires massive GPU memory (e.g., a 7B model needs ~14GB in bfloat16 just to store the weights, and roughly 16 bytes per parameter, ~112GB, once gradients and Adam optimizer states are included in mixed-precision training, before activations), but it generally has the highest quality ceiling. PEFT (Parameter-Efficient Fine-Tuning): update only a small fraction of parameters. Methods: LoRA (inject trainable low-rank matrices, typically ~0.1-1% of params), Prefix Tuning (prepend trainable vectors to the attention keys and values at every layer), Adapters (insert small bottleneck feed-forward layers inside each transformer block), Prompt Tuning (optimize continuous prompt vectors at the input layer). PEFT keeps the base model weights frozen, which allows serving multiple fine-tuned versions from one base model by swapping small adapters.
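The memory gap can be checked with simple arithmetic. The sketch below assumes mixed-precision Adam at 16 bytes per parameter (2 for bf16 weights, 2 for gradients, 4 for fp32 master weights, 4 + 4 for the two Adam moments) and a hypothetical Llama-7B-like shape (32 layers, hidden size 4096, LoRA on two square projections per layer). Activations are ignored.

```python
def full_finetune_gb(n_params, bytes_per_param=16):
    """Weights + gradients + Adam state for mixed-precision training, in GB."""
    return n_params * bytes_per_param / 1e9


def lora_params_per_matrix(d_in, d_out, rank):
    """A LoRA update B @ A has A of shape (rank, d_in) and B of shape (d_out, rank)."""
    return rank * (d_in + d_out)


def lora_trainable_params(n_layers=32, hidden=4096, rank=8, matrices_per_layer=2):
    """Trainable LoRA parameters for an assumed Llama-7B-like shape."""
    return n_layers * matrices_per_layer * lora_params_per_matrix(hidden, hidden, rank)


if __name__ == "__main__":
    n = 7e9
    trainable = lora_trainable_params()
    print(f"bf16 weights only:   {n * 2 / 1e9:.0f} GB")
    print(f"full fine-tuning:    {full_finetune_gb(n):.0f} GB (before activations)")
    print(f"LoRA trainable:      {trainable:,} params ({100 * trainable / n:.3f}% of 7B)")
```

With rank 8 on two projections per layer this gives about 4.2M trainable parameters, well under 0.1% of the model. Higher ranks or more target modules move the fraction toward 1%.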

mid

When is fine-tuning the wrong choice, and what should you do instead?

Model Answer

Fine-tuning is the wrong choice when: (1) you need fresh or changing knowledge: use RAG instead; (2) you have fewer than ~500 examples: few-shot prompting will usually do as well or better; (3) you only need format compliance: JSON mode or structured outputs work better; (4) you can't collect a clean eval set: you'll be flying blind. Fine-tuning is the right choice when: enforcing tone/style at scale, distilling a large model into a small one, adapting to a domain language (legal, medical) where prompting plateaus, or reducing inference cost by enabling a smaller base model. Rule of thumb: try RAG + prompting first, and fine-tune only when you've hit a quality ceiling those can't fix.

senior

What is GRPO and how did DeepSeek use it for reasoning?

Model Answer

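Group Relative Policy Optimization is a PPO variant that drops the value (critic) network. For each prompt, the policy generates G responses, you compute the reward of each, and use the group mean (and standard deviation) as a baseline instead of a learned value function. Advantages over PPO: substantially less GPU memory, since the critic is usually about as large as the policy; a simpler training loop; and no value-function estimation error. DeepSeek-R1 paired GRPO with verifiable, rule-based rewards (math correctness, code passing tests), so no learned reward model is needed for those tasks. In DeepSeek-R1-Zero, the model learned to "think before answering" without any human-labeled reasoning traces: a format reward enforces the thinking tags, while the long step-by-step reasoning itself emerges from rewarding correct final answers. The released DeepSeek-R1 adds a small cold-start SFT stage before RL.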
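A verifiable reward can be as simple as a string check. The sketch below shows a rule-based reward with an accuracy term and a format term. The tag names, the exact-match comparison, and the equal weighting of the two terms are illustrative assumptions, and real math graders normalize answers before comparing.

```python
import re

# Expected layout (an assumption of this sketch): reasoning, then the final answer.
THINK_ANSWER = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the think-then-answer layout, else 0.0."""
    return 1.0 if THINK_ANSWER.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the extracted final answer exactly matches the reference, else 0.0."""
    match = THINK_ANSWER.match(completion.strip())
    if match is None:
        return 0.0  # no parseable answer, so nothing to verify
    return 1.0 if match.group(2).strip() == reference.strip() else 0.0


def total_reward(completion: str, reference: str) -> float:
    """Sum of the two rule-based terms (equal weights are an assumption)."""
    return accuracy_reward(completion, reference) + format_reward(completion)
```

Because the reward depends only on the final answer and the layout, nothing in it scores the reasoning text itself, so the policy is free to discover whatever intermediate steps lead to correct answers.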
