Question 1

How do you prepare a dataset for instruction fine-tuning?

Accepted Answer

Format: convert each example to an instruction-following format (system prompt + user instruction + assistant response). Common formats: ChatML (<|im_start|>system ...<|im_end|>), Llama chat format, Alpaca format. Data quality checklist: remove duplicates, filter short or low-quality responses, ensure diverse topics, balance difficulty levels. Data sources: Alpaca (generated with OpenAI's text-davinci-003; a GPT-4 regenerated variant also exists), OpenAssistant (human-written conversations), FLAN (academic tasks), ShareGPT (user-shared ChatGPT conversations). Minimum size: roughly 1K examples for domain fine-tuning, 10K+ for general instruction following. Always evaluate on a held-out set.
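A minimal sketch of the formatting and cleaning steps, assuming each raw example is a dict with hypothetical "instruction" and "response" keys. The 20-character minimum is an arbitrary illustrative threshold, and real pipelines usually add near-duplicate detection on top of this exact-match check.

```python
def to_chatml(system: str, user: str, assistant: str) -> str:
    """Render one training example in ChatML-style markup."""
    turns = [("system", system), ("user", user), ("assistant", assistant)]
    return "\n".join(f"<|im_start|>{role}\n{text}<|im_end|>" for role, text in turns)


def clean(examples: list[dict], min_response_chars: int = 20) -> list[dict]:
    """Drop exact duplicates and very short responses (threshold is an assumption)."""
    seen = set()
    kept = []
    for ex in examples:
        key = (ex["instruction"].strip().lower(), ex["response"].strip().lower())
        if key in seen or len(ex["response"].strip()) < min_response_chars:
            continue
        seen.add(key)
        kept.append(ex)
    return kept


raw = [
    {"instruction": "Explain overfitting.",
     "response": "Overfitting is when a model memorizes training data and generalizes poorly."},
    {"instruction": "Explain overfitting.",
     "response": "Overfitting is when a model memorizes training data and generalizes poorly."},
    {"instruction": "Define recall.", "response": "TP/(TP+FN)"},  # too short, filtered out
]
dataset = clean(raw)
rendered = [
    to_chatml("You are a helpful assistant.", ex["instruction"], ex["response"])
    for ex in dataset
]
```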

Question 2

What is catastrophic forgetting and how do you mitigate it in fine-tuning?

Accepted Answer

Catastrophic forgetting: when a model is fine-tuned on new data, it loses previously learned knowledge or skills. Mitigation strategies: Elastic Weight Consolidation (EWC), which penalizes changes to weights that were important for earlier tasks; replay, which mixes original or general-purpose training data with the new data; LoRA, which trains only low-rank adapter matrices and leaves the base weights frozen, limiting drift; a low learning rate (1e-5 to 1e-4); few training epochs (1-3); and evaluation on general benchmarks alongside domain-specific ones to detect forgetting. For instruction fine-tuning: mix general instruction data (Alpaca, FLAN) with the domain-specific data.
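The replay idea can be sketched as a simple data-mixing helper. The function name and the 20 percent replay share are illustrative assumptions, not a recommended setting.

```python
import random


def mix_with_replay(domain, general, replay_fraction=0.2, seed=0):
    """Return domain examples plus sampled general examples.

    General examples make up about `replay_fraction` of the result,
    capped by how many general examples are available.
    """
    if not 0 <= replay_fraction < 1:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)
    n_general = round(len(domain) * replay_fraction / (1 - replay_fraction))
    n_general = min(n_general, len(general))
    mixed = list(domain) + rng.sample(list(general), n_general)
    rng.shuffle(mixed)
    return mixed


domain_data = [("domain", i) for i in range(80)]
general_data = [("general", i) for i in range(100)]
train_set = mix_with_replay(domain_data, general_data, replay_fraction=0.2)
```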

Question 3

What is reranking in RAG?

Accepted Answer

Reranking is a second-pass scoring of retrieved documents that improves relevance before they are passed to the LLM. Vector search returns documents that are similar to the query but not always the most relevant ones. A reranker such as Cohere Rerank or a cross-encoder rescores the top candidates (for example, the top 20 chunks) and keeps the best 3 to 5 for the LLM. This often improves answer quality noticeably, at the cost of extra latency.
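A sketch of the second-pass step. The `overlap_score` function is a toy stand-in for a real cross-encoder or reranking API (an assumption for illustration only); in practice `score_fn` would call the actual reranker.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 3) -> list[str]:
    """Rescore first-stage candidates and keep the top_n highest-scoring ones."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def overlap_score(query: str, doc: str) -> float:
    """Toy relevance score: fraction of query terms that appear in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


retrieved = [
    "billing and invoices",
    "password policy requirements",
    "how to reset your password by email",
]
best = rerank("reset password email", retrieved, overlap_score, top_n=2)
```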

Question 4

What is the difference between semantic search and keyword search, and when do you combine them?

Accepted Answer

Keyword search (BM25): exact term matching with TF-IDF-style weighting, fast, no embedding needed, great for precise queries. Semantic search: vector similarity, captures meaning and synonyms, no exact term matching needed. Hybrid search: combine both, typically as score = (1-α)×BM25 + α×cosine_sim (with BM25 scores normalized to a comparable range first), or use Reciprocal Rank Fusion (RRF) to merge the two ranked lists. Use hybrid when queries can be either precise ("GET /api/v2/users error 404") or semantic ("how do I handle authentication errors"), or when neither approach alone is sufficient. Elasticsearch supports hybrid search natively, as do Weaviate, Qdrant, and Pinecone.
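Both fusion options can be sketched in a few lines. The constant k=60 is the value commonly used for RRF; the min-max normalization of BM25 scores is one reasonable choice among several, and the document IDs and scores are made up for illustration.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over every ranked list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


def weighted_hybrid(bm25: dict[str, float], cosine: dict[str, float],
                    alpha: float = 0.5) -> dict[str, float]:
    """score = (1 - alpha) * normalized_bm25 + alpha * cosine_sim."""
    lo, hi = min(bm25.values()), max(bm25.values())
    span = (hi - lo) or 1.0  # avoid dividing by zero when all scores are equal
    docs = set(bm25) | set(cosine)
    return {
        d: (1 - alpha) * ((bm25.get(d, lo) - lo) / span) + alpha * cosine.get(d, 0.0)
        for d in docs
    }


fused = rrf([["a", "b", "c"], ["c", "a", "d"]])
scores = weighted_hybrid({"a": 12.0, "b": 6.0, "c": 0.0},
                         {"a": 0.2, "b": 0.9, "c": 0.5}, alpha=0.5)
```

RRF needs only ranks, so it sidesteps the score-scale mismatch that the weighted formula has to handle through normalization.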

Question 5

What is agent memory and what are the different types?

Accepted Answer

Short-term memory: the current conversation context window — shared with the LLM in each call. Long-term memory: external storage (databases, vector stores) that persists across sessions — agent retrieves relevant memories. Episodic memory: past interactions and experiences. Semantic memory: facts and knowledge. Working memory: intermediate results during a task. Implementation: store memories as embeddings in a vector DB, retrieve k most relevant using similarity search before each LLM call. Tools like MemGPT implement "self-editing" memory where the agent manages what to remember.
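A minimal in-memory sketch of the "store as embeddings, retrieve k most relevant" step. The `MemoryStore` class and the hand-written 3-dimensional vectors are assumptions for illustration; a real agent would use an embedding model and a vector database.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class MemoryStore:
    """Toy long-term memory: a list of (text, embedding) pairs."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, text: str, embedding: list[float]) -> None:
        self._items.append((text, list(embedding)))

    def retrieve(self, query_embedding: list[float], k: int = 2) -> list[str]:
        """Return the k stored texts most similar to the query embedding."""
        ranked = sorted(self._items,
                        key=lambda item: cosine(query_embedding, item[1]),
                        reverse=True)
        return [text for text, _ in ranked[:k]]


store = MemoryStore()
store.add("User prefers Python.", [1.0, 0.0, 0.0])
store.add("User's deadline is Friday.", [0.0, 1.0, 0.0])
store.add("User is learning Rust too.", [0.8, 0.0, 0.6])
recalled = store.retrieve([0.9, 0.1, 0.0], k=2)  # prepended to the prompt before the LLM call
```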

Question 6

What is the difference between full fine-tuning and PEFT?

Accepted Answer

Full fine-tuning updates all model parameters, which is expensive and requires large compute. PEFT, or Parameter-Efficient Fine-Tuning, updates only a small subset of parameters (or a small number of added ones) and often achieves comparable results at a fraction of the cost. LoRA, QLoRA, and prefix tuning are popular PEFT methods. PEFT is the standard approach in 2026.
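To make "a small subset of parameters" concrete, here is the arithmetic for LoRA on a single weight matrix. LoRA replaces the update to a d_out × d_in matrix with two factors of shapes d_out × r and r × d_in; the 4096 × 4096 size and rank 8 are illustrative choices, not values from any particular model card.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters in one LoRA adapter: B is (d_out x r), A is (r x d_in)."""
    return rank * (d_in + d_out)


full = 4096 * 4096                            # parameters updated by full fine-tuning
lora = lora_trainable_params(4096, 4096, 8)   # parameters updated by LoRA
fraction = lora / full
print(f"full={full:,} lora={lora:,} fraction={fraction:.4%}")
```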

Question 7

What is speculative decoding and how does it speed up inference?

Accepted Answer

Speculative decoding uses a small "draft" model to generate several candidate tokens quickly, then uses the larger target model to verify them in parallel. Because the target model can score multiple positions in one forward pass (verifying k tokens costs about as much as generating one), you get significant speedups when the draft model's predictions are often accepted. With the standard accept/reject sampling scheme, the output distribution matches that of the target model, so quality is preserved. Researchers at Google and DeepMind reported roughly 2-3x speedups. The technique is now widely used in production inference systems.
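A toy sketch of one draft-and-verify step, using the simpler greedy variant (accept a draft token only if it equals the target's greedy choice) rather than the full accept/reject sampling scheme. The two "models" are hypothetical functions over integer tokens; in a real system the target's checks at all positions come from a single batched forward pass.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """Return the tokens accepted in one speculative step (greedy variant)."""
    # 1) Draft model proposes k tokens autoregressively (cheap).
    draft, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2) Target model verifies the proposals position by position.
    accepted, ctx = [], list(prefix)
    for tok in draft:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # replace the first wrong token and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3) All k accepted: the target's extra prediction comes for free.
    accepted.append(target_next(ctx))
    return accepted


def target_next(ctx):
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    """Toy draft model: agrees with the target except after token 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


partial = speculative_step([0], draft_next, target_next, k=4)
full_accept = speculative_step([4], draft_next, target_next, k=2)
```

Every step yields at least one target-approved token, so the output equals what greedy decoding with the target alone would produce; the speedup comes from steps where several draft tokens are accepted at once.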

Question 8

What is DPO (Direct Preference Optimization) and how does it compare to RLHF?

Accepted Answer

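DPO trains directly from human preference data without a separate reward model or RL. It reformulates RLHF as a classification problem: given pairs of (chosen, rejected) responses, increase the likelihood of the chosen response relative to the rejected one, with an implicit KL constraint toward a reference model whose strength is set by β. Advantages over RLHF: no reward model training, no PPO (which can be unstable), simpler pipeline, computationally cheaper. The objective: maximize log σ(β × [log(π(y_w|x)/π_ref(y_w|x)) - log(π(y_l|x)/π_ref(y_l|x))]), where y_w is the chosen response and y_l the rejected one. Used by Zephyr, Tulu 2, and many other open models, and as part of Llama 3 post-training (Llama 2 Chat used PPO-based RLHF with rejection sampling instead). RLHF still has an edge on some benchmarks, but DPO is often the practical choice.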
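The objective above, written as a per-pair loss (the negative of the quantity being maximized). Inputs are summed log-probabilities of each full response under the policy and the reference model; the numeric values below are made up for illustration.

```python
import math


def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)) for one pair."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(m)) == softplus(-m); written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: margin is 0, loss is log(2).
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Policy prefers the chosen response more than the reference does: lower loss.
improved = dpo_loss(-5.0, -15.0, -10.0, -10.0, beta=0.1)
```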

Question 9

What is the feed-forward network in a transformer?

Accepted Answer

The feed-forward network (FFN) is a position-wise two-layer MLP applied to each token independently after the attention layer. Its hidden dimension is typically 4 times the model dimension. Many modern LLMs use a SwiGLU variant instead of ReLU, which adds a third (gating) projection and usually shrinks the hidden dimension to keep the parameter count similar. Interpretability research suggests the FFN layers store much of the model's factual knowledge.

Question 10

What is the HuggingFace Transformers library?

Accepted Answer

Hugging Face Transformers is the most popular library for working with pre-trained language models. It provides a unified API to load, fine-tune, and run inference on thousands of models. Key classes are AutoModelForCausalLM and AutoTokenizer, along with the pipeline helper. It integrates with PEFT for efficient fine-tuning and with Datasets for data loading.

Question 11

What are the different types of memory in AI agents?

Accepted Answer

Agents have four memory types. Sensory memory is the raw input. Short-term memory is the context window which is temporary. Long-term memory uses a vector database to persist information across sessions. Episodic memory stores past interactions and retrieves them when relevant. Production agents combine short-term context with long-term vector DB recall.

Question 12

What is Grouped Query Attention (GQA) and why is it used in modern LLMs?

Accepted Answer

GQA is a compromise between Multi-Head Attention (MHA) and Multi-Query Attention (MQA). In MHA: each head has its own Q, K, and V projections. In MQA: all query heads share a single K and V head. In GQA: query heads are grouped, and each group shares one K and V head, e.g., 32 query heads with 8 KV heads. Benefit: reduces KV-cache size during inference (a critical memory and bandwidth bottleneck), allowing faster inference with minimal quality loss. Used in Llama 2 70B, Llama 3, Mistral 7B, and Gemma 2. A typical configuration: 32 query heads and 8 KV heads (4:1 ratio).
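The KV-cache saving is easy to quantify. The cache holds one K and one V vector per layer, per KV head, per token, so its size scales linearly with the number of KV heads. The configuration below (32 layers, head dimension 128, 4096-token context, 2-byte values) is a hypothetical example, not a specific model's published setup.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_value: int = 2) -> int:
    """Size of the KV cache: 2 tensors (K and V) per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value


mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)  # 32 KV heads
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)   # 8 KV heads
mqa = kv_cache_bytes(n_layers=32, n_kv_heads=1, head_dim=128, seq_len=4096)   # 1 KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 1024**2:.0f} MiB per sequence")
```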

Question 13

What frameworks are used to build AI agents?

Accepted Answer

Popular agent frameworks include LangChain for general agents, LangGraph for stateful multi-agent systems, LlamaIndex for data-centric agents, AutoGen for multi-agent conversations, CrewAI for role-based agent teams, and Swarm for lightweight agents. LangGraph is recommended for production use in 2026.

Question 14

What is Constitutional AI and how does Anthropic use it?

Accepted Answer

Constitutional AI (CAI) is a training method where the model is guided by a set of principles (a "constitution") rather than just human feedback. The supervised phase: 1) Generate responses, 2) Have the AI critique its own responses against the constitution, 3) Revise based on the critiques, 4) Train on the revised responses. In the original method, a reinforcement learning phase follows: the AI compares pairs of responses against the constitution, and those AI-generated preferences train a preference model (RLAIF). This reduces reliance on human labelers for identifying harmful content. Anthropic uses CAI to train Claude, aiming to make it helpful, harmless, and honest without needing extensive human feedback on every harmful output.

Question 15

What is the difference between Pinecone and pgvector?

Accepted Answer

Pinecone is a fully managed cloud vector database with zero operational overhead. Best for teams without infrastructure expertise or needing to scale to billions of vectors easily. pgvector is a PostgreSQL extension that adds vector search to your existing Postgres database. Best for teams already using Postgres who want to avoid a new service. pgvector with HNSW index handles up to 50M vectors effectively.

Question 16

How do you scale an LLM application to 1 million users?

Accepted Answer

Scale LLM apps with these approaches: horizontal scaling with stateless LLM servers behind a load balancer; a semantic cache with Redis to reduce LLM calls by 30 to 40 percent; model routing to send simple queries to cheap models; async processing with message queues for non-realtime tasks; autoscaling GPU pods based on queue depth using Kubernetes and KEDA; and a CDN for static assets.

Question 17

What is GRPO (Group Relative Policy Optimization) and how does it enable LLMs to reason?

Accepted Answer

GRPO (introduced in DeepSeekMath and used to train DeepSeek-R1) is a variant of PPO that eliminates the need for a separate value function model. For a given prompt, sample G responses, compute a reward for each, and normalize each reward against the group: subtract the group mean and divide by the group standard deviation. The group mean acts as a baseline for variance reduction. Combined with a "thinking" format (the model generates reasoning before the answer), GRPO trains models to reason step by step on math and coding problems. DeepSeek-R1 reports performance comparable to OpenAI o1 on reasoning benchmarks. Advantage over PPO: no value model, so training is simpler, uses less memory, and is often more stable. Advantage over DPO: works with verifiable rewards (such as math correctness), not just pairwise human preferences.
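The group-relative normalization step can be sketched as follows. The use of the population standard deviation and the small `eps` term (to avoid division by zero when all rewards are equal) are assumptions of this sketch; the rewards are made-up 0/1 correctness scores.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Advantage of each sampled response relative to its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# G = 4 sampled answers to one math prompt; reward 1.0 if the final answer is correct.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])

# If every sample gets the same reward, the group carries no learning signal.
no_signal = group_advantages([1.0, 1.0, 1.0, 1.0])
```

Each token of a response is then trained with that response's advantage inside a PPO-style clipped objective, with no learned value model involved.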

Question 18

What is the difference between SFT and RLHF?

Accepted Answer

SFT, or Supervised Fine-Tuning, trains the model on labeled input-output pairs to teach it to follow instructions. RLHF uses human preference feedback to train a reward model, then uses RL to optimize the LLM against that reward model. SFT is the first step and RLHF is the alignment step. Both are used in training production LLMs.

Question 19

What is the champion-challenger pattern in MLOps?

Accepted Answer

Champion-challenger involves running a new model on a small percentage of traffic, typically 5 to 10 percent, while the current production model handles the rest. The new challenger model is evaluated on real traffic metrics. If it performs better, it becomes the new champion. This enables safe, gradual model rollouts.
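One common way to split traffic is deterministic hashing of a user ID, so a given user always sees the same model. This is a sketch under that assumption; the function name and the 5 percent share are illustrative.

```python
import hashlib


def assign_model(user_id: str, challenger_share: float = 0.05) -> str:
    """Deterministically route a user to the champion or the challenger."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "challenger" if bucket < challenger_share else "champion"


users = [f"user-{i}" for i in range(10000)]
share = sum(assign_model(u) == "challenger" for u in users) / len(users)
print(f"challenger share: {share:.1%}")
```

Hash-based assignment keeps the split stable across requests and servers without storing any per-user state.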

Question 20

What tools can AI agents use?

Accepted Answer

AI agents can use many tools, including web search APIs, code execution, calculators, database queries, file system operations, browser automation, and email and calendar tools. Tools are functions the agent can call. The agent decides which tool to use based on the task and interprets the tool output to plan its next steps.

Question 21

What is tool/function calling and how does it work in modern LLMs?

Accepted Answer

Tool calling (OpenAI function calling, Anthropic tool use): the model is given JSON schemas for available functions and can choose to call them with structured arguments. Mechanism: 1) Include tool schemas in the API request, 2) Model outputs a tool call instead of text when it wants to use a tool, 3) You execute the tool, 4) Return the result in the next message, 5) Model uses the result to generate a response. Parallel tool calling: recent models (for example, GPT-4 Turbo and later) can request multiple tool calls in a single turn, returned as a list of tool_calls. This is more reliable than asking the model to describe function calls in free text, because the arguments follow a declared JSON schema and are easy to parse and validate.
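The five-step loop can be sketched without any vendor SDK. Here `fake_model` is a hypothetical stand-in for an LLM API that requests one tool call and then answers; the message shapes loosely imitate common chat APIs but are not any provider's exact schema, and `get_weather` is a made-up tool.

```python
import json

# Tool registry: name -> Python function. The weather values are hard-coded.
TOOLS = {"get_weather": lambda city: {"city": city, "temp_c": 21}}


def fake_model(messages):
    """Hypothetical LLM stand-in: asks for a tool once, then answers in text."""
    last = messages[-1]
    if last["role"] == "tool":
        result = json.loads(last["content"])
        return {"role": "assistant",
                "content": f"It is {result['temp_c']} C in {result['city']}."}
    return {"role": "assistant",
            "tool_calls": [{"name": "get_weather",
                            "arguments": json.dumps({"city": "Paris"})}]}


def run(user_text: str) -> str:
    messages = [{"role": "user", "content": user_text}]
    while True:
        reply = fake_model(messages)          # steps 1-2: model sees tools, may call one
        messages.append(reply)
        calls = reply.get("tool_calls")
        if not calls:
            return reply["content"]           # step 5: final text answer
        for call in calls:                    # handles parallel calls as a list
            args = json.loads(call["arguments"])
            result = TOOLS[call["name"]](**args)                      # step 3: execute
            messages.append({"role": "tool",
                             "content": json.dumps(result)})          # step 4: return result


answer = run("What's the weather in Paris?")
```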

Question 22

What is chain-of-thought prompting and when should you use it?

Accepted Answer

Chain-of-thought (CoT) prompting encourages the model to show its reasoning steps before giving a final answer. Example: "Q: Roger has 5 tennis balls... Let's think step by step: Roger starts with 5 balls, buys 2 cans of 3 balls each, so 2×3=6 new balls, 5+6=11 total. A: 11." CoT significantly improves performance on math, reasoning, and logical tasks. Use when: the task requires multi-step reasoning, arithmetic, or logical deduction. Zero-shot CoT: just append "Let's think step by step" to your prompt.

Question 23

What is the ReAct framework for AI agents?

Accepted Answer

ReAct (Reasoning + Acting) interleaves reasoning traces and actions. The agent: 1) Reasons about what to do next (Thought), 2) Takes an action (Action + tool call), 3) Observes the result (Observation), 4) Repeats until task is complete. Advantages: reasoning explains agent behavior, allows course correction. Key components: a prompt that describes available tools and the ReAct format, tool implementations (search, calculator, code executor), a stopping condition.
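A compact sketch of the Thought / Action / Observation loop with a stopping condition. The `scripted_llm` function is a hypothetical stand-in that returns canned steps (a real agent would call an LLM with the transcript), and the `Action: name[args]` text format and `multiply` tool are illustrative choices.

```python
import re

TOOLS = {"multiply": lambda a, b: a * b}


def scripted_llm(transcript: str) -> str:
    """Canned stand-in for an LLM: one tool call, then a final answer."""
    if "Observation:" not in transcript:
        return "Thought: I need the number of new balls.\nAction: multiply[2, 3]"
    return "Thought: 5 + 6 = 11.\nFinal Answer: 11"


def react(question: str, llm, tools, max_steps: int = 5):
    transcript = f"Question: {question}"
    for _ in range(max_steps):                    # stopping condition: step budget
        step = llm(transcript)                    # Thought (+ Action or Final Answer)
        transcript += "\n" + step
        final = re.search(r"Final Answer:\s*(.+)", step)
        if final:                                 # stopping condition: task complete
            return final.group(1).strip(), transcript
        action = re.search(r"Action:\s*(\w+)\[(.*)\]", step)
        if action is None:
            break
        name, raw_args = action.group(1), action.group(2)
        args = [int(a) for a in raw_args.split(",")]
        transcript += f"\nObservation: {tools[name](*args)}"   # Observation
    return None, transcript


answer, trace = react("Roger has 5 balls and buys 2 cans of 3. How many now?",
                      scripted_llm, TOOLS)
```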

Question 24

How do you evaluate a RAG system?

Accepted Answer

A common way to evaluate a RAG system is the RAGAS framework, which centers on 4 metrics. Faithfulness measures whether the answer is grounded in the retrieved context. Answer Relevance measures whether the answer addresses the question. Context Recall measures whether retrieval found all the relevant information. Context Precision measures whether the retrieved chunks are actually relevant. Use an LLM as a judge for automated evaluation at scale.
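To build intuition for the two retrieval metrics, here is a simplified set-based version computed from hand-labeled relevant chunk IDs. This is not how RAGAS computes them (RAGAS relies on LLM judgments and rank-aware scoring); the function name and chunk IDs are assumptions for illustration.

```python
def retrieval_metrics(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) of retrieved chunk IDs against labeled relevant IDs."""
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# 4 chunks retrieved; 3 chunks labeled relevant, one of which was missed.
precision, recall = retrieval_metrics(["d1", "d2", "d3", "d4"], {"d1", "d3", "d5"})
print(f"context precision={precision:.2f} context recall={recall:.2f}")
```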

AI/ML Interview Question Bank