RAG Interview Questions — Retrieval-Augmented Generation Guide | AmanAI Lab

senior

How do you evaluate a RAG system end-to-end?

Model Answer

Evaluate at two levels. Component level: retrieval metrics (NDCG, MRR, Precision@K, Recall@K) measure whether the right chunks come back and in what order; generation metrics measure what the model does with them. The RAGAS framework spans both sides: Faithfulness (are the answer's claims grounded in the retrieved context?) and Answer Relevancy (does the answer address the question?) score generation, while Context Precision (are the relevant chunks ranked near the top of what was retrieved?) and Context Recall (does the retrieved context cover the ground-truth answer?) score retrieval. System level: end-to-end QA accuracy, human evaluation, and LLM-as-judge. Also track latency, cost per query, and hallucination rate.
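The retrieval metrics named above are simple to compute once you have a ranked list of document IDs and a set of known-relevant IDs per query. The following is a minimal sketch with binary relevance; the function names and the sample IDs are illustrative assumptions, not part of any particular library.

```python
import math

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant_ids:
        return 0.0
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved.
    MRR is the mean of this value over all evaluation queries."""
    for rank, d in enumerate(ranked_ids, start=1):
        if d in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, d in enumerate(ranked_ids[:k], start=1)
        if d in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Made-up example: one query, five retrieved IDs, three relevant IDs.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}
p5 = precision_at_k(ranked, relevant, 5)
r5 = recall_at_k(ranked, relevant, 5)
rr = reciprocal_rank(ranked, relevant)
ndcg5 = ndcg_at_k(ranked, relevant, 5)
```

In this example two of the five results are relevant (Precision@5 = 0.4), two of the three relevant documents were found (Recall@5 = 2/3), and the first hit is at rank 2 (reciprocal rank 0.5).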

senior

What is HyDE (Hypothetical Document Embeddings) and when should you use it?

Model Answer

HyDE uses an LLM to generate a hypothetical document that would answer the query, then retrieves using the embedding of that hypothetical document rather than the embedding of the query itself. The idea: a drafted answer sits closer to real answers in the vector space than the question does. Use it when queries are short or abstract while documents are long and detailed, or more generally when query style and document style differ sharply. Downsides: it adds an LLM call before retrieval (extra latency and cost), and a hallucinated draft can steer retrieval toward the wrong documents.
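The flow can be sketched in a few lines. This is an illustration under loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, `generate_hypothetical_doc` is a hypothetical stub returning a canned draft in place of an LLM call, and the corpus is made up.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def generate_hypothetical_doc(query):
    """Hypothetical stub for the LLM call that drafts a plausible answer.
    It ignores the query and returns a canned draft so the sketch runs offline."""
    return ("HNSW builds a layered proximity graph so approximate nearest "
            "neighbor search visits only a small part of the index.")

def hyde_retrieve(query, corpus, top_k=1):
    hypo = generate_hypothetical_doc(query)  # step 1: draft an answer
    hypo_vec = embed(hypo)                   # step 2: embed the draft, not the query
    ranked = sorted(corpus, key=lambda doc: cosine(hypo_vec, embed(doc)), reverse=True)
    return ranked[:top_k]                    # step 3: return the nearest real documents

corpus = [
    "HNSW is a graph index: a layered proximity graph used for approximate nearest neighbor search.",
    "Fine-tuning updates model weights on task-specific examples.",
    "BM25 ranks documents by term frequency and inverse document frequency.",
]
result = hyde_retrieve("why is vector search fast?", corpus)
```

The short question shares almost no vocabulary with the HNSW passage, while the drafted answer shares a lot, which is the mismatch HyDE is meant to bridge.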

mid

What chunking strategies exist for RAG and when do you use each?

Model Answer

Fixed-size: split by character or token count with overlap (e.g., 512 tokens with a 50-token overlap). Simple, and a good baseline. Sentence-based: split on sentence boundaries, which preserves semantic coherence. Paragraph-based: preserves topical coherence, but chunk size varies. Recursive: try natural boundaries in order (paragraphs → sentences → words) until each piece fits the size limit. Semantic chunking: embed sentences and split where similarity between neighbors drops; computationally expensive, but it tends to give the most coherent chunks. Document-specific: parse tables and headers into their own chunks. Rule of thumb: chunk size should match the granularity of your queries.
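As a concrete baseline, fixed-size chunking with overlap can be sketched as below. The function name `chunk_fixed` is an assumption for illustration, and the demo splits on whitespace only to stay dependency-free; a real pipeline would count tokens with the tokenizer of its embedding model.

```python
def chunk_fixed(tokens, size=512, overlap=50):
    """Split a token list into windows of `size` tokens, where consecutive
    windows share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks

# Whitespace "tokens" and tiny sizes, for illustration only.
tokens = "a b c d e f g h i j".split()
chunks = chunk_fixed(tokens, size=4, overlap=1)
# chunks: [a b c d], [d e f g], [g h i j]
```

The overlap means a sentence that straddles a boundary still appears intact in at least one chunk, at the cost of indexing some tokens twice.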

mid

When should you use a graph database instead of a vector DB for retrieval?

Model Answer

Use graph DBs (Neo4j, Memgraph) when relationships between entities matter as much as semantic similarity, e.g., "find regulations that cite this section and affect product X". GraphRAG (Microsoft) uses an LLM to extract entities and relationships from documents during indexing, groups them into communities, and summarizes each community; at query time it answers from those summaries and from the neighborhoods of matched entities. Use vector DBs when queries are mostly semantic ("anything similar to this"). Hybrid: vector search for retrieval, graph for filtering or expansion. A graph adds engineering complexity and slower writes, so it is justified only when entity relationships are the actual signal users need.
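To see why the regulation example is a relationship query rather than a similarity query, here is a minimal in-memory sketch. The edge list, relation names, and IDs are all made up for illustration; a graph database would express the same thing as a pattern match over stored edges.

```python
from collections import defaultdict

# Hypothetical (source, relation, target) triples extracted at indexing time.
edges = [
    ("REG-12", "CITES", "SEC-4.2"),
    ("REG-30", "CITES", "SEC-4.2"),
    ("REG-12", "AFFECTS", "Product X"),
    ("REG-30", "AFFECTS", "Product Y"),
    ("REG-44", "AFFECTS", "Product X"),
]

# Index: (relation, target) -> set of sources that have that edge.
incoming = defaultdict(set)
for src, rel, dst in edges:
    incoming[(rel, dst)].add(src)

def regulations_citing_and_affecting(section, product):
    """Regulations that both cite `section` and affect `product`."""
    citing = incoming.get(("CITES", section), set())
    affecting = incoming.get(("AFFECTS", product), set())
    return sorted(citing & affecting)

matches = regulations_citing_and_affecting("SEC-4.2", "Product X")
# matches == ["REG-12"]
```

An embedding search could surface all three regulations as "similar", but only the intersection of two explicit relations isolates the one that satisfies both conditions.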

mid

What is re-ranking in RAG and which models are used for it?

Model Answer

Re-ranking is a second, precision-oriented stage. The first stage retrieves top-K candidates quickly (e.g., an HNSW index with cosine similarity); the second stage uses a more expensive cross-encoder to score each (query, document) pair more accurately. Cross-encoders process the query and document together with full attention across both, whereas bi-encoders encode them separately and compare the vectors. Models: Cohere Rerank, BGE Reranker, cross-encoder/ms-marco-MiniLM-L-6-v2. Typical pipeline: retrieve top-50 with vector search → rerank to top-5 → pass to the LLM. Re-ranking typically adds on the order of 50-150 ms, depending on the model, hardware, and number of candidates, but it can noticeably improve answer quality on complex queries.
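The two-stage shape can be sketched without any model at all. In the sketch below both scorers are toy assumptions: `first_stage` counts query-word occurrences (cheap, order-blind) and `toy_pair_score` counts shared word bigrams as a hypothetical stand-in for a cross-encoder that reads query and document together. The documents are made up.

```python
def first_stage(query, docs, k):
    """Fast, recall-oriented stage: score by how often query words occur in the doc."""
    q_words = set(query.lower().split())

    def score(doc):
        return sum(1 for w in doc.lower().split() if w in q_words)

    return sorted(docs, key=score, reverse=True)[:k]

def toy_pair_score(query, doc):
    """Hypothetical stand-in for a cross-encoder: looks at query and doc
    together and counts shared word bigrams, so word order matters."""
    def bigrams(text):
        words = text.lower().split()
        return set(zip(words, words[1:]))

    return len(bigrams(query) & bigrams(doc))

def retrieve_and_rerank(query, docs, retrieve_k=50, final_k=5, pair_score=toy_pair_score):
    candidates = first_stage(query, docs, retrieve_k)               # wide, cheap
    reranked = sorted(candidates, key=lambda d: pair_score(query, d), reverse=True)
    return reranked[:final_k]                                       # narrow, precise

docs = [
    "your password your account your password policy",
    "reset your password from the settings page",
    "how to bake bread",
]
query = "how to reset your password"
candidates = first_stage(query, docs, k=2)
best = retrieve_and_rerank(query, docs, retrieve_k=2, final_k=1)
```

Here the first stage ranks the keyword-stuffed policy document first, and the pair scorer promotes the document that actually contains "reset your password". In a real system `pair_score` would wrap one of the reranker models listed above.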

mid

What is the "lost in the middle" problem in RAG?

Model Answer

Research (Liu et al., 2023, "Lost in the Middle") shows that LLMs use information at the beginning and end of their context window more reliably than information in the middle. In RAG, if the relevant chunk lands in the middle of a long context, the model may miss it. Solutions: reorder retrieved chunks so the most relevant sit at the start and end, use context compression to drop less relevant chunks, choose a model that performs well on long-context benchmarks, and limit the context to the most relevant chunks only.
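One way to implement the reordering mitigation is to alternate chunks between the front and the back of the context. This is a minimal sketch that assumes the input is already sorted most-relevant-first; the function name is illustrative.

```python
def reorder_for_edges(chunks_by_relevance):
    """Given chunks sorted most-relevant-first, alternate them between the
    front and the back so the weakest chunks end up in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ordered = reorder_for_edges(["c1", "c2", "c3", "c4", "c5"])
# ordered == ["c1", "c3", "c5", "c4", "c2"]
```

The best chunk (c1) opens the context, the second best (c2) closes it, and the least relevant (c5) sits in the middle where a miss costs the least.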

senior

What is the difference between naive RAG, advanced RAG, and modular RAG?

Model Answer

Naive RAG: a fixed three-step pipeline (index, retrieve, generate). Typical problems: poor retrieval precision and recall, lost-in-the-middle effects, and redundant context. Advanced RAG adds optimizations around retrieval: pre-retrieval (query rewriting, HyDE), retrieval itself (hybrid search, reranking), and post-retrieval (context compression, reordering). Modular RAG treats the system as configurable modules, so any retriever, reranker, or generator can be swapped, added, or rearranged. Examples: FLARE (iterative retrieval), Fusion-in-Decoder, Self-RAG (the model decides when to retrieve).
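The "swappable modules" idea can be illustrated with plain callables. In this sketch every name (`RagPipeline`, `static_retriever`, `shortest_first`, `echo_generator`) is a hypothetical stand-in chosen so the code runs without external services; it is not taken from any framework.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Retriever = Callable[[str], List[str]]
Reranker = Callable[[str, List[str]], List[str]]
Generator = Callable[[str, List[str]], str]

@dataclass
class RagPipeline:
    retriever: Retriever
    generator: Generator
    reranker: Optional[Reranker] = None  # optional stage; None gives the naive pipeline

    def run(self, query: str) -> str:
        docs = self.retriever(query)
        if self.reranker is not None:
            docs = self.reranker(query, docs)
        return self.generator(query, docs)

def static_retriever(query):
    """Stub retriever: ignores the query and returns a fixed candidate list."""
    return ["RAG grounds answers in retrieved text.", "Bread needs yeast."]

def shortest_first(query, docs):
    """Stub reranker: any function with this signature can be plugged in."""
    return sorted(docs, key=len)

def echo_generator(query, docs):
    """Stub generator: echoes the top document instead of calling an LLM."""
    return f"{query} -> {docs[0]}"

naive = RagPipeline(static_retriever, echo_generator)
advanced = RagPipeline(static_retriever, echo_generator, reranker=shortest_first)
naive_out = naive.run("q")
advanced_out = advanced.run("q")
```

Because each stage is just a function with a fixed signature, adding, removing, or replacing one does not touch the others, which is the property the modular framing is after.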

mid

What is Retrieval-Augmented Generation (RAG) and why is it used?

Model Answer

RAG combines a retrieval system with a generative LLM. The pipeline: 1) Query → retrieve relevant documents from a knowledge base (typically via vector search), 2) Augment the prompt with the retrieved context, 3) Generate an answer conditioned on that context. It is used to reduce hallucinations by grounding responses in actual data, to give the model access to up-to-date or proprietary information without fine-tuning, and to provide source attribution. RAG is preferred over fine-tuning for dynamic knowledge bases, because updating an index is far cheaper than retraining a model.
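The three steps map directly onto three functions. The sketch below is an assumption-heavy illustration: `retrieve` uses word overlap in place of vector search, `generate` is a hypothetical stub where a real system would call an LLM, and the knowledge base is made up.

```python
def retrieve(query, knowledge_base, top_k=2):
    """Step 1. Toy retriever: rank passages by number of shared lowercase words."""
    q = set(query.lower().split())
    ranked = sorted(knowledge_base, key=lambda p: len(q & set(p.lower().split())), reverse=True)
    return ranked[:top_k]

def build_prompt(query, passages):
    """Step 2. Augment: number the passages so the answer can cite its sources."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return ("Answer using only the context below and cite passage numbers.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

def generate(prompt):
    """Step 3. Hypothetical stand-in for an LLM call; a real system would
    send `prompt` to a model and return its completion."""
    return "(model output would appear here)"

kb = [
    "The refund window is 30 days from delivery.",
    "Support is available on weekdays.",
    "Shipping takes 3 to 5 business days.",
]
query = "what is the refund window"
passages = retrieve(query, kb)
prompt = build_prompt(query, passages)
answer = generate(prompt)
```

Numbering the passages in the prompt is what makes source attribution possible: the model can be asked to cite "[1]", and the application can map that back to the stored document.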

senior

What is contextual retrieval and how does Anthropic's approach work?

Model Answer

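Naive chunking strips each chunk of its surrounding document context. Contextual retrieval (Anthropic, 2024) uses an LLM to write a short, chunk-specific description of where the chunk sits in its parent document and prepends it to the chunk before building both the embedding index and the BM25 index. Example: instead of indexing "Revenue grew 30%", you index "Q3 2024 earnings discussion: Revenue grew 30% YoY driven by enterprise". Reported benefit: top-20 retrieval failures fell by 35% with contextual embeddings alone and by 49% with contextual embeddings plus contextual BM25. Cost: one LLM call per chunk at indexing time; with prompt caching, Anthropic estimates about $1.02 per million document tokens. Adding a reranker on top brought the reported reduction in retrieval failures to 67%.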
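The indexing-time step amounts to "generate context, prepend, then index the combined string". This is a minimal sketch, not Anthropic's implementation: `generate_chunk_context` is a hypothetical stub that uses the document's first line where a real system would prompt an LLM with the whole document plus the chunk, and the sample document is made up.

```python
def generate_chunk_context(document, chunk):
    """Hypothetical stand-in for the per-chunk LLM call. A real implementation
    would prompt a model with the full document and the chunk; this stub just
    reuses the document's first line as the situating context."""
    title = document.splitlines()[0].strip()
    return f"From '{title}'."

def contextualize(document, chunks):
    """Return the strings to index: generated context followed by the original chunk."""
    return [f"{generate_chunk_context(document, c)} {c}" for c in chunks]

# Made-up document: a title line followed by two one-line chunks.
document = (
    "ACME Q3 2024 earnings call\n"
    "Revenue grew 30% YoY driven by enterprise.\n"
    "Churn fell to 2%."
)
chunks = document.splitlines()[1:]
indexed = contextualize(document, chunks)
# indexed[0] == "From 'ACME Q3 2024 earnings call'. Revenue grew 30% YoY driven by enterprise."
```

The same contextualized strings would feed both the embedding index and the BM25 index, while the original chunk text can still be what is shown to the generator.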
