Vector Database Interview Questions — ANN, HNSW, FAISS | AmanAI Lab

senior

What is the difference between semantic search and keyword search, and when do you combine them?

Model Answer

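Keyword search (BM25): lexical term matching scored with a TF-IDF-style function (term-frequency saturation plus document-length normalization). Fast, needs no embeddings, and works well for precise queries such as identifiers, error codes, and rare terms. Semantic search: ranks by vector similarity between query and document embeddings, so it captures meaning and synonyms without requiring shared terms. Hybrid search: combine the two, either by score fusion (typically score = (1-α)×BM25 + α×cosine_sim, after normalizing both scores to a common range, because raw BM25 scores are unbounded) or by rank fusion with RRF (Reciprocal Rank Fusion), which merges the ranked lists and needs no score normalization. Use hybrid when: queries can be either precise ("GET /api/v2/users error 404") or semantic ("how do I handle authentication errors"), or when neither alone is sufficient. Elasticsearch, Weaviate, Qdrant, and Pinecone all support hybrid search natively.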
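
The two fusion strategies in this answer can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular database's implementation: the function names, the document ids, the sample scores, and the RRF constant k=60 (a commonly used default) are all assumptions made for the example.

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def min_max(scores):
    """Rescale a {doc_id: score} dict to [0, 1] so BM25 and cosine are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}


def weighted_hybrid(bm25_scores, cosine_scores, alpha=0.5):
    """score = (1 - alpha) * BM25 + alpha * cosine_sim, on normalized scores."""
    bm25_n, cos_n = min_max(bm25_scores), min_max(cosine_scores)
    docs = set(bm25_n) | set(cos_n)
    # A document missing from one retriever contributes 0 from that side.
    return {d: (1 - alpha) * bm25_n.get(d, 0.0) + alpha * cos_n.get(d, 0.0)
            for d in docs}


# Hypothetical results from a keyword retriever and a vector retriever.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_c", "doc_a", "doc_d"]
fused = rrf_fuse([bm25_ranking, vector_ranking])

bm25_scores = {"doc_a": 12.0, "doc_b": 8.0, "doc_c": 4.0}
cosine_scores = {"doc_a": 0.70, "doc_c": 0.90, "doc_d": 0.50}
hybrid = weighted_hybrid(bm25_scores, cosine_scores, alpha=0.5)
```

RRF only looks at ranks, so it is the safer default when the two score distributions are very different; the weighted form gives direct control through α but is sensitive to how the scores are normalized.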

mid

What is HNSW and why is it the dominant algorithm for vector search?

Model Answer

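HNSW (Hierarchical Navigable Small World) is a graph-based approximate nearest neighbor (ANN) algorithm built on a multi-layer graph. Construction: each vector is inserted at a randomly chosen maximum layer and linked to its nearest neighbors in that layer and every layer below it; upper layers are sparse (long-range connections) and lower layers are dense (local connections). Search: start at the entry point in the top layer, greedily move to the node closest to the query, then descend and repeat; at the bottom layer the search keeps a candidate list whose size is set by ef. Properties: roughly O(log N) search time in practice, recall that is tunable through ef (above 99% is achievable at the cost of latency), supports incremental inserts. Used by: Qdrant, Weaviate, pgvector, and many others (Pinecone is often listed too, though its index internals are proprietary). Trade-offs: high memory usage (the graph links are stored alongside the vectors) and slower build time than IVF, but typically faster search at a given recall.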
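
The search procedure can be illustrated with a toy two-layer graph. This is a deliberately simplified sketch, not a full HNSW implementation: the graph is built by hand rather than by the insertion algorithm, and each layer uses pure greedy search (equivalent to ef=1) instead of a candidate list. All names and data below are made up for the example.

```python
import math


def greedy_search_layer(query, entry, neighbors, vectors):
    """Within one layer, keep moving to the closest neighbor until none is closer."""
    current = entry
    while True:
        candidates = neighbors.get(current, [])
        best = min(candidates,
                   key=lambda n: math.dist(vectors[n], query),
                   default=current)
        if math.dist(vectors[best], query) < math.dist(vectors[current], query):
            current = best
        else:
            return current  # local minimum for this layer


def layered_search(query, entry, layers, vectors):
    """layers[0] is the sparse top layer, layers[-1] the dense bottom layer."""
    current = entry
    for neighbors in layers:
        # The result of one layer becomes the entry point of the next.
        current = greedy_search_layer(query, current, neighbors, vectors)
    return current


# Nine points on a line: node i sits at (i, 0).
vectors = {i: (float(i), 0.0) for i in range(9)}
# Top layer: only nodes 0, 4, 8, joined by long-range links.
top = {0: [4], 4: [0, 8], 8: [4]}
# Bottom layer: every node, linked to its immediate neighbors.
bottom = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 8] for i in range(9)}

nearest = layered_search((6.2, 0.0), entry=0, layers=[top, bottom], vectors=vectors)
```

In this run the top layer jumps 0 → 4 → 8 in two hops, and the bottom layer then walks 8 → 7 → 6. The long-range links are what keep the number of hops small as the dataset grows.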

mid

What are the trade-offs between different embedding models for RAG?

Model Answer

Small fast models (all-MiniLM-L6-v2, 22M params, 384-dim): ~10ms inference, good for general-purpose use, free to self-host. Medium models (BGE-M3, E5-large): better quality, multilingual support (BGE-M3 and the multilingual E5 variants), 300-500ms. Large/API models (text-embedding-3-large from OpenAI, 3072-dim): typically high quality, ~50ms API latency, billed per token. Treat all of these latencies as rough figures: they depend on hardware, batch size, and input length. Key factors: embedding dimension (higher often means better quality, but memory and search cost grow linearly with it), max sequence length (some models truncate at 512 tokens), multilingual support, domain-specific performance. For RAG: BGE-M3 or E5-large are popular open-source choices. Always evaluate on your specific domain.
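
The memory side of the dimension trade-off is simple arithmetic: raw float32 storage is num_vectors × dim × 4 bytes. The sketch below assumes a hypothetical corpus of 10 million chunks and counts only the raw vectors, not index overhead such as HNSW graph links.

```python
def raw_vector_size_gib(num_vectors, dim, bytes_per_value=4):
    """Raw storage for float32 vectors in GiB (excludes any index overhead)."""
    return num_vectors * dim * bytes_per_value / 2**30


NUM_CHUNKS = 10_000_000  # hypothetical corpus size

sizes = {
    dim: raw_vector_size_gib(NUM_CHUNKS, dim)
    for dim in (384, 1024, 3072)
}

for dim, gib in sizes.items():
    print(f"{dim:>5}-dim: {gib:6.1f} GiB")
```

At this corpus size the 384-dim vectors need about 14 GiB, while 3072-dim vectors need eight times as much, which is often the difference between fitting in RAM on one machine and not.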

mid

What is binary quantization and when is it worth using?

Model Answer

Binary quantization compresses each float embedding to one bit per dimension (the sign of the value). A 1024-dim float32 embedding (4096 bytes) becomes 1024 bits = 128 bytes, a 32× reduction. The distance metric becomes Hamming distance, computed with XOR and popcount, which takes only a few CPU instructions per machine word. Quality loss: typically a few percentage points of recall (the often-quoted 1-3 points depends on the model and holds best for high-dimensional embeddings), largely recoverable by reranking the top-100 binary candidates with full-precision vectors. Worth it when: the float32 index exceeds RAM but the binary one fits, or query throughput matters more than last-percent quality. Qdrant, Milvus, and Weaviate all support it. Combined with reranking, it often comes close to full-precision retrieval quality at much higher throughput (figures around 10× are workload-dependent).
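
A minimal sketch of the two-stage pattern described here: a Hamming-distance scan over sign bits, then a full-precision rerank of the shortlist. The 4-dim vectors are toy data chosen for readability, and packing bits into a Python int stands in for the packed byte arrays a real engine would use.

```python
def binarize(vec):
    """Pack the sign of each dimension into one int (bit i set if vec[i] > 0)."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits


def hamming(a, b):
    """Number of differing bits: XOR, then count the ones."""
    return bin(a ^ b).count("1")


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(query, vectors, codes, shortlist=3, top_k=1):
    q_code = binarize(query)
    # Stage 1: cheap scan over binary codes, keep the closest `shortlist` ids.
    candidates = sorted(range(len(codes)),
                        key=lambda i: hamming(q_code, codes[i]))[:shortlist]
    # Stage 2: rerank only the shortlist with full-precision dot products.
    return sorted(candidates, key=lambda i: -dot(query, vectors[i]))[:top_k]


vectors = [
    [0.9, 0.1, -0.2, 0.4],
    [0.1, 0.9, -0.8, 0.1],
    [-0.5, 0.3, 0.7, -0.1],
    [0.8, -0.2, -0.1, 0.5],
]
codes = [binarize(v) for v in vectors]
query = [1.0, 0.05, -0.1, 0.5]

result = search(query, vectors, codes)

# Storage arithmetic from the answer: 1024 float32 values vs 1024 bits.
float_bytes = 1024 * 4
binary_bytes = 1024 // 8
```

Vectors 0 and 1 have identical sign patterns, so the binary stage cannot tell them apart; the full-precision rerank is what separates them. That is the recall loss and its recovery in miniature.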

senior

What is matryoshka representation learning and why does it matter?

Model Answer

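Matryoshka embeddings (Kusupati et al., 2022) train a model so that prefixes of the embedding (the first 64, 128, 256 dims) are themselves usable embeddings, with quality decreasing gradually as the prefix shrinks. One model serves many resolution levels: the same vector can be stored or searched at several sizes. OpenAI's text-embedding-3 series supports this kind of shortening through its dimensions parameter. Why it matters: at retrieval time, do a coarse search with truncated 256-dim vectors (fast, less RAM) to get the top-1000 candidates, then rerank them with the full-resolution 3072-dim vectors. Re-normalize truncated vectors to unit length before computing cosine or dot-product similarity. The search index is smaller and latency is lower, though the full vectors must still be kept somewhere (for example on disk) for the rerank step. Without matryoshka training you would need a separate model per dimension, which multiplies cost; matryoshka does it in one training run.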
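
The coarse-then-rerank flow can be sketched as follows. The 4-dim vectors are hand-written stand-ins, not output from a matryoshka-trained model, and the 2-dim prefix plays the role of the 256-dim truncation; function names and parameters are assumptions for the example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def truncate_and_normalize(vec, dims):
    """Keep the first `dims` values and rescale the prefix to unit length."""
    prefix = vec[:dims]
    norm = math.sqrt(dot(prefix, prefix))
    return [x / norm for x in prefix] if norm else list(prefix)


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def two_stage_search(query, corpus, coarse_dims, shortlist, top_k):
    # Stage 1: rank everything using only the truncated, re-normalized prefixes.
    q_coarse = truncate_and_normalize(query, coarse_dims)
    coarse = [truncate_and_normalize(v, coarse_dims) for v in corpus]
    candidates = sorted(range(len(corpus)),
                        key=lambda i: -dot(q_coarse, coarse[i]))[:shortlist]
    # Stage 2: rerank the shortlist using the full-resolution vectors.
    return sorted(candidates, key=lambda i: -cosine(query, corpus[i]))[:top_k]


corpus = [
    [0.9, 0.1, 0.0, 0.4],
    [0.8, 0.2, 0.5, -0.3],
    [-0.7, 0.6, 0.1, 0.2],
    [0.1, 0.9, 0.3, 0.1],
]
query = [1.0, 0.1, 0.4, -0.2]

best = two_stage_search(query, corpus, coarse_dims=2, shortlist=2, top_k=1)
```

Here the coarse stage ranks vector 0 first, but the full-resolution rerank promotes vector 1 because the later dimensions agree with the query. In production the coarse vectors would be precomputed and indexed rather than truncated per query.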

senior

When would you use Pinecone vs Qdrant vs pgvector for a production RAG system?

Model Answer

Pinecone: fully managed, minimal ops overhead, auto-scaling; good for teams that want to focus on product rather than infrastructure. Downsides: expensive at scale, limited customization, vendor lock-in. Qdrant: open-source, self-hostable or managed, rich filtering (payload filters), good performance, active development. A good balance of control and features at medium to large scale. pgvector: extends PostgreSQL with vector types and operators; ideal when you already use Postgres and want simplicity, or need ACID transactions alongside vector search. Limitations: HNSW performance is generally somewhat lower than in dedicated vector DBs, and scaling requires Postgres expertise. Rule of thumb: Pinecone for speed of setup, Qdrant for cost and control, pgvector for simplicity.

