Vector databases enable semantic search and similarity search at scale. Interviews cover indexing algorithms, distance metrics, approximate nearest neighbor (ANN) search, and production trade-offs.
Interview Questions
What is the difference between semantic search and keyword search, and when do you combine them?
Model Answer
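Keyword search (BM25): ranks documents by exact term matches. It weights rare terms more heavily (an IDF component), saturates repeated term frequency, and normalizes for document length. It is fast, needs no embeddings, and excels at precise queries such as identifiers, error codes, and product names. Semantic search: ranks by vector similarity between query and document embeddings. It captures meaning and synonyms even when no terms overlap, but it can blur exact tokens such as IDs or version numbers. Hybrid search combines both signals in one of two ways. The first is a weighted sum, score = (1-α)×norm(BM25) + α×cosine_sim. BM25 scores are unbounded, so they must be rescaled (for example, min-max per query) before mixing. The second is Reciprocal Rank Fusion (RRF), which merges the two ranked lists using only rank positions and therefore needs no score normalization. Use hybrid when queries can be either precise ("GET /api/v2/users error 404") or semantic ("how do I handle authentication errors"), or when neither method alone gives sufficient recall. Elasticsearch supports hybrid search natively, as do Weaviate, Qdrant, and Pinecone.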
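Both fusion strategies fit in a few lines. The following is a minimal sketch, not any engine's built-in implementation. The doc IDs and scores are hypothetical. The RRF constant k=60 is an assumption, although it is a commonly used default. The weighted variant applies min-max normalization per query so that BM25 and cosine scores share a [0, 1] scale.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of doc IDs with RRF: score(d) = sum over lists of 1 / (k + rank).

    Ranks start at 1. Only positions are used, so raw BM25 and cosine scores
    never need to be put on a common scale.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def min_max(scores):
    """Rescale a {doc: score} dict to [0, 1] so different scorers become comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0          # avoid division by zero when all scores tie
    return {d: (s - lo) / span for d, s in scores.items()}

def weighted_fusion(bm25_scores, cosine_scores, alpha=0.5):
    """score = (1 - alpha) * norm(BM25) + alpha * norm(cosine); a missing score counts as 0."""
    b, c = min_max(bm25_scores), min_max(cosine_scores)
    fused = {d: (1 - alpha) * b.get(d, 0.0) + alpha * c.get(d, 0.0) for d in set(b) | set(c)}
    return sorted(fused, key=fused.get, reverse=True)

bm25_ranking = ["a", "b", "c"]      # hypothetical keyword results
vector_ranking = ["c", "a", "d"]    # hypothetical semantic results
print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))  # ['a', 'c', 'b', 'd']
```

Document "a" wins under RRF because it ranks near the top of both lists. A document that appears in only one list ("b" or "d") still receives partial credit.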
What is HNSW and why is it the dominant algorithm for vector search?
Model Answer
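HNSW (Hierarchical Navigable Small World) is an approximate nearest neighbor (ANN) algorithm built on a multi-layer proximity graph.

Construction: each vector is assigned a random top layer, with exponentially fewer nodes on each higher layer. On every layer it belongs to, the vector is linked to up to M nearby neighbors. As a result, upper layers are sparse with long-range links, and the bottom layer is dense with local links.

Search: start at the entry point on the top layer and greedily move to the neighbor closest to the query. Then drop to the next layer and repeat. On the bottom layer, run a beam search that keeps the ef best candidates.

Properties: search time is roughly logarithmic in practice. Recall is high (often >99%) and can be tuned at query time by raising ef. The index supports incremental inserts without a rebuild.

Why it dominates: it offers one of the best recall-versus-latency trade-offs among ANN methods. It also needs no training step, unlike IVF, which must first cluster the data.

Used by: Qdrant, Weaviate, Pinecone, pgvector.

Trade-offs: memory usage is high because the graph is stored alongside the vectors, typically in RAM, and build time is slower than IVF. In exchange, search is typically faster and more accurate.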
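The construction and search steps can be sketched in pure Python. This is a teaching version under stated simplifications, not the reference algorithm or any database's code. Neighbor lists are pruned to the M closest nodes instead of using the paper's diversity heuristic. The bottom layer uses the same M as the upper layers. Distance is squared Euclidean. The class name TinyHNSW and its parameter defaults are hypothetical.

```python
import heapq
import math
import random

def dist(a, b):
    """Squared Euclidean distance (same ordering as L2 distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

class TinyHNSW:
    """Simplified HNSW: random levels, greedy descent, beam search with ef per layer."""

    def __init__(self, m=8, ef_construction=64, seed=0):
        self.m = m
        self.ef_construction = ef_construction
        self.ml = 1 / math.log(m)            # level multiplier: higher layers get exponentially fewer nodes
        self.rng = random.Random(seed)
        self.vectors = []                    # node id -> vector
        self.layers = []                     # layers[l] maps node id -> list of neighbor ids
        self.entry, self.max_level = None, -1

    def _search_layer(self, q, entry_points, ef, layer):
        """Beam search on one layer; returns up to ef (distance, node) pairs, closest first."""
        visited = set(entry_points)
        candidates = [(dist(q, self.vectors[e]), e) for e in entry_points]
        heapq.heapify(candidates)                          # min-heap: next node to expand
        results = [(-d, e) for d, e in candidates]         # max-heap via negation: worst result on top
        heapq.heapify(results)
        while len(results) > ef:
            heapq.heappop(results)
        while candidates:
            d, node = heapq.heappop(candidates)
            if d > -results[0][0]:
                break                                      # nearest unexpanded node is worse than every result
            for nb in self.layers[layer][node]:
                if nb in visited:
                    continue
                visited.add(nb)
                dn = dist(q, self.vectors[nb])
                if len(results) < ef or dn < -results[0][0]:
                    heapq.heappush(candidates, (dn, nb))
                    heapq.heappush(results, (-dn, nb))
                    if len(results) > ef:
                        heapq.heappop(results)
        return sorted((-neg_d, n) for neg_d, n in results)

    def insert(self, vec):
        node = len(self.vectors)
        self.vectors.append(vec)
        level = int(-math.log(1.0 - self.rng.random()) * self.ml)
        while len(self.layers) <= level:
            self.layers.append({})
        for l in range(level + 1):
            self.layers[l][node] = []
        if self.entry is None:
            self.entry, self.max_level = node, level
            return
        ep = [self.entry]
        for l in range(self.max_level, level, -1):         # greedy descent (ef = 1) above the node's level
            ep = [self._search_layer(vec, ep, 1, l)[0][1]]
        for l in range(min(level, self.max_level), -1, -1):
            found = self._search_layer(vec, ep, self.ef_construction, l)
            self.layers[l][node] = [n for _, n in found[: self.m]]
            for nb in self.layers[l][node]:                 # add reverse links, then prune to M closest
                links = self.layers[l][nb]
                links.append(node)
                if len(links) > self.m:
                    links.sort(key=lambda x: dist(self.vectors[nb], self.vectors[x]))
                    del links[self.m:]
            ep = [n for _, n in found]
        if level > self.max_level:
            self.entry, self.max_level = node, level

    def search(self, q, k=1, ef=32):
        ep = [self.entry]
        for l in range(self.max_level, 0, -1):              # sparse upper layers: greedy, ef = 1
            ep = [self._search_layer(q, ep, 1, l)[0][1]]
        found = self._search_layer(q, ep, max(ef, k), 0)   # dense bottom layer: beam search with ef
        return [n for _, n in found[:k]]

rng = random.Random(42)
points = [[rng.random() for _ in range(8)] for _ in range(300)]
index = TinyHNSW(m=8, ef_construction=64)
for p in points:
    index.insert(p)
print(index.search(points[0], k=3))
```

Raising ef in search() widens the bottom-layer beam. This is the query-time recall knob the answer refers to, and it trades latency for accuracy.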
What are the trade-offs between different embedding models for RAG?
Model Answer
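Small fast models (all-MiniLM-L6-v2: ~22M params, 384-dim) take roughly 10 ms per query on modest hardware and give good general-purpose quality. They are free to run locally, but English-centric, and all-MiniLM-L6-v2 truncates inputs at 256 word pieces by default.

Medium models (BGE-M3, E5-large: roughly 335M-570M params, 1024-dim) give better retrieval quality at noticeably higher inference cost, especially on CPU. BGE-M3 is multilingual and accepts up to 8192 tokens. E5 offers a separate multilingual-e5-large variant.

Large/API models (OpenAI's text-embedding-3-large, 3072-dim) offer strong quality with no model infrastructure to run. However, each call adds a network round-trip (often tens of milliseconds or more) and costs money per token.

Key factors:
- Embedding dimension: higher usually means better quality but more memory and slower search.
- Max sequence length: some models truncate at 512 tokens, so long chunks silently lose content.
- Multilingual support.
- Domain-specific performance.

For RAG, BGE-M3 and E5-large are popular open-source choices. Always evaluate candidates on queries and documents from your own domain.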
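"Evaluate on your domain" can start as simply as scoring each candidate model's retrieved lists against a small hand-labeled query set. This sketch assumes you have already embedded and searched with each model. The results and relevant dictionaries below are hypothetical stand-ins for that output and for your labels.

```python
def recall_at_k(results, relevant, k):
    """Mean over queries of (relevant docs found in top-k) / (number of relevant docs)."""
    per_query = []
    for qid, ranked in results.items():
        rel = relevant[qid]
        per_query.append(len(rel.intersection(ranked[:k])) / len(rel))
    return sum(per_query) / len(per_query)

def mean_reciprocal_rank(results, relevant):
    """Mean over queries of 1 / (rank of the first relevant doc); 0 if none is retrieved."""
    total = 0.0
    for qid, ranked in results.items():
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant[qid]:
                total += 1.0 / rank
                break
    return total / len(results)

# Hypothetical top-3 results from one candidate model, plus hand-labeled relevant docs.
results = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9", "d4"]}
relevant = {"q1": {"d1"}, "q2": {"d4", "d5"}}
print(recall_at_k(results, relevant, k=3))   # 0.75
print(mean_reciprocal_rank(results, relevant))
```

Running the same query set through each candidate model and comparing these numbers gives a like-for-like comparison. Public leaderboards cannot provide that for your specific data.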
What is binary quantization and when is it worth using?
Model Answer
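Binary quantization compresses each float embedding to a single bit per dimension (the sign of the value). A 1024-dim float32 embedding (4096 bytes) becomes 1024 bits = 128 bytes, a 32× reduction. The distance metric becomes Hamming distance, which costs only a few CPU instructions per vector (XOR followed by a popcount).

Quality loss: for high-dimensional models that tolerate it well, expect roughly 1-3 percentage points of recall. Most of that can be recovered by oversampling (for example, keeping the top-100 binary candidates) and reranking them with the full-precision vectors. Lower-dimensional models can lose considerably more, so measure before adopting it.

Worth it when:
- The index would not fit in RAM as float32 but does fit as binary codes.
- Query throughput matters more than the last percent of quality.

Qdrant, Milvus, and Weaviate all support it. Combined with reranking, it can come close to full-precision retrieval quality at up to roughly 10× the throughput, depending on hardware and data.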
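The two-stage scheme can be illustrated with a minimal sketch. Python ints stand in for the packed bit arrays and SIMD popcounts a real engine would use. The oversample factor of 4 and the random test vectors are assumptions chosen only to show the mechanics.

```python
import heapq
import math
import random

def binarize(vec):
    """Pack the sign of each dimension into an int: bit i is 1 if vec[i] > 0."""
    code = 0
    for i, x in enumerate(vec):
        if x > 0:
            code |= 1 << i
    return code

def hamming(a, b):
    """Number of differing bits: XOR, then popcount."""
    return bin(a ^ b).count("1")

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def search(query, vectors, codes, k=5, oversample=4):
    """Stage 1: Hamming scan over 1-bit codes. Stage 2: rerank the shortlist in full precision."""
    q_code = binarize(query)
    shortlist = heapq.nsmallest(k * oversample, range(len(codes)),
                                key=lambda i: hamming(q_code, codes[i]))
    return sorted(shortlist, key=lambda i: dot(query, vectors[i]), reverse=True)[:k]

dim = 1024
print(dim * 4, "bytes as float32 ->", dim // 8, "bytes as bits")  # 4096 -> 128

rng = random.Random(0)
vectors = [unit([rng.gauss(0, 1) for _ in range(64)]) for _ in range(500)]
codes = [binarize(v) for v in vectors]
print(search(vectors[17], vectors, codes, k=3))  # starts with 17, the query's own vector
```

Only the shortlist touches float vectors. This is why the full-precision copies can live on disk while the compact codes stay in RAM.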
What is matryoshka representation learning and why does it matter?
Model Answer
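Matryoshka Representation Learning (Aditya Kusupati et al., 2022) trains a model so that prefixes of the embedding (the first 64, 128, 256, ... dims) are themselves usable embeddings, with quality degrading gracefully as the prefix shrinks. One model and one stored vector serve many resolution levels. OpenAI's text-embedding-3 series supports this and lets callers request shortened embeddings through a dimensions parameter.

Why it matters: at retrieval time, run a coarse search with vectors truncated to 256 dims (faster, less RAM) to get the top-1000 candidates. Then rerank those candidates with the full 3072-dim vectors. Truncated prefixes should be re-normalized to unit length before cosine or dot-product scoring. This approach saves memory in the hot index and reduces latency.

Without Matryoshka training, you would need a separate model for each target dimension, multiplying training cost. Matryoshka gets every size from one training run.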
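The coarse-to-fine retrieval pattern is sketched below under loud assumptions. The random vectors are not Matryoshka-trained, so the example demonstrates only the mechanics (truncate, re-normalize, shortlist, rerank), not the quality behavior. The text's sizes (256 / 3072 dims, top-1000) are scaled down to 64 / 512 dims and a shortlist of 50 so that pure Python stays fast.

```python
import math
import random

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def truncate(v, dims):
    """Keep the first `dims` coordinates and re-normalize to unit length."""
    return normalize(v[:dims])

def cosine(a, b):
    # Inputs are unit vectors, so the dot product equals cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def coarse_to_fine(query, docs, coarse_dims=256, shortlist=1000, k=10):
    """Stage 1: rank all docs by truncated prefixes. Stage 2: rerank the shortlist at full size."""
    q_small = truncate(query, coarse_dims)
    small_docs = [truncate(d, coarse_dims) for d in docs]  # in practice precomputed and indexed
    coarse = sorted(range(len(docs)), key=lambda i: cosine(q_small, small_docs[i]), reverse=True)
    q_full = normalize(query)
    return sorted(coarse[:shortlist], key=lambda i: cosine(q_full, normalize(docs[i])),
                  reverse=True)[:k]

rng = random.Random(0)
docs = [normalize([rng.gauss(0, 1) for _ in range(512)]) for _ in range(300)]
print(coarse_to_fine(docs[42], docs, coarse_dims=64, shortlist=50, k=5))  # starts with 42
```

Only the short prefixes need to sit in the fast index. The full vectors are read just for the shortlist.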
When would you use Pinecone vs Qdrant vs pgvector for a production RAG system?
Model Answer
Pinecone: fully managed, zero ops overhead, and auto-scaling. It suits teams that want to focus on product rather than infrastructure. Downsides: expensive at scale, limited customization, and vendor lock-in.

Qdrant: open source and available self-hosted or managed. It offers rich filtering (payload filters applied during search), good performance, and active development, giving a good balance of control and features at medium to large scale.

pgvector: extends PostgreSQL with a vector type plus HNSW and IVFFlat indexes. It is ideal when you already run Postgres and want a single system, or when you need ACID transactions alongside vector search. Limitations: HNSW performance can trail dedicated vector databases at large scale, and scaling out requires Postgres expertise.

Rule of thumb: Pinecone for speed of delivery, Qdrant for cost and control, pgvector for simplicity.