NLP Interview Questions — Word2Vec, BERT, Text Classification | AmanAI Lab

senior

What is the difference between BERT and sentence transformers for semantic search?

Model Answer

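BERT (and its variants): outputs token-level embeddings; pre-trained with masked language modeling and typically fine-tuned for classification/NER/QA. To get a sentence embedding you mean-pool the token vectors or take the [CLS] token, but neither was trained for semantic similarity, so without further fine-tuning they perform poorly for retrieval. Sentence Transformers (SBERT): a BERT-style encoder fine-tuned specifically for semantic textual similarity using a siamese (shared-weight) network architecture; the original SBERT models were trained on NLI/STS datasets. Optimized for: fast similarity computation (each text is encoded once, then compared by cosine similarity or dot product) and sentence-level tasks. Modern embedding models in this bi-encoder family (GTE, E5, BGE) are evaluated on retrieval benchmarks such as BEIR and substantially outperform pooled raw-BERT embeddings on retrieval. For RAG: use a sentence-embedding model, not raw BERT.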
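To make the pooling step concrete, here is a minimal sketch of mean pooling followed by cosine similarity, in plain Python. The token vectors are made-up numbers, not real model output, and the function names are hypothetical. It shows only the mechanics shared by both approaches; whether the resulting vectors are useful for retrieval depends on how the encoder was trained.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padded positions (mask == 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vec):
                total[i] += value
            count += 1
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in total]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy token embeddings (illustrative values only).
query_tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]  # last row is padding
query_mask = [1, 1, 0]
doc_tokens = [[2.0, 0.0], [0.0, 2.0]]
doc_mask = [1, 1]

query_vec = mean_pool(query_tokens, query_mask)
doc_vec = mean_pool(doc_tokens, doc_mask)
score = cosine_similarity(query_vec, doc_vec)
print(query_vec, doc_vec, score)
```

In a real pipeline the token vectors would come from the encoder's last hidden layer, and document vectors would be computed once and stored in an index.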

mid

Why do tokenizers tokenize different languages with very different efficiency?

Model Answer

BPE / Unigram tokenizers build their vocabulary from frequencies in the training corpus. If English is 90% of the corpus, "the" is one token, but a common Hindi word might be 5-8 tokens because its subwords are rarer in the data and never earn their own vocabulary entries. Consequences: non-English text can need roughly 2-4× as many tokens for the same content, so inference is more expensive per request, the context window holds less information, and latency is worse. Solutions: train a multilingual tokenizer on a balanced corpus (Llama 3 uses a 128K vocabulary partly for this), use language-specific models for non-English, or strip/translate inputs at the boundary. Bottom line: token count for the same content is not symmetric across languages.
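One part of this asymmetry is visible before any vocabulary is learned. The sketch below is not a tokenizer; it only counts the UTF-8 bytes that a byte-level tokenizer would start from, assuming a byte-level base vocabulary. Real token counts depend on the learned vocabulary and will differ from these byte counts. The helper name is hypothetical.

```python
def base_units(text):
    """Return (code points, UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))

english = "hello"
# Hindi "namaste", written with escapes so the code points are unambiguous.
hindi = "\u0928\u092e\u0938\u094d\u0924\u0947"

en_chars, en_bytes = base_units(english)
hi_chars, hi_bytes = base_units(hindi)

# Bytes per code point: the number of base symbols a byte-level
# tokenizer sees per character before any merges are applied.
en_ratio = en_bytes / en_chars
hi_ratio = hi_bytes / hi_chars
print(en_chars, en_bytes, en_ratio)
print(hi_chars, hi_bytes, hi_ratio)
```

Devanagari code points take three bytes each in UTF-8, so the Hindi word starts from three times as many base symbols per character. It stays fragmented unless the training corpus contains enough Hindi for those byte sequences to be merged.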

mid

What is Byte Pair Encoding (BPE) and why is it used for tokenization?

Model Answer

BPE starts with a vocabulary of single characters (or bytes, in byte-level BPE) and iteratively merges the most frequent adjacent pair of symbols into a new symbol until the target vocabulary size is reached. Result: a vocabulary of subwords that represents common text efficiently and, with a byte-level base, can encode any input. Benefits: handles OOV (out-of-vocabulary) words by splitting them into known subwords, balances between character-level (handles everything but produces long sequences) and word-level (very large vocabulary, no OOV handling), good compression ratio. GPT models use byte-level BPE tokenizers (implemented in OpenAI's tiktoken library). Example (illustrative; the actual split depends on the learned merges): "tokenization" → ["token", "iz", "ation"]. Trade-offs: language-dependent (English-optimized tokenizers tokenize other languages less efficiently), can create semantically odd splits.
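The merge loop described above can be sketched in a few lines of Python. This is a toy character-level version under simplifying assumptions: whitespace pre-tokenization, no end-of-word marker, and ties broken by comparing the pairs themselves. Production tokenizers differ in all three, and the function names here are hypothetical.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn an ordered list of merges from a whitespace-split corpus."""
    word_freqs = Counter(corpus.split())
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        # Highest count wins; ties are broken by the pair itself
        # so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        words = merge_pair(words, best)
        merges.append(best)
    return merges

def encode(word, merges):
    """Split a word into subwords by replaying the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = next(iter(merge_pair({symbols: 1}, pair)))
    return list(symbols)

merges = train_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(encode("lowly", merges))
```

An unseen word such as "lowly" is still encodable: the learned "low" subword is reused and the remaining characters fall back to single symbols, which is the OOV behavior the answer describes.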
