Text processing, word embeddings, sequence models, BERT and language understanding fundamentals.
Key Concepts to Know
Interview Questions
What is the difference between BERT and sentence transformers for semantic search?
Model Answer
BERT (and its variants): outputs token-level embeddings and is designed for classification/NER/QA. To get a sentence embedding you can average-pool the token vectors or take the [CLS] token, but neither was trained for semantic similarity, so raw BERT performs poorly for retrieval. Sentence Transformers (SBERT): BERT-style encoders fine-tuned specifically for semantic textual similarity, using a siamese network architecture on NLI/STS datasets. Optimized for: fast similarity computation (each sentence is encoded once, then compared by cosine similarity), sentence-level tasks. On retrieval benchmarks such as BEIR, dedicated embedding models in this family (GTE, E5, BGE) significantly outperform raw BERT pooling. For RAG: use a sentence-embedding model, not raw BERT.
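The pooling step mentioned above can be sketched in a few lines. This is a minimal illustration, not a full pipeline: the token vectors and attention masks are made-up toy values standing in for an encoder's output, and the helper names `mean_pool` and `cosine_similarity` are hypothetical.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            for i, x in enumerate(vec):
                total[i] += x
            count += 1
    if count == 0:
        raise ValueError("attention_mask contains no real tokens")
    return [x / count for x in total]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy token-level outputs (assumed values): 3 positions, 2 dimensions each.
tokens_a = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
mask_a = [1, 1, 0]  # last position is padding and must be ignored
tokens_b = [[2.0, 2.0], [0.0, 0.0], [0.0, 0.0]]
mask_b = [1, 0, 0]  # only the first position is a real token

sentence_a = mean_pool(tokens_a, mask_a)
sentence_b = mean_pool(tokens_b, mask_b)
similarity = cosine_similarity(sentence_a, sentence_b)
print(sentence_a, sentence_b, similarity)
```

The mechanics are the same for raw BERT and for a sentence-embedding model. What differs is whether the encoder was trained so that the pooled vectors of similar sentences end up close together.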
Why do tokenizers tokenize different languages with very different efficiency?
Model Answer
BPE / Unigram tokenizers build their vocabulary from frequency statistics in the training corpus (BPE learns merges, Unigram keeps the most useful subwords). If English is 90% of the corpus, "the" is one token but a common Hindi word might be 5-8 tokens because its subwords are rarer in the data. Consequences: non-English inference can be roughly 2-4× more expensive per request (the exact factor depends on the language and the tokenizer), the context window holds less information, and latency is worse. Solutions: train a multilingual tokenizer on a balanced corpus (Llama 3 uses a 128K vocab partly for this), use language-specific models for non-English text, or strip/translate inputs at the boundary. Bottom line: token count is not symmetric across languages.
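One way to see the asymmetry is to look at the worst case. The sketch below assumes a byte-level BPE tokenizer, where text that matches no learned merge falls back to one token per UTF-8 byte. The function names are hypothetical, and real token counts depend on the tokenizer's actual merges.

```python
def byte_fallback_tokens(text):
    """Upper bound on token count for a byte-level BPE tokenizer:
    one token per UTF-8 byte when no learned merge applies."""
    return len(text.encode("utf-8"))


def bytes_per_char(text):
    """Average number of UTF-8 bytes per Unicode code point."""
    return byte_fallback_tokens(text) / len(text)


english = "hello"
# Hindi greeting "namaste" written in Devanagari, spelled with escapes
# so the example does not depend on the file's encoding.
hindi = "\u0928\u092e\u0938\u094d\u0924\u0947"

for label, word in [("english", english), ("hindi", hindi)]:
    print(label, len(word), byte_fallback_tokens(word), bytes_per_char(word))
```

ASCII letters take one byte each, while Devanagari code points take three. So before any merges are learned, Hindi text already starts from a sequence about three times longer, and an English-heavy corpus then provides fewer merges to shorten it.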
What is Byte Pair Encoding (BPE) and why is it used for tokenization?
Model Answer
BPE starts with a base vocabulary of characters (or bytes, in byte-level BPE) and iteratively merges the most frequent adjacent pair of symbols into a new symbol. Result: a vocabulary of subwords that efficiently represents any text. Benefits: handles OOV (out-of-vocabulary) words by splitting them into known subwords, balances between character-level (handles everything, but sequences get long) and word-level (inefficient vocabulary), good compression ratio. GPT models use BPE-based tokenizers (tiktoken). Example (illustrative; the actual split depends on the learned merges): "tokenization" → ["token", "iz", "ation"]. Trade-offs: language-dependent (English-optimized tokenizers tokenize other languages less efficiently), can create semantically odd splits.
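The merge loop described above can be sketched as follows. This is a toy character-level version under simplifying assumptions: no end-of-word marker, no byte-level fallback, and ties broken by comparing the pairs themselves. The function names and the corpus are hypothetical and do not reflect how tiktoken is implemented.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn an ordered list of merges from a list of words."""
    # Each distinct word starts as a tuple of characters, weighted by frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties go to the lexicographically larger pair.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (assumed): word frequencies are chosen only for illustration.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, num_merges=10)
print(merges)
print(encode("newest", merges))
print(encode("xyz", merges))  # unseen characters stay as single symbols
```

Words that share frequent substrings with the corpus compress into a few tokens, while a word the merges never saw stays split into individual characters. That is the OOV behaviour described above.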