Concept
Explain the transformer architecture
Answer
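A stack of identical layers. Each layer has two sub-layers: multi-head self-attention (every position attends to the positions it is allowed to see) and a position-wise feed-forward network (2 linear layers with an activation between them). Each sub-layer is wrapped in a residual connection and layer normalization. Encoder: bidirectional self-attention. Decoder: causal (masked) self-attention + cross-attention to the encoder output. No recurrence, so all positions in a sequence are processed in parallel during training; autoregressive generation still emits one token at a time.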
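A minimal NumPy sketch of one such layer may help make the pieces concrete. It assumes the post-norm arrangement (normalize after adding the residual), a ReLU activation, and a single unbatched sequence. The function names, parameter dictionary keys, and sizes are all illustrative assumptions. Learned layer-norm gain and bias, dropout, positional encodings, and cross-attention are left out for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector (no learned gain/bias here).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, wq, wk, wv, wo, n_heads, causal=False):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    # Scaled dot-product scores: (n_heads, seq_len, seq_len)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    if causal:
        # Decoder-style mask: position i may not attend to positions j > i.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    out = softmax(scores) @ v  # (n_heads, seq_len, d_head)
    # Concatenate heads back to (seq_len, d_model), then project.
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ wo

def feed_forward(x, w1, b1, w2, b2):
    # Two linear layers with a ReLU in between, applied per position.
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

def transformer_layer(x, p, n_heads, causal=False):
    # Post-norm: LayerNorm(x + Sublayer(x)) around each sub-layer.
    attn = self_attention(x, p["wq"], p["wk"], p["wv"], p["wo"], n_heads, causal)
    x = layer_norm(x + attn)
    x = layer_norm(x + feed_forward(x, p["w1"], p["b1"], p["w2"], p["b2"]))
    return x

def init_params(rng, d_model, d_ff):
    # Hypothetical initializer: Gaussian weights scaled by 1/sqrt(fan_in).
    def w(rows, cols):
        return rng.normal(size=(rows, cols)) / np.sqrt(rows)
    return {
        "wq": w(d_model, d_model), "wk": w(d_model, d_model),
        "wv": w(d_model, d_model), "wo": w(d_model, d_model),
        "w1": w(d_model, d_ff), "b1": np.zeros(d_ff),
        "w2": w(d_ff, d_model), "b2": np.zeros(d_model),
    }

rng = np.random.default_rng(0)
params = init_params(rng, d_model=8, d_ff=32)
tokens = rng.normal(size=(5, 8))  # 5 positions, 8 features each

encoder_out = transformer_layer(tokens, params, n_heads=2)               # bidirectional
decoder_out = transformer_layer(tokens, params, n_heads=2, causal=True)  # causal
```

Stacking the layer means calling `transformer_layer` repeatedly with a separate parameter set per layer. With `causal=True`, changing a later token leaves the outputs at earlier positions unchanged, which is the property that lets a decoder be trained on all positions at once.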