How Does It Work? (Part 2): Attention
How Does an LLM Work?
Why the model knows *France* matters more than *the*. The attention mechanism explained: queries, keys, values, multi-head attention.
Lesson 5: How Does It Work? (Part 2): Attention
Understanding the Complex: How Does an LLM Work?
Back to the anchor example:
"The capital of France is ___"
The model has tokenized the sentence and converted each token to a vector. Five vectors, entering a 96-layer network. Now what?
The problem: without additional machinery, the model would treat all five tokens equally. In effect, it would average their vectors. And an average of *The*, *capital*, *of*, *France*, and *is* would tell you almost nothing useful about what comes next.
The solution: attention.
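To make the contrast concrete, here is a minimal sketch in plain Python that compares a plain average with attention-style weights for the final position of the anchor sentence. Everything numeric in it is an assumption for illustration: the 3-dimensional `keys`, `values`, and `query` vectors are hand-picked, not taken from any real model, and one token per word is assumed. In a real LLM these vectors come from learned projections and are far larger.

```python
import math

tokens = ["The", "capital", "of", "France", "is"]

# Hypothetical key vectors, hand-picked for illustration (not learned).
keys = [
    [0.1, 0.0, 0.1],  # The
    [0.9, 0.2, 0.0],  # capital
    [0.0, 0.1, 0.1],  # of
    [1.0, 1.0, 0.0],  # France
    [0.1, 0.0, 0.2],  # is
]

# Hypothetical value vectors: the content each token contributes.
values = [
    [0.0, 0.1],  # The
    [0.5, 0.3],  # capital
    [0.0, 0.0],  # of
    [1.0, 0.9],  # France
    [0.1, 0.0],  # is
]

# Hypothetical query for the last position ("what should come next?").
query = [2.0, 2.0, 0.0]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product scores for one query, passed through softmax."""
    scale = math.sqrt(len(query))
    return softmax([dot(query, k) / scale for k in keys])


def mix(weights, vectors):
    """Weighted sum of vectors, one weight per vector."""
    dims = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dims)]


# Without attention: every token counts the same.
uniform = [1 / len(tokens)] * len(tokens)
averaged = mix(uniform, values)

# With attention: weights depend on how well each key matches the query.
weights = attention_weights(query, keys)
attended = mix(weights, values)

for token, u, w in zip(tokens, uniform, weights):
    print(f"{token:>8}  uniform={u:.2f}  attention={w:.2f}")
```

With these made-up numbers, the uniform weights give each token one fifth of the total, while the attention weights concentrate most of the mass on *France* and *capital*. The only point of the sketch is the mechanism: the weights are computed from the vectors themselves rather than fixed in advance.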