
Attention, Cross-Attention & Transformers

If you want the shortest useful version, attention is a learned routing rule for context. A token asks what matters to it, other tokens advertise what they contain, and the model learns how much information should flow from one position to another.

That simple sentence hides a lot of machinery, but it is the right place to start. Modern transformers work because they can repeatedly update token representations using context, in parallel, instead of pushing the whole sequence through a recurrent bottleneck one step at a time.[1][2]

And interestingly, the paper that made transformers unavoidable was a translation paper. That matters more than people sometimes realize, because cross-attention is fundamentally an alignment problem between one sequence and another.


Before transformers

Before the transformer revolution, most sequence modeling lived in recurrent systems like LSTMs and GRUs, typically arranged in sequence-to-sequence (seq2seq) encoder-decoder pipelines. Those systems absolutely mattered, and they worked, but recurrence made long-range dependencies and large-scale parallel training painful enough that the field was hungry for something cleaner.[1][2]

The core intuition

Grant Sanderson's explanation is a good one: each token begins life as an embedding, a high-dimensional vector that represents the token before much context has been absorbed.[2] Attention then lets the model refine those embeddings by deciding which other tokens are relevant. In the standard formulation, each token is projected into a query, a key, and a value. The query is what this token is looking for. The key is what this token offers. The value is the information it can pass along if it is selected.

The model compares queries and keys with dot products, turns those scores into weights with a softmax, and uses the resulting attention pattern to build weighted sums of value vectors.[1][2][3] That is the mechanism. In plainer language: find what matters, normalize the importance, then move the right information to the right place.
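That mechanism is compact enough to sketch directly. Here is a minimal NumPy version of scaled dot-product attention, with random vectors standing in for learned projections; the function names and toy shapes are illustrative, not from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # query-key similarity, one row per query
    weights = softmax(scores, axis=-1)   # each row of weights sums to 1
    return weights @ V                   # weighted sum of value vectors

# Toy example: 3 tokens, a 4-dimensional head.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = attention(Q, K, V)
print(out.shape)  # (3, 4): one refined vector per token
```

The sqrt(d_k) scaling keeps dot products from growing with dimension and saturating the softmax, which is why the scores are divided before normalizing.[1]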

Self-attention, masking, and next-token prediction

When queries, keys, and values all come from the same sequence, you get self-attention. In language modeling, that self-attention is usually masked so later tokens cannot leak information backward into earlier positions. The model is allowed to use the past to predict the future, not the future to cheat about the past.[1][2]

This is one reason attention feels so different from older recurrent architectures. The model can look across the whole visible context window at once, but the mask still preserves the causal structure needed for next-token prediction.
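The mask itself is just additive negative infinity on forbidden positions before the softmax. A minimal sketch, assuming the same toy setup as above (random vectors in place of learned projections):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Self-attention where token i may only attend to positions <= i."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    n = scores.shape[0]
    # Mark future positions (column j > row i) and push them to -inf,
    # so the softmax assigns them exactly zero weight.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 8))            # one sequence supplies Q, K, and V
out, weights = causal_attention(X, X, X)
# The attention pattern is lower-triangular: no information flows backward.
print(np.round(weights, 2))
```

Because the mask is applied to scores rather than outputs, each earlier token still gets a full probability distribution over its visible past.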

Cross-attention and why translation is so central

Cross-attention is what happens when the queries come from one sequence and the keys and values come from another. In the original transformer setup, that meant the decoder could ask which parts of the source-language sentence mattered while generating the target-language sentence.[1] That is fundamentally a translation problem: align what is being produced now with the most relevant information from a different sequence.

This is why the original paper's translation framing still matters. Cross-attention is not some side mechanic that happened to be around at launch. It is one of the clearest expressions of what transformers are good at: learning correspondences between structured streams. That shows up in machine translation, speech translation, transcription, multimodal models, and any system where one representation needs to stay anchored to another.[1][2]
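The only change from self-attention is where Q, K, and V come from. A hedged sketch of decoder-to-encoder cross-attention, with random states standing in for real encoder and decoder outputs:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states):
    """Queries from the decoder; keys and values from the encoder."""
    Q = decoder_states             # (m, d): what each target token is looking for
    K = V = encoder_states         # (n, d): what each source token offers/carries
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (m, n) alignment matrix
    return weights @ V, weights

rng = np.random.default_rng(2)
source = rng.normal(size=(6, 8))   # e.g. 6 source-language token states
target = rng.normal(size=(4, 8))   # e.g. 4 target tokens generated so far
out, align = cross_attention(target, source)
print(out.shape, align.shape)      # (4, 8) (4, 6)
```

The (m, n) weight matrix is exactly the alignment the translation framing suggests: row i says how much each source position contributes to target position i.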

Multi-head attention and why the model uses many views

A single attention pattern can only express one style of relevance at a time. Multi-head attention runs many such patterns in parallel so the model can learn different relationships simultaneously: syntax, coreference, long-range dependence, entity matching, positional cues, and plenty of behaviors that are harder to name cleanly.[1][2][3]

That does not mean each head has a neat human-readable job description. It means the architecture gives the model enough expressive room to learn multiple contextual update rules at once.
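Mechanically, "many views" means splitting the model dimension into per-head slices, attending in each slice independently, then concatenating and mixing. A minimal sketch, with random matrices standing in for the learned projections W_q, W_k, W_v, W_o:

```python
import numpy as np

def multi_head_self_attention(X, num_heads, rng):
    """Split d_model into heads, run attention per head in parallel, re-merge."""
    n, d_model = X.shape
    d_head = d_model // num_heads
    # Random projections stand in for learned weight matrices here.
    Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * d_model ** -0.5
                      for _ in range(4))

    def split(M):                      # (n, d_model) -> (heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # one pattern per head
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    heads = w @ Vh                                         # (heads, n, d_head)
    # Concatenate head outputs and mix them with the output projection.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ Wo

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 16))
out = multi_head_self_attention(X, num_heads=4, rng=rng)
print(out.shape)  # (5, 16): same shape in and out, as a residual stream needs
```

Each head gets its own slice of the projected space, so each can learn a different notion of relevance at the same cost as one full-width attention.[1]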

Why attention won

A large part of the transformer story is not just that attention is expressive. It is that attention is extremely parallelizable. The architecture maps well to GPU-heavy training, which means scale can be pushed much harder than in many recurrent systems.[1][2] Once the field saw what that parallelism could buy, the rest of the stack reorganized around it.

That does not mean attention is magic or that every attention map is a faithful explanation of model thought. It means the mechanism gave the field a practical way to represent context at scale, and that changed everything.


Sources
  1. Vaswani et al., Attention Is All You Need. https://proceedings.neurips.cc/paper/7181-attention-is-all-you-need - Original transformer paper introducing the attention-first encoder-decoder architecture and benchmark translation results that changed the field.
  2. Grant Sanderson, Neural Networks: Transformers, chapter 6. https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi - Clear visual explanation of embeddings, attention patterns, masking, multi-head attention, and cross-attention.
  3. Stefan Schneider, Understanding Transformers and Attention. https://medium.com/@stefanbschneider/understanding-attention-and-transformers-d84b016cd352 - Accessible article covering encoder-decoder structure, positional encoding, self-attention, and residual flow.
  4. hkproj, Transformer From Scratch Notes Diagrams. https://github.com/hkproj/transformer-from-scratch-notes/blob/main/Diagrams_V2.pdf - Diagram set helpful for visualizing transformer blocks and data flow.