Self-attention is the core engine of the Transformer architecture. It provides a mechanism for a model to "look" at other tokens in an input sequence to better understand a specific token's context. If a model is processing the word "bank," self-attention determines whether it refers to a river edge or a financial institution by weighing its relationship with surrounding words like "water" or "money."
The QKV Abstraction
To perform self-attention, the model projects each input embedding into three distinct vectors using learned weight matrices (W_Q, W_K, W_V):
- Query (Q): Represents the current token seeking information.
- Key (K): Represents the information "tags" or labels of all tokens in the sequence.
- Value (V): Represents the actual content associated with those tags.
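The three projections above can be sketched in a few lines of NumPy. The sizes and the random initialization here are toy values chosen purely for illustration; in a trained model, W_Q, W_K, and W_V are learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model, d_k = 4, 8, 8              # toy sizes (assumed for illustration)
X = rng.standard_normal((seq_len, d_model))  # input token embeddings

# Learned projection matrices (randomly initialized here for the sketch)
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

Q = X @ W_Q  # queries: what each token is looking for
K = X @ W_K  # keys: the "tag" each token exposes to be matched against
V = X @ W_V  # values: the content each token contributes

print(Q.shape, K.shape, V.shape)  # each is (seq_len, d_k)
```

Note that all three vectors are derived from the same input sequence, which is what makes the attention "self"-attention.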
The Mathematical Operation
The attention score is calculated by taking the dot product of the Query with all Keys. This measures the "compatibility" or relevance of every other token to the current one. The process follows the Scaled Dot-Product Attention formula:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

where d_k is the dimensionality of the Key vectors.
The result of the softmax is an attention map: a set of weights between 0 and 1 that sum to 1. These weights are then used to compute a weighted sum of the Values. If a specific "Key" is highly relevant to the "Query," its corresponding "Value" will dominate the final output representation.
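A minimal NumPy sketch of this operation follows; the function name and toy inputs are illustrative, not part of any particular library. The max-subtraction before exponentiating is a standard numerical-stability trick for softmax.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # query-key compatibility
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V, weights                   # weighted sum of Values

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out, attn_map = scaled_dot_product_attention(Q, K, V)

assert np.allclose(attn_map.sum(axis=-1), 1.0)  # a valid attention map
```

Each row of `attn_map` is one token's distribution over the whole sequence, and `out` holds that token's context-aware representation.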
Multi-Head Attention
In practice, models use Multi-Head Attention. Instead of calculating one large attention pass, the model splits the vectors into multiple "heads." Each head operates in a different subspace, allowing the model to simultaneously focus on different types of relationships, such as syntactic structure (grammar) in one head and semantic meaning (intent) in another.
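The head-splitting can be sketched by reshaping Q, K, and V so that attention runs independently in each subspace. This is a simplified sketch: a full implementation would also include a learned output projection after the heads are concatenated, which is omitted here.

```python
import numpy as np

def split_heads(x, n_heads):
    # (seq_len, d_model) -> (n_heads, seq_len, d_head)
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    return x.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

def multi_head_attention(Q, K, V, n_heads):
    Qh, Kh, Vh = (split_heads(m, n_heads) for m in (Q, K, V))
    d_head = Qh.shape[-1]
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # per-head scores
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)        # per-head attention maps
    out = w @ Vh                              # (n_heads, seq_len, d_head)
    # Concatenate heads back into (seq_len, d_model)
    return out.transpose(1, 0, 2).reshape(Q.shape)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = multi_head_attention(Q, K, V, n_heads=2)
```

Because each head works on a slice of dimension d_model / n_heads, the total compute is roughly the same as one full-width attention pass.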
Computational Complexity
A critical characteristic of self-attention is its O(n²) complexity, where n is the sequence length. Because every token must be compared against every other token, the memory and compute requirements grow quadratically. This bottleneck is the primary driver behind research into "Sparse Attention" and "Linear Transformers," which attempt to achieve similar global context with lower overhead.
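The quadratic growth is easy to see by counting the entries of the n-by-n attention map. This back-of-the-envelope helper ignores batch size and head count, and assumes float32 (4 bytes per entry).

```python
def attention_map_bytes(n, dtype_bytes=4):
    """Memory for a single (n x n) float32 attention map."""
    return n * n * dtype_bytes

for n in (1_024, 2_048, 4_096):
    print(f"n={n}: {attention_map_bytes(n) / 2**20:.1f} MiB")
# n=1024: 4.0 MiB, n=2048: 16.0 MiB, n=4096: 64.0 MiB
# Doubling the sequence length quadruples the memory.
```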
"Self-attention transforms static embeddings into dynamic, context-aware representations by calculating the compatibility between every token pair in a sequence."
Frequently Asked Questions
What is the 'Scale' in Scaled Dot-Product Attention?
The scale is the division of the raw dot-product scores by the square root of the Key dimension. Without it, large dot products push the softmax into regions with near-zero gradients, destabilizing training.
Does self-attention have a fixed 'look-back' window?
No. Every token attends to every other token, so the effective window is the full context length; the practical limit comes from the quadratic cost described above.
The author of this article utilized generative AI (Google Gemini 3.1 Pro) to assist in part of the drafting and editing process.