How Does the Self-Attention Mechanism Work?

By EulerFold / April 27, 2026

Self-attention is the core engine of the Transformer architecture. It provides a mechanism for a model to "look" at other tokens in an input sequence to better understand a specific token's context. If a model is processing the word "bank," self-attention determines whether it refers to a river edge or a financial institution by weighing its relationship with surrounding words like "water" or "money."

Figure: the token embedding x_i is projected by W^Q, W^K, and W^V into Query (Q), Key (K), and Value (V); the dot product Q · Kᵀ is scaled by √d_k, passed through a softmax to produce attention weights, and the weighted sum Σ weight_i · V_i yields the context vector z_i.

The QKV Abstraction

To perform self-attention, the model projects each input embedding into three distinct vectors using learned weight matrices (W^Q, W^K, W^V); a minimal sketch of these projections follows the list below:

  1. Query (Q): Represents the current token seeking information.
  2. Key (K): Represents the information "tags" or labels of all tokens in the sequence.
  3. Value (V): Represents the actual content associated with those tags.
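
The projection step itself is just three matrix multiplications. Here is a minimal NumPy sketch with toy dimensions; the array names, sizes, and random weights are illustrative assumptions only (in a real model the W matrices are learned during training):

```python
import numpy as np

seq_len, d_model, d_k = 4, 8, 8            # toy sizes: 4 tokens, 8-dim embeddings
rng = np.random.default_rng(0)

x = rng.normal(size=(seq_len, d_model))    # input token embeddings

# Projection matrices (random here; learned by gradient descent in a real model)
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = x @ W_q    # queries: what each token is looking for
K = x @ W_k    # keys: how each token advertises its content
V = x @ W_v    # values: the content each token actually carries

print(Q.shape, K.shape, V.shape)           # (4, 8) (4, 8) (4, 8)
```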

The Mathematical Operation

The attention score is calculated by taking the dot product of the Query with all Keys. This measures the "compatibility" or relevance of every other token to the current one. The process follows the Scaled Dot-Product Attention formula:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

The result of the softmax is an attention map: a set of weights between 0 and 1 that sum to 1. These weights are then used to compute a weighted sum of the Values. If a specific "Key" is highly relevant to the "Query," its corresponding "Value" will dominate the final output representation.
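
To make the formula concrete, here is a minimal, self-contained NumPy sketch of scaled dot-product attention; the random Q, K, V matrices and toy sizes are assumptions for illustration, not values from a trained model:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # QK^T / sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)     # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys: each row sums to 1
    return weights @ V, weights                      # weighted sum of values, plus the attention map

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))    # 4 tokens, d_k = 8 (toy sizes)
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

context, attn_map = scaled_dot_product_attention(Q, K, V)
print(attn_map.sum(axis=-1))   # each row of the attention map sums to 1.0
print(context.shape)           # (4, 8): one context vector per token
```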

Multi-Head Attention

In practice, models use Multi-Head Attention. Instead of calculating one large attention pass, the model splits the Q, K, V vectors into multiple "heads." Each head operates in a different subspace, allowing the model to simultaneously focus on different types of relationships, such as syntactic structure (grammar) in one head and semantic meaning (intent) in another.
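
The sketch below illustrates the head-splitting idea with toy sizes (4 heads over a 16-dimensional model; both numbers are arbitrary assumptions). Each head attends in its own subspace, and the per-head outputs are concatenated back together:

```python
import numpy as np

seq_len, d_model, n_heads = 4, 16, 4
d_head = d_model // n_heads                       # each head works in a 4-dim subspace

rng = np.random.default_rng(0)
Q = rng.normal(size=(seq_len, d_model))
K = rng.normal(size=(seq_len, d_model))
V = rng.normal(size=(seq_len, d_model))

def split_heads(t):
    # (seq_len, d_model) -> (n_heads, seq_len, d_head)
    return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

Qh, Kh, Vh = split_heads(Q), split_heads(K), split_heads(V)

# Scaled dot-product attention, run independently in every head's subspace
scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
heads = weights @ Vh                              # (n_heads, seq_len, d_head)

# Concatenate the heads back into a single representation per token
out = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
print(out.shape)                                  # (4, 16)
```

In a full Transformer block, the concatenated output is passed through one more learned projection (often written W^O) before the residual connection; that step is omitted here for brevity.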

Computational Complexity

A critical characteristic of self-attention is its O(n^2) complexity, where n is the sequence length. Because every token must be compared against every other token, the memory and compute requirements grow quadratically. This bottleneck is the primary driver behind research into "Sparse Attention" and "Linear Transformers," which attempt to achieve similar global context with lower overhead.
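
A quick back-of-the-envelope calculation shows the quadratic growth; the sequence lengths and the single-head float32 assumption are illustrative only:

```python
# The attention map has one score per (query, key) pair, so its size is n * n.
for n in (1_000, 10_000, 100_000):
    scores = n * n
    mem_gb = scores * 4 / 1e9      # float32 bytes -> GB, for a single attention head
    print(f"n={n:>7,}: {scores:.1e} scores, ~{mem_gb:.3g} GB per head")
```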

"Self-attention transforms static embeddings into dynamic, context-aware representations by calculating the compatibility between every token pair in a sequence."

Frequently Asked Questions

What is the 'Scale' in Scaled Dot-Product Attention?
The dot product of Queries and Keys is divided by the square root of the dimension of the keys (√d_k). This prevents the scores from growing too large, which would cause the softmax function to have extremely small gradients during training.
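
As a rough illustration (the example scores below are made up), compare the softmax of small scores with the softmax of the same scores multiplied by 10:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5, -1.0])
print(softmax(scores))         # fairly smooth distribution over the four tokens
print(softmax(scores * 10))    # large-magnitude scores: almost all mass on one token
```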
Does self-attention have a fixed 'look-back' window?
No. Unlike convolutional or recurrent layers, self-attention has a global receptive field, meaning it can relate any two positions in a sequence regardless of their distance.

The author of this article utilized generative AI (Google Gemini 3.1 Pro) to assist in part of the drafting and editing process.