The self-attention mechanism at the core of the Transformer is permutation-equivariant. This means that if you shuffle the words in a sentence, self-attention produces exactly the same output vector for each word, just in the shuffled order. Without position information, "The dog bit the man" and "The man bit the dog" contain the same set of tokens, so the model cannot tell them apart unless we provide a way to distinguish the position of each token.
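
The shuffling behavior described above can be checked numerically. The following is a minimal sketch, assuming a single attention head with random weights and no positional information; the function name `self_attention` and all sizes are illustrative choices, not taken from any particular library.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention with no positional signal."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8  # illustrative sizes
x = rng.normal(size=(seq_len, d_model))  # stand-in token embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

perm = rng.permutation(seq_len)  # a shuffled word order
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)

# Shuffling the input only shuffles the output rows in the same way.
is_equivariant = np.allclose(out[perm], out_shuffled)
```

Here `is_equivariant` comes out true: each token receives the same vector regardless of where it sits in the sequence, which is exactly the problem positional encodings address.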

The Signal Injection

Because Transformers have no recurrence (unlike RNNs) to track token order, they rely on positional encodings. These are vectors with the same dimension as the token embeddings, and they are added element-wise to the embeddings before the first layer.

$\text{Input to Transformer} = \text{Token Embedding} + \text{Positional Encoding}$

This "injects" information about the token's location into its representation without changing the actual meaning of the word.
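
As a rough illustration of the addition above, the sketch below uses random tables for both the token embeddings and the positional encodings. The table sizes, the token ids, and the random positional table are all assumptions made for the example; a real model would use learned or sinusoidal values.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d_model = 10, 16, 8  # illustrative sizes

token_embedding = rng.normal(size=(vocab_size, d_model))
positional_encoding = rng.normal(size=(max_len, d_model))  # placeholder values

# Hypothetical ids for "the dog bit the man"; "the" (id 0) appears twice.
token_ids = np.array([0, 1, 2, 0, 3])

# Input to Transformer = Token Embedding + Positional Encoding
x = token_embedding[token_ids] + positional_encoding[: len(token_ids)]

# Both occurrences of "the" share one embedding row...
same_embedding = np.allclose(token_embedding[token_ids[0]],
                             token_embedding[token_ids[3]])
# ...but their inputs to the first layer differ, because the positions differ.
same_input = np.allclose(x[0], x[3])
```

In this sketch `same_embedding` is true while `same_input` is false: the word is the same, but the model now sees where each copy occurred.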

Sine and Cosine Functions

The original Transformer paper used a specific pattern of sine and cosine waves at different frequencies to create these encodings:

$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$

$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$
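
The two formulas above translate fairly directly into NumPy. This is a minimal sketch, assuming an even $d_{model}$ and zero-based positions; the function name `sinusoidal_positional_encoding` is a label chosen for this example.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Build a (max_len, d_model) table from the sine/cosine formulas."""
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    positions = np.arange(max_len)[:, None]        # pos, shape (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]      # 2i,  shape (1, d_model / 2)
    angles = positions / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions: PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)  # odd dimensions:  PE(pos, 2i+1)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
```

Low dimension indices oscillate quickly with position and high indices oscillate slowly, so each row of `pe` combines fine and coarse position signals.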

This choice is mathematically elegant because for any fixed offset $k$, $PE_{pos+k}$ can be written as a linear transformation of $PE_{pos}$ whose matrix depends only on $k$, not on $pos$. The authors hypothesized that this would make it easy for the model to learn to attend by relative position (e.g., "the word 3 places to my left").
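
The linear relationship follows from the angle-addition identities: each (sine, cosine) pair is rotated by an angle that depends only on the offset. The sketch below checks this numerically for one choice of `d_model`, `pos`, and `k`; the specific values and the helper name `encode` are assumptions for illustration.

```python
import numpy as np

d_model, pos, k = 16, 7, 3  # illustrative values
two_i = np.arange(0, d_model, 2)
freqs = 1.0 / np.power(10000.0, two_i / d_model)  # one frequency per pair

def encode(p):
    """Sinusoidal encoding of a single position p."""
    pe = np.empty(d_model)
    pe[0::2] = np.sin(p * freqs)
    pe[1::2] = np.cos(p * freqs)
    return pe

# Block-diagonal matrix built from k alone; pos never appears here.
m_k = np.zeros((d_model, d_model))
for j, w in enumerate(freqs):
    c, s = np.cos(k * w), np.sin(k * w)
    # sin((p+k)w) =  cos(kw) sin(pw) + sin(kw) cos(pw)
    # cos((p+k)w) = -sin(kw) sin(pw) + cos(kw) cos(pw)
    m_k[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]

offset_is_linear = np.allclose(m_k @ encode(pos), encode(pos + k))
```

The same `m_k` maps any position to the one `k` steps later, which is why `offset_is_linear` holds regardless of the `pos` chosen.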

Absolute vs. Relative Encoding

Absolute Positional Encoding: Every position (1, 2, 3, ...) is assigned its own vector, either fixed (as with the sinusoids above) or learned. This is simple but often generalizes poorly to sequences longer than those seen during training.
Relative Positional Encoding: Instead of labeling each position, the model conditions attention on the distance between tokens. This tends to generalize better to varying sequence lengths and is the idea behind techniques such as ALiBi and RoPE, which are used in many recent Large Language Models.
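
To make the relative idea concrete, here is a simplified sketch in the spirit of ALiBi: a penalty proportional to token distance is added to the attention scores before the softmax. The single `slope` value and the function names are assumptions for this example; ALiBi itself uses a fixed set of per-head slopes.

```python
import numpy as np

def relative_distance(seq_len):
    """Matrix whose entry [i, j] is j - i, the signed offset from query i to key j."""
    idx = np.arange(seq_len)
    return idx[None, :] - idx[:, None]

def alibi_style_bias(seq_len, slope):
    """Causal linear bias: 0 on the diagonal, more negative for tokens farther back."""
    dist = relative_distance(seq_len)
    bias = slope * dist.astype(float)  # j - i <= 0 for past tokens
    bias[dist > 0] = -np.inf           # future tokens are masked out
    return bias

bias = alibi_style_bias(seq_len=4, slope=0.5)  # slope is an assumed value
longer = alibi_style_bias(seq_len=6, slope=0.5)

# The bias depends only on distance, so a longer sequence reuses the same values.
prefix_matches = np.array_equal(longer[:4, :4], bias)
```

Because the bias is a function of `j - i` alone, nothing in it is tied to a maximum length, which is the intuition behind the better length generalization mentioned above.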

Without positional encoding, a Transformer would treat a sentence as an unordered bag of tokens and would lack the sense of structure, grammar, and time that makes language coherent.
