The Transformer architecture is inherently permutation-equivariant: if you shuffle the words in a sentence, the self-attention mechanism produces exactly the same result for each word, just in a different order. To a Transformer, "The dog bit the man" and "The man bit the dog" are identical unless we provide a way to distinguish the position of each token.
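To see this concretely, here is a minimal NumPy sketch (the dimensions and random weight matrices are purely illustrative): running single-head attention on a shuffled copy of the same token embeddings returns the same vectors, just reordered.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention with no positional information."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
d = 8                                    # toy embedding dimension
X = rng.normal(size=(5, d))              # 5 "token embeddings", no positions added
Wq = rng.normal(size=(d, d))
Wk = rng.normal(size=(d, d))
Wv = rng.normal(size=(d, d))

perm = rng.permutation(5)                # shuffle the "sentence"
out = self_attention(X, Wq, Wk, Wv)
out_shuffled = self_attention(X[perm], Wq, Wk, Wv)

# Each token receives the same output vector either way; only the order differs.
assert np.allclose(out[perm], out_shuffled)
print("Shuffling the tokens only shuffles the outputs.")
```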
The Signal Injection
Since Transformers do not have recurrence (like RNNs) to track time, they use Positional Encodings. These are vectors of the same dimension as the input embeddings, which are added directly to the embeddings before they enter the first layer.
This "injects" information about the token's location into its representation without changing the actual meaning of the word.
Sine and Cosine Functions
The original Transformer paper used a specific pattern of sine and cosine waves at different frequencies to create these encodings:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

This choice is mathematically elegant because for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$. This allows the model to easily learn to attend to relative positions (e.g., "the word 3 places to my left").
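Here is a short NumPy sketch of these formulas, plus a numerical check of the linear-offset property: each (sin, cos) pair at position $pos + k$ is a fixed 2x2 rotation of the pair at position $pos$, where the rotation depends only on $k$. (The sizes below are illustrative.)

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Sinusoidal positional encodings from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dim = np.arange(0, d_model, 2)[None, :]      # even dimension indices 2i
    angles = pos / 10000 ** (dim / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dims get sine
    pe[:, 1::2] = np.cos(angles)                 # odd dims get cosine
    return pe

seq_len, d_model, k = 50, 8, 3                   # k is the fixed offset
pe = sinusoidal_pe(seq_len, d_model)

# Check: each (sin, cos) pair at position pos+k equals the pair at pos
# multiplied by a 2x2 rotation matrix that depends on k but NOT on pos.
for dim in range(0, d_model, 2):
    theta = k / 10000 ** (dim / d_model)         # rotation angle for this pair
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    assert np.allclose(pe[:-k, dim:dim + 2] @ R, pe[k:, dim:dim + 2])
```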
Absolute vs. Relative Encoding
- Absolute Positional Encoding: Every position (1, 2, 3...) has a unique, fixed vector. This is simple but struggles with sequences longer than those seen during training.
- Relative Positional Encoding: Instead of fixed labels, the model learns the distance between tokens. This generalizes better to varying sequence lengths and is the basis for advanced techniques like ALiBi and RoPE used in state-of-the-art Large Language Models.
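As a rough illustration of the relative approach, here is a sketch of an ALiBi-style bias. ALiBi adds no position vectors at all; instead it subtracts a penalty proportional to the query-key distance directly from the attention scores, with one slope per head. (The geometric slope schedule follows the ALiBi paper; treat the function name and sizes as illustrative.)

```python
import numpy as np

def alibi_bias(seq_len, n_heads):
    """ALiBi-style bias: penalize attention scores linearly with distance."""
    # Head-specific slopes; the ALiBi paper uses a geometric schedule 2^(-8h/n).
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    i = np.arange(seq_len)[:, None]              # query positions
    j = np.arange(seq_len)[None, :]              # key positions
    distance = i - j                             # how far back the key sits
    # (n_heads, seq_len, seq_len); in a causal model, j > i is masked anyway.
    return -slopes[:, None, None] * distance[None, :, :]

bias = alibi_bias(seq_len=6, n_heads=4)
# Usage: scores = Q @ K.T / sqrt(d_k) + bias[h]  -- no position vectors needed,
# so nothing ties the model to the sequence lengths seen in training.
```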
Without positional encoding, a Transformer-based model would lack the fundamental sense of structure, grammar, and order that makes language coherent.
"Positional encodings allow a parallel architecture to perceive sequence order by injecting a unique, predictable signal into each token's embedding."
Frequently Asked Questions
Why not just use a simple counter (1, 2, 3...) for positions?
A raw counter grows without bound: at position 5,000 the position value would dwarf the embedding it is added to, and the model would face magnitudes at inference time that it never saw during training. Normalizing the counter to [0, 1] does not help either, because the step between adjacent positions would then depend on the sequence length. The sinusoidal scheme keeps every value bounded in [-1, 1] while still giving each position a unique signature.

What is RoPE?
RoPE (Rotary Position Embedding) encodes position by rotating each pair of query and key dimensions by an angle proportional to the token's position, rather than by adding a vector. Because the attention dot product then depends only on the angle difference, it captures relative position directly, which is one reason it is widely used in modern Large Language Models.
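For the curious, a minimal sketch of the rotation RoPE applies (the function name and shapes are illustrative):

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate each (even, odd) dimension pair of x by an angle
    proportional to the token's position in the sequence."""
    seq_len, d = x.shape
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    freqs = base ** (-np.arange(0, d, 2) / d)    # one frequency per pair
    theta = pos * freqs                          # (seq_len, d // 2)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * np.cos(theta) - x[:, 1::2] * np.sin(theta)
    out[:, 1::2] = x[:, 0::2] * np.sin(theta) + x[:, 1::2] * np.cos(theta)
    return out

# Applied to queries and keys (not values): after the rotation, the dot
# product rope(q)[i] @ rope(k)[j] depends only on the offset i - j.
```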
The author of this article utilized generative AI (Google Gemini 3.1 Pro) to assist in part of the drafting and editing process.