Why Do Transformers Need Positional Encoding?

By EulerFold / April 27, 2026

The Transformer architecture is inherently permutation-invariant. This means that if you shuffle the words in a sentence, the self-attention mechanism will produce the exact same results for each word, just in a different order. To a Transformer, "The dog bit the man" and "The man bit the dog" are identical unless we provide a way to distinguish the position of each token.

[Figure: a token embedding and its positional encoding are combined at a summation node, and the resulting "Token + Position" vector is fed into Encoder Layer 1 of the Transformer stack.]

The Signal Injection

Since Transformers do not have recurrence (like RNNs) to track time, they use Positional Encodings. These are vectors of the same dimension as the input embeddings, which are added directly to the embeddings before they enter the first layer.

$$\text{Input to Transformer} = \text{Token Embedding} + \text{Positional Encoding}$$

This "injects" information about the token's location into its representation without changing the actual meaning of the word.

Sine and Cosine Functions

The original Transformer paper used a specific pattern of sine and cosine waves at different frequencies to create these encodings:

$$PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{\text{model}}}\right)$$

$$PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{\text{model}}}\right)$$

This choice is mathematically elegant because for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$. This allows the model to easily learn to attend to relative positions (e.g., "the word 3 places to my left").
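
Here is a short NumPy sketch of these two formulas (the function and variable names are mine, not the paper's; `d_model` is assumed to be even):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Build the (max_len, d_model) matrix of sinusoidal encodings."""
    positions = np.arange(max_len)[:, None]                  # pos, as a column
    dims = np.arange(0, d_model, 2)[None, :]                 # the even indices 2i
    angles = positions / np.power(10000.0, dims / d_model)   # pos / 10000^(2i/d_model)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16)
```

Each pair of dimensions oscillates at its own frequency, with wavelengths forming a geometric progression from a few tokens up to tens of thousands of tokens, which is what gives every position a distinct fingerprint.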

Absolute vs. Relative Encoding

  • Absolute Positional Encoding: Every position (1, 2, 3...) has a unique, fixed vector. This is simple but struggles with sequences longer than those seen during training.
  • Relative Positional Encoding: Instead of fixed labels, the model learns the distance between tokens. This generalizes better to varying sequence lengths and is the basis for advanced techniques like ALiBi and RoPE used in state-of-the-art Large Language Models (a rough sketch of this idea follows this list).
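
To make the "distance, not address" idea concrete, here is a rough, non-causal sketch of an ALiBi-style bias; the geometric slope schedule shown is the commonly cited one and assumes the head count is a power of two:

```python
import numpy as np

def alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """Per-head additive bias that grows more negative with query-key distance."""
    # One slope per head, a geometric sequence (assumes num_heads is a power of two).
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])

    # distance[i, j] = |i - j|, how far apart query i and key j are.
    positions = np.arange(seq_len)
    distance = np.abs(positions[:, None] - positions[None, :])

    # Added to the raw attention scores before softmax: nearby tokens are
    # penalized less than distant ones, so only relative distance matters.
    return -slopes[:, None, None] * distance[None, :, :]

bias = alibi_bias(seq_len=8, num_heads=4)
print(bias.shape)  # (num_heads, seq_len, seq_len) -> (4, 8, 8)
```

Because the bias depends only on |i − j|, the same function applies unchanged to sequences longer than anything seen during training.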

Without positional encoding, a Transformer would lack the basic sense of word order, structure, and grammar that makes language coherent.

"Positional encodings allow a parallel architecture to perceive sequence order by injecting a unique, predictable signal into each token's embedding."

Frequently Asked Questions

Why not just use a simple counter (1, 2, 3...) for positions?
Counters can grow very large, leading to numerical instability. If normalized to [0,1], the distance between positions would vary depending on the sequence length. Sine and cosine functions provide a bounded, consistent signal regardless of length.
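
A tiny illustration of that second point (sequence lengths chosen arbitrarily):

```python
# Normalizing positions to [0, 1] makes the gap between neighboring tokens
# depend on how long the sentence happens to be.
short_seq = [i / 9 for i in range(10)]     # neighbors are ~0.111 apart
long_seq = [i / 99 for i in range(100)]    # neighbors are ~0.0101 apart
print(short_seq[1] - short_seq[0], long_seq[1] - long_seq[0])
```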
What is RoPE?
Rotary Positional Embedding (RoPE) is a modern alternative that encodes relative position by rotating the Query and Key vectors in a complex plane. It is used in models like Llama for better long-context performance.
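
As a minimal sketch of that rotation on a single vector (names are illustrative; production implementations such as Llama's work on batched tensors and may pair dimensions differently):

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive feature pairs of x by angles that grow with position."""
    d = x.shape[-1]                               # head dimension, assumed even
    freqs = base ** (-np.arange(0, d, 2) / d)     # one frequency per feature pair
    angles = pos * freqs                          # rotation angles for this position

    x_even, x_odd = x[0::2], x[1::2]              # each (even, odd) pair acts like a complex number
    rotated = np.empty_like(x)
    rotated[0::2] = x_even * np.cos(angles) - x_odd * np.sin(angles)
    rotated[1::2] = x_even * np.sin(angles) + x_odd * np.cos(angles)
    return rotated

q = np.random.randn(8)               # a single query vector for one head
q_rotated = rope_rotate(q, pos=5)    # keys get the same treatment at their own positions
```

Because rotating a query by its position and a key by its position changes their dot product only through the difference of the two angles, the attention score ends up depending only on how far apart the tokens are.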

The author of this article utilized generative AI (Google Gemini 3.1 Pro) to assist in part of the drafting and editing process.