Neural networks cannot "see" letters; they only understand numbers. Tokenization is the critical first step in the AI pipeline: the process of breaking raw text into a sequence of discrete units called tokens, which are then mapped to unique numerical IDs.
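To make this concrete, here is a toy sketch in Python. The vocabulary, sentence, and IDs are all made up for illustration, but the lookup from token to integer is the core idea.

```python
# Toy sketch (made-up vocabulary and IDs): text is split into tokens,
# and each token is looked up in a table of integer IDs.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, ".": 5}

text = "the cat sat on the mat ."
tokens = text.split()             # naive whitespace tokenization
ids = [vocab[t] for t in tokens]  # map each token to its integer ID

print(tokens)  # ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']
print(ids)     # [0, 1, 2, 3, 0, 4, 5]
```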
The Subword Revolution
Early AI models used word-level tokenization. However, if the model encountered a word it had not seen before (an "Out of Vocabulary," or OOV, word), it had no way to represent it. Modern models instead use Byte Pair Encoding (BPE) or similar subword algorithms.
Instead of treating "unbelievable" as one token, the tokenizer might break it into "un", "believ", and "able". This allows the model to understand the meaning of new words by looking at their constituent parts.
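The sketch below is a deliberately simplified illustration, not real BPE: it uses a tiny hand-made vocabulary and a greedy longest-match rule, whereas production tokenizers learn their subword merges from large text corpora. It only shows how a word the model has never seen can still be covered by known pieces.

```python
# Simplified illustration of subword segmentation (NOT real BPE training):
# greedily match the longest known piece at each position.
TOY_VOCAB = {"un", "believ", "able", "do", "ing"}

def segment(word: str) -> list[str]:
    pieces, i = [], 0
    while i < len(word):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in TOY_VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No entry matches: fall back to a single character,
            # much as real tokenizers fall back to raw bytes.
            pieces.append(word[i])
            i += 1
    return pieces

print(segment("unbelievable"))  # ['un', 'believ', 'able']
print(segment("undoing"))       # ['un', 'do', 'ing']
```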
The Vocabulary
The "Vocabulary" of a model is a fixed list of all the tokens it knows. GPT-4, for example, has a vocabulary of roughly 100,000 tokens. Every token in this list is assigned a unique integer ID. During inference, the model predicts the ID of the next token, which the tokenizer then converts back into the text you see on the screen.
Efficiency and Cost
Tokenization directly impacts the cost and speed of AI. Since LLMs have a "context window" (a limit on how many tokens they can process at once), a more efficient tokenizer that uses fewer tokens to represent the same text allows the model to "remember" more information. This is why different models (Gemini vs. Llama) might have different costs for the same paragraph of text.
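One way to see this is to count the tokens a tokenizer needs for the same sentence. The sketch below compares two encodings that ship with tiktoken (the sentence is arbitrary); the exact counts will vary with the text, but different tokenizers typically produce different counts for identical input.

```python
# Sketch: the same text costs a different number of tokens
# under different tokenizers.
import tiktoken

text = "Tokenization directly impacts the cost and speed of large language models."

for name in ("gpt2", "cl100k_base"):
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(text))} tokens")
```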
"A token is the smallest unit of meaning a model understands; it is rarely a whole word and often a fragment, prefix, or punctuation mark."
Frequently Asked Questions
Is one token always one word?
No. Common short words are often single tokens, but longer or rarer words are usually split into several subword tokens, and punctuation marks are typically tokens of their own.
Why don't we just use characters?
Character-level tokenization avoids the out-of-vocabulary problem, but it produces far more tokens for the same text, which wastes the limited context window and slows the model down. Subword tokenization is a compromise between the two extremes.
The author of this article used generative AI (Google Gemini 3.1 Pro) to assist with parts of the drafting and editing process.