Neural networks cannot "see" letters; they operate only on numbers. Tokenization is the critical first step in the AI pipeline: it is the process of breaking raw text into a sequence of discrete units called tokens, which are then mapped to unique numerical IDs.

The Subword Revolution

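Early AI models used word-level tokenization, where every distinct word is its own token. However, if the model encountered a word it hadn't seen during training (an "Out of Vocabulary" or OOV word), it had no ID for it and typically had to fall back to a generic "unknown" token, losing the word's meaning. Modern models use Byte Pair Encoding (BPE) or similar subword algorithms to avoid this.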
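The OOV problem can be seen in a minimal sketch of a word-level tokenizer. The corpus, the function names, and the "<unk>" fallback token are illustrative assumptions, not taken from any particular model.

```python
# Toy word-level tokenizer. The corpus and the "<unk>" token are
# illustrative assumptions, not any real model's vocabulary.
UNK = "<unk>"

def build_vocab(corpus):
    """Assign an integer ID to every distinct whitespace-separated word."""
    vocab = {UNK: 0}
    for sentence in corpus:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode_words(text, vocab):
    """Map each word to its ID; unseen words all collapse to the <unk> ID."""
    return [vocab.get(word, vocab[UNK]) for word in text.lower().split()]

corpus = ["the cat sat", "the dog ran"]
vocab = build_vocab(corpus)

ids = encode_words("the cat yawned", vocab)
print(ids)  # [1, 2, 0] -> "yawned" was never seen, so it becomes <unk>
```

Every unseen word maps to the same ID, so the model cannot tell "yawned" apart from any other unknown word.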

Instead of treating "unbelievable" as one token, the tokenizer might split it into "un", "believ", and "able". This lets the model represent a word it has never seen as a combination of pieces it has seen, and infer its meaning from those constituent parts.
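As a rough sketch of how BPE produces such a split, the snippet below applies an ordered list of merge rules to a word, starting from individual characters. The merge list is hand-written for illustration; a real tokenizer learns its merges from corpus statistics, and its actual split of "unbelievable" may differ.

```python
# Hypothetical merge rules, in priority order. A trained BPE tokenizer
# learns these from data; this list is made up for the example.
MERGES = [
    ("u", "n"), ("b", "e"), ("l", "i"), ("e", "v"),
    ("be", "li"), ("beli", "ev"),
    ("l", "e"), ("a", "b"), ("ab", "le"),
]

def apply_bpe(word, merges):
    """Split a word into characters, then apply each merge rule in order."""
    symbols = list(word)
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Fuse an adjacent (left, right) pair into one symbol.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

print(apply_bpe("unbelievable", MERGES))  # ['un', 'believ', 'able']
print(apply_bpe("unable", MERGES))        # ['un', 'able']
```

The second call shows the benefit: "unable" was not considered when the merge list was written, yet it still decomposes into known pieces instead of becoming an unknown token.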

The Vocabulary

The "vocabulary" of a model is a fixed list of all the tokens it knows. GPT-4, for example, has a vocabulary of roughly 100,000 tokens. Every token in this list is assigned a unique integer ID. During inference, the model produces a score for every token in the vocabulary, the ID of the next token is selected from those scores, and the tokenizer then converts the IDs back into the text you see on the screen.
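The round trip from tokens to IDs and back can be sketched with a toy vocabulary. The five-entry vocabulary and the score values below are invented for illustration; a real model emits one score per entry of a vocabulary with on the order of 100,000 tokens, and often samples from the scores rather than always taking the highest.

```python
# Toy vocabulary: a token's position in the list is its integer ID.
VOCAB = ["<unk>", "un", "believ", "able", "read"]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}

def encode(tokens):
    """Look up the ID of each token, using the <unk> ID for unknown tokens."""
    return [TOKEN_TO_ID.get(t, TOKEN_TO_ID["<unk>"]) for t in tokens]

def decode(ids):
    """Convert IDs back to tokens and join them into text."""
    return "".join(VOCAB[i] for i in ids)

prompt_ids = encode(["read"])
print(prompt_ids)  # [4]

# Hypothetical model output: one score per vocabulary entry.
scores = [0.0, 0.1, 0.1, 0.7, 0.1]

# Greedy selection: pick the ID with the highest score.
next_id = max(range(len(scores)), key=scores.__getitem__)
print(next_id)  # 3

print(decode(prompt_ids + [next_id]))  # readable
```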

Efficiency and Cost

Tokenization directly impacts the cost and speed of AI. LLMs have a "context window" (a limit on how many tokens they can process at once), so a more efficient tokenizer that represents the same text with fewer tokens lets the model fit, and therefore "remember", more information within that limit. Usage is also commonly billed per token, and each model family ships its own tokenizer, which is why different models (Gemini vs. Llama) might have different costs for the same paragraph of text.
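The arithmetic behind this can be sketched as follows. All numbers (token counts, context window size, and price) are hypothetical placeholders chosen for round figures, not measurements of any real model or price list.

```python
def cost_usd(num_tokens, price_per_million_tokens):
    """Cost of processing num_tokens at a given per-million-token price."""
    return num_tokens / 1_000_000 * price_per_million_tokens

def paragraphs_that_fit(context_window, tokens_per_paragraph):
    """How many whole paragraphs of this size fit in the context window."""
    return context_window // tokens_per_paragraph

# Hypothetical: the same paragraph under two different tokenizers.
tokens_a = 120   # more efficient tokenizer
tokens_b = 150   # less efficient tokenizer

context_window = 8_000   # hypothetical limit, in tokens
price = 2.00             # hypothetical USD per million tokens

fit_a = paragraphs_that_fit(context_window, tokens_a)
fit_b = paragraphs_that_fit(context_window, tokens_b)
print(fit_a, fit_b)  # 66 53

cost_a = cost_usd(tokens_a, price)
cost_b = cost_usd(tokens_b, price)
print(cost_b > cost_a)  # True
```

Under these assumed numbers, the more efficient tokenizer fits 13 more paragraphs into the same window and costs 20% less per paragraph at the same per-token price.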
