At the very end of almost every classification model or Large Language Model, there is a single mathematical gatekeeper: the Softmax Function. Its job is to take a set of raw, arbitrary numbers (logits) and "squash" them into a probability distribution where every number is between 0 and 1, and the total sum is exactly 1.0.

Why Exponentiate?

The Softmax formula is: $\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$, where $z$ is the vector of logits and $K$ is the number of classes.

By using the exponential function ( $e^x$ ), Softmax does two things:

Ensures Positivity: $e^x$ is always positive, even if the input logit is negative.
Amplifies Differences: It makes the "winning" score larger relative to the "losers." A gap of $d$ between two logits becomes a ratio of $e^d$ between their probabilities, so a lead of 2 in logit space is roughly a 7.4x lead in probability.
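Both properties can be checked with a minimal sketch in plain Python. The function name `softmax` and the example logits are illustrative assumptions, not taken from any particular library. Subtracting the largest logit before exponentiating is a common numerical-stability trick; it cancels in the ratio, so the result is unchanged.

```python
import math

def softmax(logits):
    """Map a list of real-valued logits to probabilities that sum to 1."""
    # Assumption: subtract the max logit for numerical stability.
    # The shift cancels between numerator and denominator.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits, including a negative one.
logits = [2.0, 1.0, -1.0]
probs = softmax(logits)

# A logit gap of 1.0 between the first two entries becomes
# a probability ratio of e^1 between them.
ratio = probs[0] / probs[1]
```

Every entry of `probs` is positive even though one logit is negative, and `ratio` depends only on the gap between the two logits, not on their absolute values.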

Picking a Winner

In a Large Language Model, the final layer produces a logit for every single token in its vocabulary (e.g., 100,000 values). Softmax turns these 100,000 raw numbers into a probability map. The model doesn't just "know" the next word; it has a statistical preference for several likely words.
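To make the "probability map" concrete, here is a small sketch that turns logits into a distribution and then picks a token in two common ways: greedy selection and random sampling. The four-token vocabulary, the logit values, and the variable names are all made up for illustration; a real model would have a far larger vocabulary.

```python
import math
import random

def softmax(logits):
    """Map a list of real-valued logits to probabilities that sum to 1."""
    m = max(logits)  # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical four-token vocabulary with made-up logits.
vocab = ["cat", "dog", "car", "the"]
logits = [3.1, 2.9, 0.5, -1.0]
probs = softmax(logits)

# Greedy decoding: always take the most probable token.
greedy_token = vocab[probs.index(max(probs))]

# Sampling: draw one token in proportion to its probability.
rng = random.Random(0)  # fixed seed so the run is repeatable
sampled_token = rng.choices(vocab, weights=probs, k=1)[0]
```

In this example "cat" and "dog" receive similar probabilities, so sampling can return either one, while greedy selection always returns "cat".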

Softmax vs. Sigmoid

While Sigmoid is used for binary classification (Yes/No), Softmax is for multi-class classification. Softmax treats the classes as mutually exclusive: because the probabilities must sum to 1, if the probability of one class goes up, the others must go down. This makes it well suited to choosing the next most likely token in a sequence.
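The coupling between Softmax outputs can be seen in a short sketch that raises a single logit and compares the effect under each function. The helper names and the logit values are assumptions chosen for illustration.

```python
import math

def sigmoid(x):
    """Squash a single logit into (0, 1), independently of other logits."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    """Map a list of real-valued logits to probabilities that sum to 1."""
    m = max(logits)  # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

before = [1.0, 1.0, 1.0]
after = [2.0, 1.0, 1.0]  # only the first logit is raised

softmax_before = softmax(before)
softmax_after = softmax(after)

sigmoid_before = [sigmoid(z) for z in before]
sigmoid_after = [sigmoid(z) for z in after]
```

Under Softmax, raising the first logit lowers the probabilities of the second and third classes. Under Sigmoid applied element by element, the second and third outputs do not change, because each output depends only on its own logit.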

Frequently Asked Questions

What are 'Logits'?

Logits are the raw, unnormalized output values from the last layer of a neural network before they are passed through Softmax. They can be any real number, positive or negative.

What is 'Temperature' in Softmax?

Temperature (T) is a positive parameter that divides the logits before Softmax, so each $z_i$ becomes $z_i / T$. A high T (above 1) makes the distribution flatter (more random), while a low T (below 1) makes it sharper (more confident). T = 1 leaves the logits unchanged.
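A minimal sketch of the effect, assuming temperature is applied by dividing each logit by T before the usual Softmax. The function name `softmax_with_temperature` and the example logits are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # shift for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits evaluated at three temperatures.
logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=0.5)
base = softmax_with_temperature(logits, temperature=1.0)
flat = softmax_with_temperature(logits, temperature=2.0)
```

The top class gets the largest share in `sharp` and the smallest in `flat`, while the ranking of the classes stays the same at every temperature.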
