The Softmax Function

By EulerFold / April 25, 2026

At the very end of almost every classification model or Large Language Model, there is a single mathematical gatekeeper: the Softmax Function. Its job is to take a set of raw, arbitrary numbers (logits) and "squash" them into a probability distribution where every number is between 0 and 1, and the total sum is exactly 1.0.

[Figure: from raw logits to a probability distribution. Logits z₁ = 2.0, z₂ = 1.0, z₃ = 0.1 are exponentiated (e^2.0 ≈ 7.39, e^1.0 ≈ 2.72, e^0.1 ≈ 1.11, Σ ≈ 11.22) and each is divided by the sum, giving P₁ ≈ 0.66, P₂ ≈ 0.24, P₃ ≈ 0.10, with a total of exactly 1.0. Arbitrary reals in, valid probabilities out; higher logits get exponentially higher probability.]
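To make the figure concrete, here is a minimal Python sketch that reproduces the three-logit example step by step (the logit values come from the figure above; everything else is plain standard-library code):

```python
import math

# The three raw logits from the example above.
logits = [2.0, 1.0, 0.1]

# Step 1: exponentiate each logit (guarantees positive values).
exps = [math.exp(z) for z in logits]   # ≈ [7.39, 2.72, 1.11]

# Step 2: divide by the sum so the outputs form a distribution.
total = sum(exps)                      # ≈ 11.21
probs = [e / total for e in exps]      # ≈ [0.66, 0.24, 0.10]

print(probs, sum(probs))               # sums to 1.0 (up to float rounding)
```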

Why Exponentiate?

The Softmax formula is:

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

By using the exponential function ($e^x$), Softmax does two things (a runnable sketch follows the list):

  1. Ensures Positivity: $e^x$ is always positive, even if the input logit is negative.
  2. Amplifies Differences: It makes the "winning" score much larger relative to the "losers." If one logit is slightly higher than the others, its probability after Softmax will be significantly higher.
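A minimal NumPy sketch of the formula follows. One practical detail the formula hides: implementations usually subtract the maximum logit before exponentiating; the shift cancels out in the ratio but keeps $e^z$ from overflowing for large logits. (The function name `softmax` is just a convention here, not a fixed API.)

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax: subtracting max(z) cancels in the
    numerator/denominator ratio but prevents overflow in np.exp."""
    shifted = z - np.max(z)
    exps = np.exp(shifted)
    return exps / exps.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # ≈ [0.66 0.24 0.10]
print(softmax(np.array([1000.0, 999.0])))   # stable even for huge logits
```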

Picking a Winner

In a Large Language Model, the final layer produces a logit for every single token in its vocabulary (e.g., 100,000 values). Softmax turns these 100,000 raw numbers into a probability map. The model doesn't just "know" the next word; it has a statistical preference for several likely words.
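As a rough sketch of what this looks like at vocabulary scale, the snippet below uses random stand-in logits (the vocabulary size and the logits themselves are illustrative, not taken from any real model) and reads off the top of the resulting probability map:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 100_000                     # illustrative; real vocabularies vary
logits = rng.normal(size=vocab_size)     # stand-in for the model's final layer

probs = np.exp(logits - logits.max())    # stable softmax over the vocabulary
probs /= probs.sum()

# The "probability map": the five most likely token IDs and their probabilities.
top5 = np.argsort(probs)[-5:][::-1]
for token_id in top5:
    print(f"token {token_id}: p ≈ {probs[token_id]:.6f}")
```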

Softmax vs. Sigmoid

While Sigmoid is used for binary classification (Yes/No), Softmax is used for multi-class classification. Softmax is "mutually exclusive": if the probability of one class goes up, the others must go down. This makes it ideal for choosing the next most likely token in a sequence.
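One way to see the connection: for exactly two classes, Softmax collapses to Sigmoid applied to the difference of the two logits. A small sketch (the logit values are arbitrary):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax2(z1, z2):
    # Softmax probability of class 1 out of two classes.
    e1, e2 = math.exp(z1), math.exp(z2)
    return e1 / (e1 + e2)

z1, z2 = 2.0, 0.5
print(softmax2(z1, z2))    # ≈ 0.8176
print(sigmoid(z1 - z2))    # ≈ 0.8176 — identical by algebra
```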

"Softmax doesn't just normalize values; it uses exponentiation to aggressively amplify the highest score, creating a clear winner while ensuring the total sum is exactly 1.0."

Frequently Asked Questions

What are 'Logits'?
Logits are the raw, unnormalized output values from the last layer of a neural network before they are passed through Softmax. They can be any real number, positive or negative.
What is 'Temperature' in Softmax?
Temperature (T) is a parameter that scales the logits before Softmax. A high T makes the distribution flatter (more random), while a low T makes it sharper (more confident).
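A short sketch of that effect, reusing the three logits from the worked example (the temperature values are arbitrary):

```python
import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1])

# Dividing the logits by T before Softmax controls the sharpness.
for T in (0.5, 1.0, 2.0):
    print(f"T={T}: {np.round(softmax(logits / T), 3)}")
# Low T sharpens the distribution toward the top logit;
# high T flattens it toward uniform.
```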

The author of this article utilized generative AI (Google Gemini 3.1 Pro) to assist in part of the drafting and editing process.