For a long time, researchers believed that simply adding more layers to a neural network would make it more powerful. However, they quickly hit a wall: as networks grew deeper, they became harder to train. The models weren't just slow; they often stopped learning entirely. This phenomenon is known as the Vanishing Gradient Problem.
The Chain of Multiplication
To understand why gradients vanish, we have to look at how Backpropagation works. To compute the adjustment for a weight in an early layer, the network applies the Chain Rule, multiplying together derivatives from every subsequent layer. If the derivatives of the activation functions are small (the classic Sigmoid's derivative never exceeds 0.25), multiplying them dozens of times drives the product toward zero. By the time the "signal" reaches the first layers of the network, it is so small that the weights barely change, leaving the foundation of the model untrained.
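To make that concrete, here is a minimal sketch (plain NumPy, with arbitrarily chosen depths) that multiplies the Sigmoid's best-case derivative of 0.25 across a stack of layers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # never exceeds 0.25 (its value at x = 0)

# Chain-rule product in the *best* case: every layer contributes the
# maximum derivative of 0.25. Real networks shrink even faster.
for depth in (5, 10, 20, 50):
    factor = sigmoid_derivative(0.0) ** depth
    print(f"{depth:2d} layers -> gradient scaled by {factor:.2e}")
```

Even under this most generous assumption, fifty layers scale the gradient by roughly 10^-31, far below what floating-point weight updates can meaningfully use.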
The Impact on RNNs
This problem was particularly devastating for Recurrent Neural Networks (RNNs). In an RNN, the same weights are applied over and over, once for every step in a sequence. If you are trying to use the beginning of a long paragraph to predict its final word, the gradient must travel back through every single word, and the signal often vanishes long before it reaches the start, making it impossible for the model to capture "long-range dependencies." This is the primary reason architectures like the Transformer were developed: to process sequences without this sequential decay.
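A toy illustration of this decay, assuming a hypothetical 64-unit hidden state and a small random recurrent weight matrix (any matrix whose largest singular value sits below 1 behaves this way):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical recurrent weight matrix, scaled so its largest singular
# value is below 1 (roughly 0.05 * (sqrt(64) + sqrt(64)) ~= 0.8).
W = rng.normal(scale=0.05, size=(64, 64))

grad = np.ones(64)  # gradient arriving at the final time step
for t in range(1, 51):
    grad = W.T @ grad  # backprop through time: one multiply per word
    if t % 10 == 0:
        print(f"step {t:2d}: ||grad|| = {np.linalg.norm(grad):.2e}")
```

Because the same matrix is reused at every step, the gradient's norm decays geometrically, which is exactly why words fifty steps back contribute essentially nothing to the update.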
Beyond the Zero
Solving the vanishing gradient problem was the "unlock" that enabled the current era of Deep Learning. Techniques like the ReLU (Rectified Linear Unit) activation function, whose derivative is exactly 1 for all positive inputs and therefore doesn't saturate the way Sigmoid does, and Batch Normalization, which keeps signals within a healthy range, were critical. Perhaps most importantly, Residual Connections (as seen in ResNets) provided a "highway" for gradients to flow through the network without being multiplied into oblivion.
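Here is a minimal sketch of that "highway" in PyTorch, with hypothetical layer sizes; real ResNet blocks use convolutions and Batch Normalization, but the key idea is the same additive skip connection:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A minimal residual block: output = activation(x + F(x))."""

    def __init__(self, dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The `+ x` skip connection is the gradient "highway": during
        # backprop, the identity term passes gradients through unchanged,
        # regardless of what happens inside fc1 and fc2.
        return self.act(x + self.fc2(self.act(self.fc1(x))))

# Usage: stack many blocks without the gradient multiplying into oblivion.
model = nn.Sequential(*[ResidualBlock(128) for _ in range(50)])
out = model(torch.randn(4, 128))
```

The design choice matters: because the derivative of the identity branch is 1, the chain-rule product always contains an additive path of well-scaled gradients, no matter how deep the stack grows.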
It raises an interesting question: as we build models with millions of layers or complex recursive structures, will we encounter new forms of gradient instability that require even more radical architectural changes?
"As gradients are propagated backward through many layers, they are repeatedly multiplied by weights; if those weights are small, the gradient shrinks exponentially, leaving early layers with no signal to update."
The author of this article utilized generative AI (Google Gemini 3.1 Pro) to assist in part of the drafting and editing process.