For a long time, researchers believed that simply adding more layers to a neural network would make it more powerful. However, they quickly hit a wall: as networks grew deeper, they became harder to train. The models weren't just slow; they often stopped learning entirely. This phenomenon is known as the Vanishing Gradient Problem.

The Chain of Multiplication

To understand why gradients vanish, we have to look at how Backpropagation works. To calculate the adjustment for a weight in an early layer, the network uses the Chain Rule, multiplying derivatives from every subsequent layer. If the derivatives of the activation functions (like the classic Sigmoid) are small (less than 1), multiplying them dozens of times causes the final value to shrink toward zero. By the time the "signal" reaches the first layers of the network, it is so small that the weights barely change, leaving the foundation of the model untrained.

The Impact on RNNs

This problem was particularly devastating for Recurrent Neural Networks (RNNs). In an RNN, the same weights are applied over and over for every step in a sequence. If you are trying to understand the beginning of a long paragraph to predict the final word, the gradient must travel back through every single word. The signal often vanishes long before it reaches the beginning, making it impossible for the model to capture "long-range dependencies." This is the primary reason why architectures like the Transformer were developed-to process sequences without this sequential decay.

Beyond the Zero

Solving the vanishing gradient problem was the "unlock" that enabled the current era of Deep Learning. Techniques like ReLU (Rectified Linear Unit) activation functions, which don't saturate as easily as Sigmoid, and Batch Normalization, which keeps signals within a healthy range, were critical. Perhaps most importantly, Residual Connections (as seen in ResNets) provided a "highway" for gradients to flow through the network without being multiplied into oblivion.

It raises an interesting question: as we build models with millions of layers or complex recursive structures, will we encounter new forms of gradient instability that require even more radical architectural changes?

The Vanishing Gradient Problem

The Chain of Multiplication

The Impact on RNNs

Beyond the Zero

Frequently Asked Questions

Join the EulerFold community

Recommended Readings