Reference1 (the author's understanding of vanishing gradients is problematic, but the calculations are still useful as a reference):

How LSTM networks solve the problem of vanishing gradients

Reference2 (just read this one directly):

Why LSTMs Stop Your Gradients From Vanishing: A View from the Backwards Pass

RNN

Structure:

[Figure: vanilla RNN structure]

$\sigma$ here denotes the activation function in general, not specifically the sigmoid.
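As a concrete reference point, here is a minimal sketch of the vanilla RNN recurrence assumed in the derivation below, with tanh as the activation; the names `rnn_step`, `W_xh`, `W_hh`, `b_h` and the dimensions are illustrative, not taken from the references.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # One vanilla RNN step: h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b_h)
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Unroll a short sequence so that each h_t depends on all earlier inputs.
rng = np.random.default_rng(0)
input_dim, hidden_dim, T = 3, 4, 5
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for t in range(T):
    x_t = rng.normal(size=input_dim)
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h)  # final hidden state after T steps
```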

Vanishing gradients

$\displaystyle \frac{\partial E}{\partial W}=\sum_{t=1}^T \frac{\partial E_t}{\partial W}$

[Figures: chain-rule expansion of $\partial E_t/\partial W$ through the hidden states (backpropagation through time)]
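For reference, the standard backpropagation-through-time expansion of a single term, assuming the recurrence $h_t=\sigma(W h_{t-1}+U x_t)$ with $\sigma=\tanh$ (this is the usual derivation, which the figures above presumably walk through), is:

$\displaystyle \frac{\partial E_t}{\partial W}=\sum_{k=1}^{t}\frac{\partial E_t}{\partial h_t}\left(\prod_{j=k+1}^{t}\frac{\partial h_j}{\partial h_{j-1}}\right)\frac{\partial h_k}{\partial W},\qquad \frac{\partial h_j}{\partial h_{j-1}}=\mathrm{diag}\!\left(\sigma'(W h_{j-1}+U x_j)\right)W$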

$\sigma$ here means the tanh function; its derivative is at most 1 (and strictly less than 1 except at zero).
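Concretely,

$\displaystyle \sigma'(x)=\tanh'(x)=1-\tanh^2(x)\in(0,1]$

so each factor $\frac{\partial h_j}{\partial h_{j-1}}$ in the product above contains a diagonal term bounded by 1 (and usually well below 1), and when the recurrent weights are not large the product over $t-k$ steps shrinks roughly geometrically as $t-k$ grows.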

It is this time step's contribution to the gradient that tends to 0, not the total gradient.

If the gradient vanishes, the earlier hidden states have no real effect on the later hidden states, so no long-term dependencies are learned.

Gradients can vanish; they can also explode.
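A small numeric sketch of the two regimes (purely illustrative: the weight scales and the random stand-in pre-activations below are made up, not taken from a trained network):

```python
import numpy as np

def jacobian_product_norm(weight_scale, T=50, hidden_dim=8, seed=0):
    # Norm of prod_j diag(tanh'(a_j)) @ W_hh over T steps of a vanilla RNN.
    # The pre-activations a_j are random stand-ins, not from a real forward pass.
    rng = np.random.default_rng(seed)
    W_hh = rng.normal(scale=weight_scale, size=(hidden_dim, hidden_dim))
    prod = np.eye(hidden_dim)
    for _ in range(T):
        a = rng.normal(size=hidden_dim)        # stand-in pre-activation
        D = np.diag(1.0 - np.tanh(a) ** 2)     # tanh'(a) on the diagonal
        prod = D @ W_hh @ prod                 # one more factor of the chain-rule product
    return np.linalg.norm(prod)

print(jacobian_product_norm(weight_scale=0.1))  # small recurrent weights -> ~0 (vanishing)
print(jacobian_product_norm(weight_scale=5.0))  # large recurrent weights -> huge (exploding)
```

The Frobenius norm of the accumulated product is only a rough proxy for the actual gradient magnitude, but it shows both failure modes: small recurrent weights drive it toward 0, large ones blow it up.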

After reading the comments, my understanding is that "vanishing gradients" strictly refers to the vanishing of the gradient contributions from earlier time steps with respect to the parameters: the results of earlier time steps contribute less and less to the parameter update. It does not mean that the gradient of the final output with respect to the parameters vanishes or explodes. In the expansion above, the terms with $k$ far from $t$ vanish, while the total $\frac{\partial E}{\partial W}$ can stay well-behaved because the near-term contributions survive.

LSTM

Structure:

[Figure: LSTM structure]
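For contrast with the RNN case above, the standard LSTM cell-state update (this is the point Reference2 develops in detail) is additive:

$\displaystyle c_t=f_t\odot c_{t-1}+i_t\odot\tilde{c}_t,\qquad h_t=o_t\odot\tanh(c_t)$

Treating the gate activations as constants with respect to $c_{t-1}$ for simplicity, the backwards path through the cell state gives $\frac{\partial c_t}{\partial c_{t-1}}\approx f_t$ elementwise, so the product across many time steps is $\prod_j f_j$. The network can keep this close to 1 by keeping the forget gate near 1, instead of being forced through repeated tanh-derivative factors as in the vanilla RNN; Reference2 works out the full derivative including the gates' own dependence on the previous state.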