Clip RNN gradients in a chunk-wise manner
This PR aims to clip the RNN gradients in a chunk-wise manner, to address the gradient explosion problem in the backward pass. When computing each chunk, we clip the gradients of the hidden states and cell states that are passed between chunks (see the sketch below).
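Conceptually, the chunk-wise processing looks roughly like the following. This is a minimal sketch, not the PR's actual code: the names `chunk_wise_forward` and `clip_to_max_norm` are hypothetical, and only the norm-clamping rule is shown here; the full strategy is described next.

```python
import torch
import torch.nn as nn


def clip_to_max_norm(grad: torch.Tensor, max_norm: float = 1.0) -> torch.Tensor:
    # Clamp the norm of a state gradient to max_norm.
    norm = grad.norm()
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad


def chunk_wise_forward(lstm: nn.LSTM, x: torch.Tensor, chunk_size: int = 20):
    """x: (seq_len, batch, input_size); returns (seq_len, batch, hidden_size)."""
    outputs = []
    states = None
    for chunk in x.split(chunk_size, dim=0):
        out, (h, c) = lstm(chunk, states)
        if h.requires_grad:
            # Hooks fire during the backward pass, clipping the gradients
            # of the states passed back across the chunk boundary.
            h.register_hook(clip_to_max_norm)
            c.register_hook(clip_to_max_norm)
        states = (h, c)
        outputs.append(out)
    return torch.cat(outputs, dim=0)
```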
The gradient clipping strategy applied to the hidden states and cell states is as follows (a combined sketch is given after the list):
- Zero the gradients directly if the gradient norm is larger than a specific threshold.
- Scale the gradients down by a factor of 0.9.
- Limit the gradient norm to a maximum value.
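Combining the three rules, the backward hook could look roughly like this. This is a hedged sketch, not the PR's implementation: the function name `clip_state_grad`, the threshold value, and the exact order in which the rules are combined are assumptions; the defaults echo the `--rnn-grad-scale-factor` and `--rnn-grad-max-norm` options used below.

```python
import torch


def clip_state_grad(
    grad: torch.Tensor,
    zero_threshold: float = 10.0,  # assumed value for the "zero" rule
    scale_factor: float = 0.9,
    max_norm: float = 1.0,
) -> torch.Tensor:
    norm = grad.norm()
    # Rule 1: if the norm exceeds the threshold, drop the gradient entirely.
    if norm > zero_threshold:
        return torch.zeros_like(grad)
    # Rule 2: scale the gradient down by a constant factor.
    grad = grad * scale_factor
    norm = norm * scale_factor
    # Rule 3: clamp the (scaled) gradient norm to a maximum.
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```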
I am now running experiments with the following options:
- `--rnn-clip-grad 1 --rnn-chunk-size 20 --rnn-grad-scale-factor 1.0 --rnn-grad-max-norm 0.5`
- `--rnn-clip-grad 1 --rnn-chunk-size 20 --rnn-grad-scale-factor 1.0 --rnn-grad-max-norm 1.0`
- `--rnn-clip-grad 1 --rnn-chunk-size 20 --rnn-grad-scale-factor 1.0 --rnn-grad-max-norm 2.0`
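For reference, these options could be registered with a plain `argparse` setup along these lines. This is a hypothetical sketch; the actual icefall training script may define them with different types, defaults, or help text.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--rnn-clip-grad", type=int, default=1,
                    help="Whether to clip RNN state gradients chunk-wise.")
parser.add_argument("--rnn-chunk-size", type=int, default=20,
                    help="Number of frames per chunk in the forward pass.")
parser.add_argument("--rnn-grad-scale-factor", type=float, default=1.0,
                    help="Constant factor applied to state gradients.")
parser.add_argument("--rnn-grad-max-norm", type=float, default=1.0,
                    help="Maximum allowed norm of the state gradients.")
args = parser.parse_args()
```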