Gradient calculation error?
Hi,

For gradient back-propagation, at the line

float cur = *(grad_last + col);

should cur be 0? grad_last is the gradient of the last cell state c_t, so when calculating the gradient of c_t in the following line, it should be initialized to 0 (there is no gradient coming from c_t+1), and the value gc should equal grad_last. Is this the case?

const float tmp = g2*calc_grad_activation(activation_type, c_val);
const float gc = gh_val*mask*tmp + cur;

https://github.com/taolei87/sru/blob/master/cuda_functional.py#L131

Thanks.
Hi @Sunnydreamrain

No, grad_last is not necessarily always zero. In some cases the model passes the last cell state c_t into subsequent model components. For example, in a sequence-to-sequence task (machine translation), the last cell state of the source sentence is provided to the decoder as its initial cell state. As a result, during gradient back-propagation the gradient (grad_last) of c_t has to be passed back, and so do the gradients of c_t-1, c_t-2, etc.

Of course, when c_t is never used in any subsequent computation, PyTorch will provide a grad_last that is all zeros.
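
To make the two cases concrete, here is a minimal NumPy sketch of the per-dimension cell-state gradient recursion, assuming the usual SRU recurrence c_t = f_t*c_{t-1} + (1-f_t)*x~_t and h_t = r_t*act(c_t) + (1-r_t)*x_t (dropout mask omitted). This is not the repo's CUDA kernel; the names grad_h, f, r, act_grad are illustrative assumptions. The point it shows is that the carried gradient (the kernel's cur) is seeded from grad_last rather than hard-coded to 0:

import numpy as np

def sru_cell_backward(grad_h, grad_last, c, f, r, act_grad):
    """Backward pass over time for one SRU feature dimension (sketch).

    grad_h    : (T,) gradients w.r.t. the hidden outputs h_t
    grad_last : scalar gradient w.r.t. the final cell state c_T
                (zero only if c_T is never used downstream)
    c, f, r   : (T,) cell states, forget gates, reset/highway gates
    act_grad  : derivative of the activation applied to c_t inside h_t
    Returns the gradient w.r.t. each c_t.
    """
    T = len(grad_h)
    cur = grad_last                 # seeded from grad_last, NOT from 0
    grad_c = np.zeros(T)
    for t in reversed(range(T)):
        # analogue of: tmp = r_t * act'(c_t);  gc = grad_h_t * tmp + cur
        tmp = r[t] * act_grad(c[t])
        grad_c[t] = grad_h[t] * tmp + cur
        # carry to c_{t-1} through the recurrence c_t = f_t*c_{t-1} + (1-f_t)*x~_t
        cur = grad_c[t] * f[t]
    return grad_c

At the last time step this gives grad_c[T-1] = grad_h[T-1]*r*act'(c) + grad_last, which matches the kernel line gc = gh_val*mask*tmp + cur with cur read from grad_last. If c_T never feeds anything downstream, grad_last is all zeros and the two views coincide.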