backward through cumprod
Hey,
In get_allocation_weight() in memory.py, cumprod is used, but as far as I know this operation does not have autograd support yet and is still in a PR: https://github.com/pytorch/pytorch/pull/1439. So I'm confused about how the backward pass goes through this operation for you. Are the resulting variables not in the computation graph?
Thanks in advance! Really nice repository:)
@AjayTalati
You said you also have an implementation, so I guessed you also use cumprod for computing the allocation weights. For me the backward simply fails with cumprod; do you maybe have an idea why?
Thanks a lot!
Ah, I just found that you have a custom cumprod function. But if I am reading it correctly, it treats the input only as a tensor, not as a variable, so all of the computation graph on the variable will be lost after this function, right? Why doesn't this one need to be in the graph? Maybe there's something obvious that I'm missing.
Hi @jingweiz
"for me the backward simply fails with cumprod"
Can you post your error messages?
If the input is a tensor, it will call torch.cumprod. But if the input is a variable, it will call the custom cumprod, which is written using autograd-supported torch functions. So it should work just fine.
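Roughly, the idea is like this (a simplified sketch; the names and the exact check in utils.py may differ):

import torch
from torch.autograd import Variable

def custom_cumprod(x, dim=1):
    # cumprod built only from ops that autograd already supports:
    # multiply column by column and stack the partial products
    slices = [x.select(dim, 0)]
    for i in range(1, x.size(dim)):
        slices.append(slices[-1] * x.select(dim, i))
    return torch.stack(slices, dim)

def cumprod_dispatch(x, dim=1):
    if isinstance(x, Variable):
        return custom_cumprod(x, dim)  # keeps the computation graph
    return torch.cumprod(x, dim)       # plain tensor, no graph needed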
@ypxie Hey, thanks a lot for the reply! Can you explain further why it is a leaf node? When we input a sequence row by row, doesn't the current usage_t depend on usage_{t-1}? I'm currently under the impression that all those intermediate calculations should be carried out in variables, so as to keep the computation history for the complete sequence, and that those variables are only reset and detached from the graph when a new sequence comes in (sketched below). Or what is your understanding?
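By resetting I mean something like the usual repackaging trick from the word language model example (a minimal sketch with a made-up state name, not code from either repo):

from torch.autograd import Variable

def repackage(state):
    # keep the current values but drop the accumulated history,
    # so backprop for the next sequence stops here
    return Variable(state.data)

# at the boundary between two sequences (illustrative):
# usage = repackage(usage)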
@AjayTalati
Hey, it's not an error from this repo, but from when I use the current cumprod from pytorch; it's due to autograd not being implemented for this op yet: https://discuss.pytorch.org/t/cumprod-exclusive-true-equivalences/2614/8 and https://github.com/pytorch/pytorch/pull/1439
For example, index_mapper is a leaf node, so it doesn't matter that it is constructed on the fly from a numpy array during the process. For the gradient of cumprod, please take a look at the updated response.
@ypxie I mean that the allocation_weight depends on the usage, the write_weight depends on the allocation_weight, the memory is updated with the write_weight, and usage_t is also calculated from usage_{t-1}. If the usage is somehow dropped from the graph, then all the history of computations it carried will be lost, and the gradient will not pass back to the older time steps. In that case the backward would still run, but the earlier part of the graph won't get its gradients, because the connection is cut at the detached usage. What do you think?
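Just to make the detaching concern concrete, here is a tiny toy example (nothing DNC-specific, just the Variable behaviour I'm worried about):

import torch
from torch.autograd import Variable

x = Variable(torch.ones(3), requires_grad=True)
usage = x * 2                    # part of the graph, usage.requires_grad is True
detached = Variable(usage.data)  # same numbers, but the link back to x is gone
print(usage.requires_grad, detached.requires_grad)  # True False
# anything computed from `detached` can never send gradients back to x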
So my solution is: instead of just using the data from usage and dropping the variable, I implement a fake_cumprod using existing torch ops, so that the backward still works:
import torch
from torch.autograd import Variable

def fake_cumprod(vb):
    """
    args:
        vb: [hei x wid] Variable
        -> NOTE: we are lazy here so now it only supports cumprod along wid
    """
    # real_cumprod = torch.cumprod(vb.data, 1)
    vb = vb.unsqueeze(0)
    # slice i of the mask keeps columns 0..i; everything else gets replaced by 1
    mul_mask_vb = Variable(torch.zeros(vb.size(2), vb.size(1), vb.size(2))).type_as(vb)
    for i in range(vb.size(2)):
        mul_mask_vb[i, :, :i+1] = 1
    add_mask_vb = 1 - mul_mask_vb
    vb = vb.expand_as(mul_mask_vb) * mul_mask_vb + add_mask_vb
    # product over the last dim gives the cumulative product up to each column
    vb = torch.prod(vb, 2).transpose(0, 2)
    # print(real_cumprod - vb.data) # NOTE: checked, ==0
    return vb
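A quick sanity check of the kind I run (shapes are arbitrary; on the pytorch version I'm using, squeeze() drops the extra leading dim):

vb = Variable(torch.rand(4, 5), requires_grad=True)
out = fake_cumprod(vb)
print((out.data.squeeze() - torch.cumprod(vb.data, 1)).abs().max())  # ~0
out.sum().backward()  # goes through, since only mul/add/prod are used
print(vb.grad)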
The custom cumprod should work well with autograd, because it is implemented with autograd-supported torch functions.
@ypxie
Hey, thanks for the quick reply:)
What I'm confused about is line 98 in utils.py:
output = Variable(inputs.data.new(*shape_).fill_(1.0), requires_grad = True)
Here inputs is detached from the graph, right? According to this: https://discuss.pytorch.org/t/help-clarifying-repackage-hidden-in-word-language-model/226/2?u=apaszke
You can delete this line; it is not used in what follows. I will remove it.