backward through cumprod
Hey,
In get_allocation_weight() in memory.py, cumprod is used, but as far as I know this operation does not have autograd support yet and is still in a PR: https://github.com/pytorch/pytorch/pull/1439. So I'm confused about how the backward pass goes through this operation for you. Are the resulting variables not in the computation graph?
Thanks in advance! Really nice repository:)
@AjayTalati
You said you also have an implementation, so I guessed you also use cumprod for computing the allocation weights. For me the backward simply fails with cumprod; do you maybe have an idea why?
Thanks a lot!
Ah, I just found that you have a custom cumprod function. But if I am reading it correctly, it treats the input only as a tensor, not as a variable, so all of the computation graph on the variable will be lost after this function, right? Why doesn't this one need to be in the graph? Maybe there's something obvious that I'm missing.
Hi @jingweiz
"for me the backward simply fails with cumprod"
Can you post your error messages?
If the input is a tensor, it will call torch.cumprod. But if the input is a variable, it will call the custom cumprod, which is written using autograd-supported torch functions. So it should work just fine.
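Roughly, the idea is like this (a simplified sketch; the names and the exact check in utils.py may differ):

import torch
from torch.autograd import Variable

def custom_cumprod(x, dim=1):
    # cumprod built only from ops that autograd already supports:
    # multiply column by column and stack the partial products
    slices = [x.select(dim, 0)]
    for i in range(1, x.size(dim)):
        slices.append(slices[-1] * x.select(dim, i))
    return torch.stack(slices, dim)

def cumprod_dispatch(x, dim=1):
    if isinstance(x, Variable):
        return custom_cumprod(x, dim)  # keeps the computation graph
    return torch.cumprod(x, dim)       # plain tensor, no graph needed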
@ypxie Hey, thanks a lot for the reply! Can you explain further why it is a leaf node? When we input a sequence row by row, doesn't the current usage_t depend on usage_{t-1}? I'm currently under the impression that all those intermediate calculations should be carried out in variables, so as to keep the computation history for the complete sequence, and that those variables are only reset and detached from the graph when a new sequence comes in (sketched below). Or what is your understanding?
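By resetting I mean something like the usual repackaging trick from the word language model example (a minimal sketch with a made-up state name, not code from either repo):

from torch.autograd import Variable

def repackage(state):
    # keep the current values but drop the accumulated history,
    # so backprop for the next sequence stops here
    return Variable(state.data)

# at the boundary between two sequences (illustrative):
# usage = repackage(usage)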
@AjayTalati
Hey, it's not an error from this repo, but from when I use the current cumprod from pytorch; it's due to autograd not being implemented for this op yet: https://discuss.pytorch.org/t/cumprod-exclusive-true-equivalences/2614/8 and https://github.com/pytorch/pytorch/pull/1439
For example, index_mapper is a leaf node, so it doesn't matter that it is constructed on the fly from a numpy array during the process. For the gradient of cumprod, please take a look at the updated response.
@ypxie I mean that the allocation_weight depends on the usage, the write_weight depends on the allocation_weight, the memory is updated with the write_weight, and usage_t is also calculated from usage_{t-1}. If the usage is somehow dropped from the graph, then all the history of computations it carried will be lost, and the gradient will not pass back to the older time steps. In that case the backward would still run, but the earlier part of the graph won't get its gradients, because the connection is cut at the detached usage. What do you think?
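Just to make the detaching concern concrete, here is a tiny toy example (nothing DNC-specific, just the Variable behaviour I'm worried about):

import torch
from torch.autograd import Variable

x = Variable(torch.ones(3), requires_grad=True)
usage = x * 2                    # part of the graph, usage.requires_grad is True
detached = Variable(usage.data)  # same numbers, but the link back to x is gone
print(usage.requires_grad, detached.requires_grad)  # True False
# anything computed from `detached` can never send gradients back to x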
So my solution is: instead of just using the data from usage and dropping the variable, I implement a fake_cumprod using existing torch ops, so that the backward still works:
import torch
from torch.autograd import Variable

def fake_cumprod(vb):
    """
    args:
        vb: [hei x wid] Variable
        -> NOTE: we are lazy here so now it only supports cumprod along wid
    """
    # real_cumprod = torch.cumprod(vb.data, 1)
    vb = vb.unsqueeze(0)
    # slice i of the mask keeps columns 0..i; everything else gets replaced by 1
    mul_mask_vb = Variable(torch.zeros(vb.size(2), vb.size(1), vb.size(2))).type_as(vb)
    for i in range(vb.size(2)):
        mul_mask_vb[i, :, :i+1] = 1
    add_mask_vb = 1 - mul_mask_vb
    vb = vb.expand_as(mul_mask_vb) * mul_mask_vb + add_mask_vb
    # product over the last dim gives the cumulative product up to each column
    vb = torch.prod(vb, 2).transpose(0, 2)
    # print(real_cumprod - vb.data) # NOTE: checked, ==0
    return vb
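A quick sanity check of the kind I run (shapes are arbitrary; on the pytorch version I'm using, squeeze() drops the extra leading dim):

vb = Variable(torch.rand(4, 5), requires_grad=True)
out = fake_cumprod(vb)
print((out.data.squeeze() - torch.cumprod(vb.data, 1)).abs().max())  # ~0
out.sum().backward()  # goes through, since only mul/add/prod are used
print(vb.grad)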
The custom cumprod should work well with autograd, because it is implemented with autograd-supported torch functions.
@ypxie
Hey, thanks for the quick reply:)
What I'm confused about is line 98 in utils.py:
output = Variable(inputs.data.new(*shape_).fill_(1.0), requires_grad = True)
Here inputs is detached from the graph, right? According to this: https://discuss.pytorch.org/t/help-clarifying-repackage-hidden-in-word-language-model/226/2?u=apaszke
You can delete this line; it is not used in what follows. I will remove it.