
Variable slice/index assignment graph breaking

Open filipeabperes opened this issue 6 years ago • 18 comments

I have been facing an issue when trying to create a graph of a Module in which some Variables have slice assignment operations in them. I have reduced the problem to the following example; ignore the commented-out V = x for now.

import torch
from torch.autograd import Variable
from tensorboardX import SummaryWriter


class DummyModule(torch.nn.Module):
    def forward(self, x):
        V = Variable(torch.Tensor(2, 2))
        V[0, 0] = x
        # V = x
        return torch.sum(V * 3)


x = Variable(torch.Tensor([1]), requires_grad=True)
r = DummyModule()(x)
r.backward()
print(x.grad)

w = SummaryWriter()
x = Variable(torch.Tensor([1]), requires_grad=True)
w.add_graph(DummyModule(), x, verbose=True)

The output from this is below, showing that the gradients are flowing all right but the graph is not being connected. If I insert another input Variable and other operations into the Module, add_graph() works fine without throwing an error, but the graph shows a disconnected input for x, so I suppose the nature of this error is that the only available input Variable is being interpreted as disconnected.

Variable containing:
 3
[torch.FloatTensor of size (1,)]

Traceback (most recent call last):
  File "test_grad.py", line 21, in <module>
    w.add_graph(DummyModule(), x, verbose=True)
  File "/Users/filiped/anaconda/envs/pytorch0.4/lib/python3.6/site-packages/tensorboardX/writer.py", line 400, in add_graph
    self.file_writer.add_graph(graph(model, input_to_model, verbose))
  File "/Users/filiped/anaconda/envs/pytorch0.4/lib/python3.6/site-packages/tensorboardX/graph.py", line 44, in graph
    trace, _ = torch.jit.trace(model, args)
  File "/Users/filiped/anaconda/envs/pytorch0.4/lib/python3.6/site-packages/torch/jit/__init__.py", line 251, in trace
    return TracedModule(f, nderivs=nderivs)(*args, **kwargs)
  File "/Users/filiped/anaconda/envs/pytorch0.4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/filiped/anaconda/envs/pytorch0.4/lib/python3.6/site-packages/torch/jit/__init__.py", line 287, in forward
    torch._C._tracer_exit(out_vars)
RuntimeError: /Users/filiped/pytorch/torch/csrc/jit/tracer.h:117: getTracingState: Assertion `state` failed.

Moreover, if you uncomment the line V = x and comment out the line above it, so that no slice/index assignment operation is performed, you get, as expected:

Variable containing:
 3
[torch.FloatTensor of size (1,)]

graph(%0 : Float(1)) {
  %1 : UNKNOWN_TYPE = Constant[value={3}](), scope: DummyModule
  %2 : Float(1) = Mul[broadcast=1](%0, %1), scope: DummyModule
  %3 : Float() = Sum(%2), scope: DummyModule
  return (%3);
}

This was all executed in PyTorch 0.4.

(Edits: Did a couple rounds of re-simplifying the example.)

filipeabperes avatar Feb 15 '18 05:02 filipeabperes

In PyTorch 0.3.1 this does not seem to be a problem, with the output being:

Variable containing:
 3
[torch.FloatTensor of size 1]

/Users/filiped/anaconda/lib/python3.6/site-packages/torch/onnx/__init__.py:244: UserWarning: ONNX export failed on Constant because torch.onnx.symbolic.Constant does not exist
  .format(op_name, op_name))
/Users/filiped/anaconda/lib/python3.6/site-packages/torch/onnx/__init__.py:244: UserWarning: ONNX export failed on sum because torch.onnx.symbolic.sum does not exist
  .format(op_name, op_name))
graph(%1 : Float(1)) {
  %2 : Float(2, 2) = Constant[value= 1.0000e+00 -4.6566e-10 -6.4371e+05  1.0845e-19 [ CPUFloatTensor{2,2} ]](), uses = [%3.i0], scope: DummyModule;
  %4 : Float(2, 2), %5 : Handle = ^SetItem((0, 0))(%2, %1), uses = [[%7.i0], []], scope: DummyModule;
  %6 : UNKNOWN_TYPE = Constant[value={3}](), uses = [%7.i1], scope: DummyModule;
  %7 : Float(2, 2) = Mul[broadcast=1](%4, %6), uses = [%8.i0], scope: DummyModule;
  %8 : Float() = sum(%7), uses = [%0.i0], scope: DummyModule;
  return (%8);
}

filipeabperes avatar Feb 15 '18 17:02 filipeabperes

Thanks for the report. The code paths (in tensorboardX) for v0.3 and v0.4 are different. For the code you used, v0.3/v0.3.1 goes through ONNX export as an intermediate buffer to add the graph, while in v0.4 tensorboardX exports the graph much more directly.

I just merged patch #83 so that tensorboardX has similar behavior for v0.3.1 and v0.4. I think (this patch + PyTorch v0.3.1) should also fail on your code.

I will inspect this once the CI tests pass.

lanpa avatar Feb 15 '18 18:02 lanpa

@lanpa the results in PyTorch 0.3.1 that I reported above were already with the patch pulled in.

filipeabperes avatar Feb 15 '18 18:02 filipeabperes

@filipeabperes Do you mean 24a0d77?

lanpa avatar Feb 15 '18 18:02 lanpa

@lanpa Yeah, I just re-pulled from the repo and ran the example to test, with the same results as above.

filipeabperes avatar Feb 15 '18 18:02 filipeabperes

I replaced w.add_graph(DummyModule(), x, verbose=True) with torch.onnx.export(DummyModule(), x, "./IndexLayer.pb", verbose=True). In PyTorch v0.4, I got the same error message.

...
getTracingState: Assertion `state` failed.

In PyTorch v0.3.1, it becomes:

Traceback (most recent call last):
  File "fili.py", line 22, in <module>
    torch.onnx.export(DummyModule(), x, "./IndexLayer.pb", verbose=True)
  File "/Users/dexter/anaconda3/lib/python3.6/site-packages/torch/onnx/__init__.py", line 75, in export
    _export(model, args, f, export_params, verbose, training)
  File "/Users/dexter/anaconda3/lib/python3.6/site-packages/torch/onnx/__init__.py", line 131, in _export
    proto = trace.export(list(model.state_dict().values()), _onnx_opset_version)
RuntimeError: ONNX export failed: Couldn't export Python operator SetItem

Graph we tried to export:
graph(%1 : Float(1)) {
  %2 : Float(2, 2) = Constant[value= 1.0000e+00 -2.5244e-29  4.5598e+20 -1.0845e-19 [ CPUFloatTensor{2,2} ]](), uses = [%3.i0], scope: DummyModule;
  %4 : Float(2, 2), %5 : Handle = ^SetItem((0, 0))(%2, %1), uses = [[%7.i0], []], scope: DummyModule;
  %6 : UNKNOWN_TYPE = Constant[value={3}](), uses = [%7.i1], scope: DummyModule;
  %7 : Float(2, 2) = Mul[broadcast=1](%4, %6), uses = [%8.i0], scope: DummyModule;
  %8 : Float() = sum(%7), uses = [%0.i0], scope: DummyModule;
  return (%8);
}

Looks like there is still a problem in v0.3.1. I think this bug needs to be reported to the ONNX developers.

lanpa avatar Feb 15 '18 19:02 lanpa

Possibly related: I have noticed a similar issue in 0.3.1 when using slices. Example below:

import torch
from torch.autograd import Variable
from tensorboardX import SummaryWriter


class DummyModule(torch.nn.Module):
    def forward(self, x):
        V = Variable(torch.Tensor(2, 2))
        V[0, 0] = x[0:1]
        # V[0, 0] = x[0]
        return torch.sum(V * 3)


x = Variable(torch.Tensor([1, 1, 1]), requires_grad=True)
r = DummyModule()(x)
r.backward()

print(x.grad)

w = SummaryWriter()
x = Variable(torch.Tensor([1, 1, 1]), requires_grad=True)
w.add_graph(DummyModule(), x, verbose=True)

This gives the following output. As before, switching the commented lines removes the problem (sketched after the traceback below), so it seems to be particular to slicing.

Variable containing:
 3
 0
 0
[torch.FloatTensor of size 3]

Traceback (most recent call last):
  File "slice_grad.py", line 22, in <module>
    w.add_graph(DummyModule(), x, verbose=True)
  File "/Users/filiped/anaconda/lib/python3.6/site-packages/tensorboardX-1.0-py3.6.egg/tensorboardX/writer.py", line 400, in add_graph
    self.file_writer.add_graph(graph(model, input_to_model, verbose))
  File "/Users/filiped/anaconda/lib/python3.6/site-packages/tensorboardX-1.0-py3.6.egg/tensorboardX/graph.py", line 54, in graph
    torch.onnx._optimize_trace(trace)
  File "/Users/filiped/anaconda/lib/python3.6/site-packages/torch/onnx/__init__.py", line 81, in _optimize_trace
    torch._C._jit_pass_onnx(trace)
  File "/Users/filiped/anaconda/lib/python3.6/site-packages/torch/onnx/__init__.py", line 148, in _run_symbolic_method
    return symbolic_fn(*args)
  File "/Users/filiped/anaconda/lib/python3.6/site-packages/torch/autograd/_functions/tensor.py", line 77, in symbolic
    raise ValueError('Unsupported index type {}'.format(type(index)))
ValueError: Unsupported index type <class 'slice'>
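
For reference, a minimal sketch of the integer-index variant that does trace (this is just the commented-out line above, spelled out):

import torch
from torch.autograd import Variable


class DummyModule(torch.nn.Module):
    def forward(self, x):
        V = Variable(torch.Tensor(2, 2))
        V[0, 0] = x[0]  # integer index instead of the slice x[0:1]
        return torch.sum(V * 3)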

filipeabperes avatar Feb 15 '18 19:02 filipeabperes

Looks like this PR implements slice. https://github.com/pytorch/pytorch/pull/5204

lanpa avatar Feb 16 '18 15:02 lanpa

That was only merged into 0.4, right? I just tested and it doesn't seem to fix the 0.4 problem.

filipeabperes avatar Feb 16 '18 17:02 filipeabperes

Tested with 0.4.0a0+063946d and still the same.

I modified the module to:

class DummyModule(torch.nn.Module):
    def __init__(self):
        super(DummyModule, self).__init__()
        self.V = torch.nn.Parameter(torch.Tensor(2, 2))

    def forward(self, x):
        self.V[0, 0] = x
        return torch.sum(self.V)

but now r.backward() triggers:

  File "/home/dexter/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 81, in backward
    variables, grad_variables, retain_graph, create_graph)
RuntimeError: leaf variable has been moved into the graph interior
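
A minimal sketch of a common workaround for that error, writing into a clone of the Parameter instead of the leaf itself (I haven't checked whether the trace/export then succeeds):

import torch


class DummyModule(torch.nn.Module):
    def __init__(self):
        super(DummyModule, self).__init__()
        self.V = torch.nn.Parameter(torch.Tensor(2, 2))

    def forward(self, x):
        V = self.V.clone()  # the clone is not a leaf, so in-place assignment is allowed
        V[0, 0] = x
        return torch.sum(V)

With this, r.backward() propagates gradients to both self.V and x.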

lanpa avatar Mar 28 '18 15:03 lanpa

I got a similar issue without tensorboardX or torch.nn.Parameter.

I simply use a torch.Tensor (dtype=float64) and try to set some values in it. I even tried the scatter_ function, but that did not work either.

My code is basically:

import torch

# initialize tensor
tensor = torch.zeros((1, 400, 400)).double()
tensor.requires_grad_(True)

# create index ranges
x_range = torch.arange(150, 250).double()
x_range.requires_grad_(True)
y_range = torch.arange(150, 250).double()
y_range.requires_grad_(True)

# get indices of flattened tensor
x_range = x_range.long().repeat(100, 1)
y_range = y_range.long().repeat(100, 1)
y_range = y_range.t()
tensor_size = tensor.size()
indices = y_range.sub(1).mul(tensor_size[2]).add(x_range).view((1, -1))

# create patch
patch = torch.ones((1, 100, 100)).double()

# flatten tensor
tensor_flattened = tensor.contiguous().view((1, -1))

# set patch into cells of tensor_flattened at indices and reshape tensor
tensor_flattened.scatter_(1, indices, patch.view(1, -1))
tensor = tensor_flattened.view(tensor_size)

# sum up for scalar output for calling backward()
tensor_sum = tensor.sum()

# calling backward()
tensor_sum.backward()

# alternative to avoid summing tensor:
tensor.backward(torch.ones_like(tensor))

Seems like this issue is not caused by tensorboardX.
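
A minimal sketch (untested against the exact versions above) of an alternative that avoids the in-place write to a view of the leaf tensor: clone first, then scatter_ into the clone. The indices and patch below are toy values for illustration, not the original 100x100 construction:

import torch

tensor = torch.zeros((1, 400, 400), dtype=torch.float64, requires_grad=True)
indices = torch.arange(0, 10, dtype=torch.long).view(1, -1)  # toy indices
patch = torch.ones((1, 10), dtype=torch.float64)             # toy patch values

# clone first: the clone is not a leaf, so the in-place scatter_ is allowed
tensor_flattened = tensor.contiguous().view((1, -1)).clone()
tensor_flattened.scatter_(1, indices, patch)
result = tensor_flattened.view(tensor.size())

result.sum().backward()  # gradients now flow back to `tensor`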

justusschock avatar May 04 '18 07:05 justusschock

Update: still not working with the PyTorch 0.4 release + tensorboardX master. Output of tensorboardX:

Error occurs, No graph saved
Checking if it's onnx problem...
Your model fails onnx too, please report to onnx team

lanpa avatar May 08 '18 17:05 lanpa

Has someone reported this to the ONNX team? I'm busy right now but could do it in a couple of weeks if not.

filipeabperes avatar May 09 '18 02:05 filipeabperes

I'm having the same error when converting a multi-dimensional numpy array to a tensor inside forward.

sgarcia22 avatar Jun 16 '18 00:06 sgarcia22

@sgarcia22 How can a numpy array cause this error?

lanpa avatar Jun 18 '18 18:06 lanpa

This issue is solved in the new ONNX. Closing this.

lanpa avatar Dec 29 '18 17:12 lanpa

Is this really fixed? Running the following code, I get:

import torch
from tensorboardX import SummaryWriter


class DummyModule(torch.nn.Module):
    def forward(self, x):
        V = torch.zeros(2, 2)
        V[0, 0] = x
        # V = x
        return torch.sum(V * 3)

x = torch.tensor([1.0], requires_grad=True)
r = DummyModule()(x)
r.backward()
print(x.grad)

w = SummaryWriter()
x = torch.tensor([1.0], requires_grad=True)
w.add_graph(DummyModule(), x, verbose=True)

tensor([3.])
graph(%0 : Float(1)) {
  %1 : Float() = onnx::Constant[value={0}]()
  return (%1);
}

And again, with the line V = x uncommented, I get:

tensor([3.])
graph(%0 : Float(1)) {
  %1 : Tensor = onnx::Constant[value={3}](), scope: DummyModule
  %2 : Float(1) = onnx::Mul(%0, %1), scope: DummyModule
  %3 : Float() = onnx::ReduceSum[keepdims=0](%2), scope: DummyModule
  return (%3);
}
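
For anyone hitting this in the meantime, a minimal sketch of a rewrite that sidesteps the index assignment entirely (my assumption being that a construction via ordinary traced ops keeps the graph connected):

import torch


class DummyModule(torch.nn.Module):
    def forward(self, x):
        # build the 2x2 tensor with cat/view instead of in-place index assignment
        pad = x.new_zeros(3)                # zeros for the three untouched cells
        V = torch.cat([x, pad]).view(2, 2)
        return torch.sum(V * 3)

Since cat and view are ordinary traceable ops, the input %0 should stay connected in the resulting graph instead of collapsing to a constant.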

filipeabperes avatar Feb 17 '19 23:02 filipeabperes

Interesting, I closed this because the ONNX error had disappeared. Reopening it.

lanpa avatar Mar 03 '19 13:03 lanpa