pytorch_scatter icon indicating copy to clipboard operation
pytorch_scatter copied to clipboard

PyTorch1.10+cu113 CUDA Graph for torch_scatter raises RuntimeError w/o argument `out`

Open liulixinkerry opened this issue 3 years ago • 8 comments

Dear authors, I find that if we use torch_scatter in CUDA graph. The output memory should be pre-allocated (the param out) in scatter_mean, scatter_max, scatter_add, etc. Meanwhile, since scatter_softmax does not support out argument, scatter_softmax + CUDA Graph always raises the same RuntimeError as scatter_mean w/o given out.

from torch_scatter import scatter_mean, scatter_max, scatter_add
import torch
num_iters = 1000

device = torch.device("cuda:0")

src = torch.Tensor([[2, 0, 1, 4, 3], [0, 2, 1, 3, 4]]).to(device)
index = torch.tensor([[4, 5, 4, 2, 3], [0, 0, 2, 2, 1]]).to(device)

print("Param out is given")
out_mean = src.new_zeros((2, 6))
out_max = src.new_zeros((2, 6))
out_add = src.new_zeros((2, 6))
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for i in range(3):
        out_mean = scatter_mean(src, index, out=out_mean)
        out_max, _ = scatter_max(src, index, out=out_max)
        out_add = scatter_add(src, index, out=out_add)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    out_mean = scatter_mean(src, index, out=out_mean)
    out_max, _ = scatter_max(src, index, out=out_max)
    out_add = scatter_add(src, index, out=out_add)

for i in range(num_iters):
    g.replay()
torch.cuda.synchronize(0)
print("out_mean: ", out_mean)
print("out_max: ", out_max)
print("out_add: ", out_add)

print("============")
print("Param out is not given")
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for i in range(3):
        out_mean = scatter_mean(src, index)
        out_max, _ = scatter_max(src, index)
        out_add = scatter_add(src, index)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    out_mean = scatter_mean(src, index)
    out_max, _ = scatter_max(src, index)
    out_add = scatter_add(src, index)

for i in range(num_iters):
    g.replay()
torch.cuda.synchronize(0)
print("out_mean: ", out_mean)
print("out_max: ", out_max)
print("out_add: ", out_add)

Output:

Param out is given
out_mean:  tensor([[0.0000e+00, 0.0000e+00, 4.0120e+03, 3.0090e+03, 3.0000e+00, 0.0000e+00],
        [2.0000e+00, 4.0120e+03, 4.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00]],
       device='cuda:0')
out_max:  tensor([[0., 0., 4., 3., 2., 0.],
        [2., 4., 3., 0., 0., 0.]], device='cuda:0')
out_add:  tensor([[   0.,    0., 4012., 3009., 3009.,    0.],
        [2006., 4012., 4012.,    0.,    0.,    0.]], device='cuda:0')
============
Param out is not given
Traceback (most recent call last):
  File "test.py", line 50, in <module>
    out_mean = scatter_mean(src, index)
  File "~/anaconda3/lib/python3.8/site-packages/torch_scatter/scatter.py", line 41, in scatter_mean
    out = scatter_sum(src, index, dim, out, dim_size)
  File "~/anaconda3/lib/python3.8/site-packages/torch_scatter/scatter.py", line 19, in scatter_sum
    size[dim] = int(index.max()) + 1
RuntimeError: CUDA error: operation not permitted when stream is capturing
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 52, in <module>
    out_add = scatter_add(src, index)
  File "~/anaconda3/lib/python3.8/site-packages/torch/cuda/graphs.py", line 149, in __exit__
    self.cuda_graph.capture_end()
  File "~/anaconda3/lib/python3.8/site-packages/torch/cuda/graphs.py", line 71, in capture_end
    super(CUDAGraph, self).capture_end()
RuntimeError: CUDA error: operation failed due to a previous error during capture
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1

liulixinkerry avatar Oct 23 '21 02:10 liulixinkerry

I wonder whether it is good to mention this known issue in the documentation.

liulixinkerry avatar Oct 23 '21 02:10 liulixinkerry

The problem may come from ops that synchronize the CPU with the GPU (getting tensor shape). However, it seems that PyTorch CUDA graph cannot support that.

liulixinkerry avatar Oct 23 '21 03:10 liulixinkerry

Thanks for reporting. I will look into this.

rusty1s avatar Oct 23 '21 05:10 rusty1s

Sorry for the delayed response. Note that setting dim_size works as well (which is also supported in scatter_softmax).

rusty1s avatar Nov 14 '21 05:11 rusty1s

Thx for you reply. Does that mean we can set the dim_size to achieve pre-allocated out?

liulixinkerry avatar Nov 14 '21 06:11 liulixinkerry

Yes, this avoids the dynamic allocation of out.

rusty1s avatar Nov 14 '21 08:11 rusty1s

which version of torch scatter, are you using with pytorch 1.10

nihalgupta84 avatar May 08 '23 09:05 nihalgupta84

I think latest should work, but wheels are only provided for torch-scatter=2.0.9 as its latest version.

rusty1s avatar May 09 '23 05:05 rusty1s