Pytorch-Correlation-extension

CUDA error during PyTorch distributed training...

Open liming-ai opened this issue 4 years ago • 6 comments

Hi, thanks for your contribution. When I use distributed (multi-GPU) training, I always get RuntimeError: CUDA error: invalid device function. Here is my test code:

import torch
from spatial_correlation_sampler import SpatialCorrelationSampler

device = "cuda"
batch_size = 1
channel = 1
H = 10
W = 10
dtype = torch.float32

input1 = torch.randint(1, 4, (batch_size, channel, H, W), dtype=dtype, device=device, requires_grad=True)
input2 = torch.randint_like(input1, 1, 4).requires_grad_(True)

correlation_sampler = SpatialCorrelationSampler(
    kernel_size=3,
    patch_size=1,
    stride=2,
    padding=0,
    dilation=2,
    dilation_patch=1)

model = torch.nn.DataParallel(correlation_sampler, device_ids=[0,1,2]).cuda()

out = model(input1, input2)

print(out.shape)

My environment is:

Ubuntu 18.04.5 LTS
PyTorch -- 1.6.0
torchvision -- 0.7.0
gcc -- 7.5.0
CUDA -- 10.2

The whole error info is:

Traceback (most recent call last):
  File "test.py", line 24, in <module>
    out = model(input1, input2)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 156, in forward
    return self.gather(outputs, self.output_device)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/cuda/comm.py", line 166, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: CUDA error: invalid device function
[1]    20866 segmentation fault (core dumped)  python test.py

For non-distributed (single-GPU) training there is no error, but the output still ends with a segmentation fault:

torch.Size([1, 1, 1, 3, 3])
[1]    22742 segmentation fault (core dumped)  python test.py

liming-ai avatar Dec 28 '20 03:12 liming-ai

Hi, what hardware are you using? It looks like you have different GPUs and the module was only built for the first GPU, which does not have the same compute capability as one of your other two GPUs.

See an interesting PR about it here

ClementPinard avatar Dec 28 '20 11:12 ClementPinard
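
The compute-capability hypothesis above can be checked directly from PyTorch. The following is a minimal sketch, not part of the original thread: the helper names arch_list and visible_capabilities are made up for illustration, while torch.cuda.device_count and torch.cuda.get_device_capability are existing PyTorch calls. The resulting string has the form expected by the TORCH_CUDA_ARCH_LIST environment variable, which PyTorch's extension builder reads when compiling CUDA extensions from source.

```python
from typing import List, Tuple


def arch_list(capabilities: List[Tuple[int, int]]) -> str:
    """Build a TORCH_CUDA_ARCH_LIST-style string from (major, minor) pairs.

    Hypothetical helper: duplicates are dropped and entries are sorted.
    """
    unique = sorted(set(capabilities))
    return ";".join(f"{major}.{minor}" for major, minor in unique)


def visible_capabilities() -> List[Tuple[int, int]]:
    """Return the compute capability of every GPU PyTorch can see.

    Returns an empty list when PyTorch is not installed or no GPU is visible.
    """
    try:
        import torch
    except ImportError:
        return []
    return [
        tuple(torch.cuda.get_device_capability(i))
        for i in range(torch.cuda.device_count())
    ]


if __name__ == "__main__":
    caps = visible_capabilities()
    print("compute capabilities:", caps)
    print("TORCH_CUDA_ARCH_LIST =", arch_list(caps))
```

If every GPU reports the same pair (a GTX 1080 Ti reports 6.1), a mixed-architecture build is unlikely to be the cause. If the pairs differ, exporting the printed value as TORCH_CUDA_ARCH_LIST before running python setup.py install should make the build target all of them.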

Hi, I have 3 NVIDIA 1080 Ti GPUs, and I am sure they have the same compute capability. Here is my GPU info:

[image: GPU info]

liming-ai avatar Dec 28 '20 11:12 liming-ai

OK, so that is not the problem.

I just tested your code on my computer, which has one 1080 Ti, and I didn't get the "segmentation fault" at the end of your script.

How did you install the correlation module? From pip? From source?

It might not be the root cause, but I can only advise you to upgrade to 1.7 for now and try to install from this repo with setup.py

ClementPinard avatar Dec 28 '20 12:12 ClementPinard

Thanks a lot! I installed the correlation module from pip. I will upgrade PyTorch to 1.7 tomorrow and reply to you!

liming-ai avatar Dec 28 '20 12:12 liming-ai

Hi @ClementPinard, I installed PyTorch 1.7.1 and then used pip to install the tool. There was no warning or error during installation, but I cannot import the module:

ImportError: /home/liming/anaconda3/envs/ms/lib/python3.8/site-packages/spatial_correlation_sampler_backend.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c104impl23ExcludeDispatchKeyGuardC1ENS_11DispatchKeyE

When I install the module via python setup.py install, the same error as before happens for distributed training:

Traceback (most recent call last):
  File "test.py", line 24, in <module>
    out = model(input1, input2)
  File "/home/liming/anaconda3/envs/ms/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/liming/anaconda3/envs/ms/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in forward
    return self.gather(outputs, self.output_device)
  File "/home/liming/anaconda3/envs/ms/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 174, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/liming/anaconda3/envs/ms/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/home/liming/anaconda3/envs/ms/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/home/liming/anaconda3/envs/ms/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 71, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/liming/anaconda3/envs/ms/lib/python3.7/site-packages/torch/nn/parallel/comm.py", line 230, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: CUDA error: invalid device function
[1]    23818 segmentation fault (core dumped)  python test.py

and for single-GPU training:

torch.Size([1, 1, 1, 3, 3])
[1]    22410 segmentation fault (core dumped)  python test.py

liming-ai avatar Dec 29 '20 11:12 liming-ai
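
An undefined c10 symbol at import time typically means the compiled backend was built against a different PyTorch than the one now installed. Note also that the paths in the messages above mix python3.8 and python3.7. The sketch below is an illustration, not something from the thread: the helper names locate and environment_report are invented here, and only standard-library calls plus torch.__version__ and torch.version.cuda are used. It reports which interpreter, which PyTorch, and which backend file would be picked up, without importing the backend itself.

```python
import importlib.util
import sys
from typing import Dict, Optional


def locate(module_name: str) -> Optional[str]:
    """Return the file a top-level module would load from, without importing it."""
    spec = importlib.util.find_spec(module_name)
    return spec.origin if spec is not None else None


def environment_report() -> Dict[str, Optional[str]]:
    """Collect interpreter, PyTorch, and backend locations (hypothetical helper)."""
    report: Dict[str, Optional[str]] = {
        "python": sys.version.split()[0],
        "executable": sys.executable,
        "torch": None,
        "torch_cuda": None,
        # Module name taken from the ImportError in the comment above.
        "backend_path": locate("spatial_correlation_sampler_backend"),
    }
    try:
        import torch
        report["torch"] = str(torch.__version__)
        report["torch_cuda"] = torch.version.cuda  # None on CPU-only builds
    except ImportError:
        pass  # PyTorch missing from this interpreter is itself useful to know
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

If backend_path points into a different environment or Python version than the executable, or the backend was built before PyTorch was upgraded, uninstalling the package and rebuilding it inside the active environment is a reasonable next step.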

When I use PyTorch 1.1 and install via python setup.py install, there is a different error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/liming/anaconda3/envs/1/lib/python3.7/site-packages/spatial_correlation_sampler-0.3.0-py3.7-linux-x86_64.egg/spatial_correlation_sampler/__init__.py", line 1, in <module>
    from .spatial_correlation_sampler import SpatialCorrelationSampler, spatial_correlation_sample
  File "/home/liming/anaconda3/envs/1/lib/python3.7/site-packages/spatial_correlation_sampler-0.3.0-py3.7-linux-x86_64.egg/spatial_correlation_sampler/spatial_correlation_sampler.py", line 6, in <module>
    import spatial_correlation_sampler_backend as correlation
ImportError: libtorch.so: cannot open shared object file: No such file or directory

It is really odd, and I do not know how to deal with it. Could you provide some suggestions?

liming-ai avatar Dec 29 '20 13:12 liming-ai