Pytorch-Correlation-extension
CUDA error with PyTorch distributed training...
Hi, thanks for your contribution. When I use distributed (multi-GPU) training, I always get RuntimeError: CUDA error: invalid device function. Here is my test code:
import torch
from spatial_correlation_sampler import SpatialCorrelationSampler
device = "cuda"
batch_size = 1
channel = 1
H = 10
W = 10
dtype = torch.float32
input1 = torch.randint(1, 4, (batch_size, channel, H, W), dtype=dtype, device=device, requires_grad=True)
input2 = torch.randint_like(input1, 1, 4).requires_grad_(True)
correlation_sampler = SpatialCorrelationSampler(
    kernel_size=3,
    patch_size=1,
    stride=2,
    padding=0,
    dilation=2,
    dilation_patch=1)
model = torch.nn.DataParallel(correlation_sampler, device_ids=[0,1,2]).cuda()
out = model(input1, input2)
print(out.shape)
My environment is:
Ubuntu 18.04.5 LTS
PyTorch -- 1.6.0
torchvision -- 0.7.0
gcc -- 7.5.0
CUDA -- 10.2
The whole error info is:
Traceback (most recent call last):
File "test.py", line 24, in <module>
out = model(input1, input2)
File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 156, in forward
return self.gather(outputs, self.output_device)
File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
return gather(outputs, output_device, dim=self.dim)
File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
res = gather_map(outputs)
File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
return comm.gather(inputs, ctx.dim, ctx.target_device)
File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/cuda/comm.py", line 166, in gather
return torch._C._gather(tensors, dim, destination)
RuntimeError: CUDA error: invalid device function
[1] 20866 segmentation fault (core dumped) python test.py
For non-distributed (single-GPU) training there is no error, but the output still ends with a strange message:
torch.Size([1, 1, 1, 3, 3])
[1] 22742 segmentation fault (core dumped) python test.py
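The printed shape itself looks as expected, so the only anomaly in this run appears to be the segmentation fault at exit. Below is a minimal sketch of the arithmetic. The helper name correlation_output_size is hypothetical, and the formula is the usual convolution output-size rule, assumed (not confirmed from the module's source) to apply here because it reproduces the reported 3 x 3 output.

```python
def correlation_output_size(size, kernel_size, stride, padding, dilation):
    """Standard convolution output-size rule (assumed to match the sampler)."""
    # A dilated kernel spans dilation * (kernel_size - 1) + 1 input pixels.
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


# Parameters taken from the test script above: H = W = 10, patch_size = 1.
h_out = correlation_output_size(10, kernel_size=3, stride=2, padding=0, dilation=2)
w_out = correlation_output_size(10, kernel_size=3, stride=2, padding=0, dilation=2)

# Layout assumed to be (batch, patch_h, patch_w, h_out, w_out).
expected_shape = (1, 1, 1, h_out, w_out)
print(expected_shape)  # (1, 1, 1, 3, 3)
```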
Hi, what hardware are you using? It looks like you have different GPUs and the module was only built for the first GPU, which does not have the same compute capability as one of your other two GPUs.
See an interesting PR about it here
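One way to check this hypothesis is to list the compute capability of every visible GPU. The sketch below assumes PyTorch with CUDA support is installed. torch.cuda.device_count and torch.cuda.get_device_capability are existing PyTorch calls, while the two helper names are hypothetical.

```python
def distinct_capabilities(capabilities):
    """Return the sorted set of (major, minor) compute capabilities."""
    return sorted(set(capabilities))


def visible_gpu_capabilities():
    """Query every visible GPU; returns an empty list when there is none."""
    import torch  # imported lazily so the helper above works without PyTorch

    return [
        torch.cuda.get_device_capability(index)
        for index in range(torch.cuda.device_count())
    ]


# Illustrative input: three GPUs that all report capability 6.1
# (the value for a GTX 1080 Ti).
print(distinct_capabilities([(6, 1), (6, 1), (6, 1)]))
```

On the affected machine, distinct_capabilities(visible_gpu_capabilities()) returning more than one entry would point at mixed hardware. A single entry rules this explanation out.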
Hi, I have 3 NVIDIA 1080 Ti GPUs, and I am sure they have the same compute capability... Here is my GPU info:
OK, so that is not the problem.
I just tested your code on my computer, which has one 1080 Ti, and I didn't get the "segmentation fault" at the end of your script.
How did you install the correlation module? From pip? From source?
It might not be the root cause, but for now I can only advise you to upgrade to 1.7 and try installing from this repo with setup.py
Thanks a lot! I installed the correlation module from pip. I will upgrade PyTorch to 1.7 tomorrow and reply to you!
Hi @ClementPinard, I installed PyTorch 1.7.1 and then used pip to install the module. There was no warning or error during installation, but I cannot import it:
ImportError: /home/liming/anaconda3/envs/ms/lib/python3.8/site-packages/spatial_correlation_sampler_backend.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c104impl23ExcludeDispatchKeyGuardC1ENS_11DispatchKeyE
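The missing symbol in this ImportError can be partly decoded by hand. The sketch below reads only the leading length-prefixed identifiers of an Itanium C++ ABI mangled name. It is not a full demangler (c++filt does this properly), and the helper name is hypothetical. For this symbol it yields the c10::impl namespace, which belongs to PyTorch's core library. That would be consistent with an extension binary compiled against a different PyTorch version than the one being imported, although nothing in this thread confirms that cause.

```python
import re
from typing import List


def nested_name_components(mangled: str) -> List[str]:
    """Return the leading <length><identifier> parts of a mangled nested name."""
    if not mangled.startswith("_ZN"):
        raise ValueError("not an Itanium-mangled nested name")
    position = 3  # skip the "_ZN" prefix
    parts = []
    while position < len(mangled) and mangled[position].isdigit():
        # Each component is a decimal length followed by that many characters.
        length_text = re.match(r"\d+", mangled[position:]).group()
        start = position + len(length_text)
        end = start + int(length_text)
        parts.append(mangled[start:end])
        position = end
    return parts


symbol = "_ZN3c104impl23ExcludeDispatchKeyGuardC1ENS_11DispatchKeyE"
print("::".join(nested_name_components(symbol)))
# c10::impl::ExcludeDispatchKeyGuard
```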
When I install the module via python setup.py install, the same error happens. For distributed training:
Traceback (most recent call last):
File "test.py", line 24, in <module>
out = model(input1, input2)
File "/home/liming/anaconda3/envs/ms/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/liming/anaconda3/envs/ms/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in forward
return self.gather(outputs, self.output_device)
File "/home/liming/anaconda3/envs/ms/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 174, in gather
return gather(outputs, output_device, dim=self.dim)
File "/home/liming/anaconda3/envs/ms/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
res = gather_map(outputs)
File "/home/liming/anaconda3/envs/ms/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/home/liming/anaconda3/envs/ms/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 71, in forward
return comm.gather(inputs, ctx.dim, ctx.target_device)
File "/home/liming/anaconda3/envs/ms/lib/python3.7/site-packages/torch/nn/parallel/comm.py", line 230, in gather
return torch._C._gather(tensors, dim, destination)
RuntimeError: CUDA error: invalid device function
[1] 23818 segmentation fault (core dumped) python test.py
and for single-GPU training:
torch.Size([1, 1, 1, 3, 3])
[1] 22410 segmentation fault (core dumped) python test.py
When I use PyTorch 1.1 and install via python setup.py install, there is another error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/liming/anaconda3/envs/1/lib/python3.7/site-packages/spatial_correlation_sampler-0.3.0-py3.7-linux-x86_64.egg/spatial_correlation_sampler/__init__.py", line 1, in <module>
from .spatial_correlation_sampler import SpatialCorrelationSampler, spatial_correlation_sample
File "/home/liming/anaconda3/envs/1/lib/python3.7/site-packages/spatial_correlation_sampler-0.3.0-py3.7-linux-x86_64.egg/spatial_correlation_sampler/spatial_correlation_sampler.py", line 6, in <module>
import spatial_correlation_sampler_backend as correlation
ImportError: libtorch.so: cannot open shared object file: No such file or directory
It is really odd and I do not know how to deal with it. Could you provide some suggestions?