Pytorch-Correlation-extension
Correlation always zero if multiple GPUs
The following code snippet reproduces the bug:
import torch
from spatial_correlation_sampler import spatial_correlation_sample

def run_spatial_corr(rank):
    corr = spatial_correlation_sample(torch.ones(1, 512, 12, 27).to(f"cuda:{rank}"),
                                      torch.ones(1, 512, 12, 27).to(f"cuda:{rank}")).mean()
    print(corr)

run_spatial_corr(0)
run_spatial_corr(1)
The expected output is:
tensor(512., device='cuda:0')
tensor(512., device='cuda:1')
However, it returns:
tensor(512., device='cuda:0')
tensor(0., device='cuda:1')
The output is as expected if both tensors use the same device ordinal or if everything runs on the CPU. I ran the code with Python 3.7 and PyTorch 1.2.
Sorry, I don't have a multi-GPU machine, so I can only test with a single GPU.
There have already been some issues with multi-GPU in the past, and since the code should theoretically run without any problem on a second GPU, I don't really know where the issue is, and I have no way to find it.
If anyone with a multi-GPU setup is willing to take care of the multi-GPU issues, I'd be delighted! :p
Were there any updates/fixes for this?
Hi, unfortunately no. I still haven't been able to get my hands on a multi-GPU rig, so I can't test the problem.
Hi, I am currently facing the same issue and trying to debug it. Could you give me some pointers on what to look into first?
The first direction I'd go is to try the official cpp extension example and see if it works:
- https://pytorch.org/tutorials/advanced/cpp_extension.html
- https://github.com/pytorch/extension-cpp
Then, if everything goes well with the official extension example, I'd try this extension with a very simple unit example (only the default parameters, only one kernel launch) and add some print calls to see where everything gets zeroed. I highly suspect that something goes wrong either before the computation, so that only zeros are fed to the correlation kernel, or after the computation, where the copy operation does not occur when retrieving values to e.g. print them (a rough sketch of such a check follows below).
Good luck !
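To make that concrete, here is a minimal C++ sketch of such an instrumented check at the extension boundary. The function name correlation_forward_cuda and its two-tensor signature are assumptions made for illustration, not this extension's actual API:

#include <torch/extension.h>
#include <iostream>

// Assumed declaration of the existing CUDA launcher (name is hypothetical).
torch::Tensor correlation_forward_cuda(torch::Tensor input1, torch::Tensor input2);

// Print the device and a checksum of each input before the kernel runs, and of
// the output after it, to localize where the values become zero.
torch::Tensor correlation_forward_debug(torch::Tensor input1, torch::Tensor input2) {
    std::cout << "input1: " << input1.device()
              << " sum=" << input1.sum().item<float>() << std::endl;
    std::cout << "input2: " << input2.device()
              << " sum=" << input2.sum().item<float>() << std::endl;
    torch::Tensor output = correlation_forward_cuda(input1, input2);
    std::cout << "output: " << output.device()
              << " max=" << output.abs().max().item<float>() << std::endl;
    return output;
}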
I have added prints around the variables of interest. I can see that the forward CUDA kernel is getting called and the 'for' loops are being executed, but the output variable is not getting updated in the case of GPU 1. The output variable is correctly updated with GPU 0.
Ok, thanks for working on this issue! Is there a way to make sure that the output is created on the same device as the input? Is this line the correct way to create the output tensor? https://github.com/ClementPinard/Pytorch-Correlation-extension/blob/master/Correlation_Module/correlation_cuda_kernel.cu#L254
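For what it's worth, a common idiom is to build the output from the input's TensorOptions so that it inherits the input's device and dtype. The snippet below is only a sketch with placeholder sizes, not a claim about what that line currently does:

#include <torch/extension.h>

// Allocating with input1.options() keeps the output on the same device (and
// with the same dtype) as the inputs. The sizes here are illustrative only.
torch::Tensor make_output_like(const torch::Tensor& input1,
                               int64_t patch_h, int64_t patch_w,
                               int64_t out_h, int64_t out_w) {
    return torch::zeros({input1.size(0), patch_h, patch_w, out_h, out_w},
                        input1.options());
}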
Somehow I am unable to print the rInput[] values with printf("%f") statements inside the CUDA kernel in the case of GPU 1, i.e., the print output never appears in the console. It seems to me that the CUDA kernel gets an invalid access (I am not sure); it doesn't even print 0.
But I am able to do it for GPU 0, where the print statements inside the CUDA kernel are printed in the console with the rInput values.
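One way to confirm that suspicion would be to check for a launch error right after the kernel call: if the launch itself fails (for example because the current device does not match the tensors' device), device-side printf output never appears at all. A hedged sketch of such a helper, not code from this repository:

#include <cuda_runtime.h>
#include <cstdio>

// Call immediately after a kernel launch. A failed launch produces no
// device-side printf output, which would match the silent-GPU-1 symptom.
inline void check_launch(const char* name) {
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::printf("%s launch failed: %s\n", name, cudaGetErrorString(err));
    }
    // Synchronizing also flushes device-side printf for launches that did work.
    cudaDeviceSynchronize();
}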
Got it! Here is the solution: https://discuss.pytorch.org/t/c-cuda-extension-with-multiple-gpus/91241
With this workaround, it works! I will incorporate it and open a pull request.
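For readers of this thread, the workaround described at that link amounts to setting the current CUDA device to the input tensor's device before dispatching to the CUDA code, e.g. with at::cuda::OptionalCUDAGuard. The snippet below is a hedged sketch of that idea with an assumed launcher name, not the exact patch from the pull request:

#include <torch/extension.h>
#include <c10/cuda/CUDAGuard.h>

// Assumed declaration of the existing CUDA launcher (name is hypothetical).
torch::Tensor correlation_forward_cuda(torch::Tensor input1, torch::Tensor input2);

torch::Tensor correlation_forward(torch::Tensor input1, torch::Tensor input2) {
    // RAII guard: switches the current CUDA device to input1's device for the
    // duration of this call, so the kernel is launched on the right GPU.
    const at::cuda::OptionalCUDAGuard device_guard(at::device_of(input1));
    return correlation_forward_cuda(input1, input2);
}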
Thanks a lot for researching this problem! I am very happy that this issue is finally solved after more than 2 years :)
I am glad I could be of some help! I have created the pull request. Please review the changes and merge them if the code looks acceptable.
Thanks!
Closing this as it is now fixed.
If you have time, @InnovArul, the pytorch tutorial for cpp extensions could use some advice on multi-GPU CUDA code. I cannot do it myself since I don't have the hardware to make sure the code works, but I'd be happy to review a PR at https://github.com/pytorch/tutorials and https://github.com/pytorch/extension-cpp (I don't have the right to approve the changes, though...)
Do you think we should include OptionalCUDAGuard in the pytorch tutorial's cpp extension?
This help document states the following about ATen code generation:
By default, ATen code generation will generate a DeviceGuard invocation, which will ensure that kernel code will run with the current device set to match the device of the first Tensor argument (or first tensor of the first Tensor[] argument, if the function takes a list of tensors). For the most part, this means kernel authors do not have to worry about setting devices.
I am not sure why ATen code generation did not work in this case. What do you think?
UPDATE: I think I misunderstood the help document. It describes PyTorch's own code generation, not custom extensions.
Similar GitHub issue: https://github.com/pytorch/tutorials/issues/431
I have a fundamental doubt.
Should custom kernel creators take care of setting the guard themselves (in which case we can add it to the pytorch tutorial)? Or should pytorch take care of it internally in some way, or at least provide a user-friendly API to set the device?
There is indeed something fishy here. We should clarify this on the pytorch repo. Maybe we'll see what they have to say regarding tutorial issues such as the one you linked or https://github.com/pytorch/tutorials/issues/1196
I believe this should be closed based on the final comments in pytorch/tutorials#1196