Pytorch-Correlation-extension
Correlation always zero if multiple GPUs
The following code snippet reproduces the bug:
import torch
from spatial_correlation_sampler import spatial_correlation_sample

def run_spatial_corr(rank):
    corr = spatial_correlation_sample(torch.ones(1, 512, 12, 27).to(f"cuda:{rank}"),
                                      torch.ones(1, 512, 12, 27).to(f"cuda:{rank}")).mean()
    print(corr)

run_spatial_corr(0)
run_spatial_corr(1)
The expected output is:
tensor(512., device='cuda:0')
tensor(512., device='cuda:1')
However, it returns:
tensor(512., device='cuda:0')
tensor(0., device='cuda:1')
The output is as expected if both tensors use the same device ordinal or if everything runs on the CPU. I ran the code with Python 3.7 and PyTorch 1.2.
Sorry, I don't have a multi-GPU machine, so I can only test with a single GPU.
There have already been some issues with multi-GPU in the past, and since the code should theoretically run without any problem on a second GPU, I don't really know where the issue is, and I have no way to find it.
If anyone with a multi-GPU setup is willing to take care of the multi-GPU issues, I'd be delighted! :p
Were there any updates/fixes for this?
Hi, unfortunately no. I still haven't been able to get my hands on a multi-GPU rig, so I can't test the problem.
Hi, I am currently facing the same issue and trying to debug it. Could you give me some pointers on what to look into first?
The first direction I'd go is to try the official cpp extension example and see if it works:
- https://pytorch.org/tutorials/advanced/cpp_extension.html
- https://github.com/pytorch/extension-cpp
Then, if everything goes well with the official extension example, I'd try this extension with a very simple unit example (only the default parameters, only one kernel launch) and add some print calls to see where everything gets zeroed. I highly suspect that something goes wrong either before the computation, so that only zeros are fed to the correlation kernel, or after the computation, where the copy operation does not occur when retrieving values to e.g. print them (a rough sketch of such a check follows below).
Good luck !
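To make that concrete, here is a minimal C++ sketch of such an instrumented check at the extension boundary. The function name correlation_forward_cuda and its two-tensor signature are assumptions made for illustration, not this extension's actual API:

#include <torch/extension.h>
#include <iostream>

// Assumed declaration of the existing CUDA launcher (name is hypothetical).
torch::Tensor correlation_forward_cuda(torch::Tensor input1, torch::Tensor input2);

// Print the device and a checksum of each input before the kernel runs, and of
// the output after it, to localize where the values become zero.
torch::Tensor correlation_forward_debug(torch::Tensor input1, torch::Tensor input2) {
    std::cout << "input1: " << input1.device()
              << " sum=" << input1.sum().item<float>() << std::endl;
    std::cout << "input2: " << input2.device()
              << " sum=" << input2.sum().item<float>() << std::endl;
    torch::Tensor output = correlation_forward_cuda(input1, input2);
    std::cout << "output: " << output.device()
              << " max=" << output.abs().max().item<float>() << std::endl;
    return output;
}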
I have added prints around the variables of interest. I can see that the forward CUDA kernel is getting called and the 'for' loops are being executed, but the output variable is not getting updated in the case of GPU 1. The output variable is correctly updated with GPU 0.
Ok, thanks for working on this issue! Is there a way to make sure that the output is created on the same device as the input? Is this line the correct way to create the output tensor? https://github.com/ClementPinard/Pytorch-Correlation-extension/blob/master/Correlation_Module/correlation_cuda_kernel.cu#L254
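For what it's worth, a common idiom is to build the output from the input's TensorOptions so that it inherits the input's device and dtype. The snippet below is only a sketch with placeholder sizes, not a claim about what that line currently does:

#include <torch/extension.h>

// Allocating with input1.options() keeps the output on the same device (and
// with the same dtype) as the inputs. The sizes here are illustrative only.
torch::Tensor make_output_like(const torch::Tensor& input1,
                               int64_t patch_h, int64_t patch_w,
                               int64_t out_h, int64_t out_w) {
    return torch::zeros({input1.size(0), patch_h, patch_w, out_h, out_w},
                        input1.options());
}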
Somehow I am unable to print the rInput[] values with printf("%f") statements inside the CUDA kernel in the case of GPU 1, i.e., the print output never appears in the console. It seems to me that the CUDA kernel gets an invalid access (I am not sure); it doesn't even print 0.
But I am able to do it for GPU 0, where the print statements inside the CUDA kernel are printed in the console with the rInput values.
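One way to confirm that suspicion would be to check for a launch error right after the kernel call: if the launch itself fails (for example because the current device does not match the tensors' device), device-side printf output never appears at all. A hedged sketch of such a helper, not code from this repository:

#include <cuda_runtime.h>
#include <cstdio>

// Call immediately after a kernel launch. A failed launch produces no
// device-side printf output, which would match the silent-GPU-1 symptom.
inline void check_launch(const char* name) {
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::printf("%s launch failed: %s\n", name, cudaGetErrorString(err));
    }
    // Synchronizing also flushes device-side printf for launches that did work.
    cudaDeviceSynchronize();
}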
Got it! Here is the solution: https://discuss.pytorch.org/t/c-cuda-extension-with-multiple-gpus/91241
With this workaround, it works! I will incorporate it and open a pull request.
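For readers of this thread, the workaround described at that link amounts to setting the current CUDA device to the input tensor's device before dispatching to the CUDA code, e.g. with at::cuda::OptionalCUDAGuard. The snippet below is a hedged sketch of that idea with an assumed launcher name, not the exact patch from the pull request:

#include <torch/extension.h>
#include <c10/cuda/CUDAGuard.h>

// Assumed declaration of the existing CUDA launcher (name is hypothetical).
torch::Tensor correlation_forward_cuda(torch::Tensor input1, torch::Tensor input2);

torch::Tensor correlation_forward(torch::Tensor input1, torch::Tensor input2) {
    // RAII guard: switches the current CUDA device to input1's device for the
    // duration of this call, so the kernel is launched on the right GPU.
    const at::cuda::OptionalCUDAGuard device_guard(at::device_of(input1));
    return correlation_forward_cuda(input1, input2);
}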
Thanks a lot for researching this problem! I am very happy that this issue is finally solved after more than 2 years :)
I am glad I could be of some help! I have created the pull request. Please review the changes and merge them if the code looks acceptable.
Thanks!
Closing this as it is now fixed.
If you have time, @InnovArul, the pytorch tutorial for cpp extensions could use some advice on multi-GPU CUDA code. I cannot do it myself since I don't have the hardware to make sure the code works, but I'd be happy to review a PR at https://github.com/pytorch/tutorials and https://github.com/pytorch/extension-cpp (I don't have the right to approve the changes, though...)
Do you think we should include OptionalCUDAGuard in the pytorch tutorial's cpp extension?
This help document states the following about ATen code generation:
By default, ATen code generation will generate a DeviceGuard invocation, which will ensure that kernel code will run with the current device set to match the device of the first Tensor argument (or first tensor of the first Tensor[] argument, if the function takes a list of tensors). For the most part, this means kernel authors do not have to worry about setting devices.
I am not sure why ATen code generation did not work in this case. What do you think?
UPDATE: I think I misunderstood the help document. It describes PyTorch's own code generation, not custom extensions.
Similar GitHub issue: https://github.com/pytorch/tutorials/issues/431
I have a fundamental doubt.
Should custom kernel creators take care of setting the guard themselves (in which case we can add it to the pytorch tutorial)? Or should pytorch take care of it internally in some way, or at least provide a user-friendly API to set the device?
There is indeed something fishy here. We should clarify this on the pytorch repo. Maybe we'll see what they have to say regarding tutorial issues such as the one you linked or https://github.com/pytorch/tutorials/issues/1196
I believe this should be closed based on the final comments in pytorch/tutorials#1196