Multi Process GPU Training [Feature Request]

Open s769 opened this issue 3 years ago • 9 comments

Is there a way to train on multiple GPUs across multiple processes (i.e. through torch.nn.parallel.DistributedDataParallel)?

s769 avatar Oct 06 '22 12:10 s769

There is support for this via https://github.com/cornellius-gp/gpytorch/blob/master/gpytorch/kernels/multi_device_kernel.py (see the tutorial notebook here).

Note that if your kernel is large enough to use checkpointing, you may be better off using KeOps on a single GPU just due to overhead: https://github.com/cornellius-gp/gpytorch/blob/master/examples/02_Scalable_Exact_GPs/KeOps_GP_Regression.ipynb

jacobrgardner avatar Oct 06 '22 13:10 jacobrgardner

Oh, I see -- I missed the "multiple processes" bit, my bad! That's currently not supported. It might be possible to extend DistributedDataParallel in the same way that MultiDeviceKernel extends DataParallel.

jacobrgardner avatar Oct 06 '22 13:10 jacobrgardner

@s769 we'd be open to a PR, if you'd be willing to implement this!

gpleiss avatar Oct 10 '22 21:10 gpleiss

I tried to run the code from the tutorial but got an error. The output covariance matrix is on multiple GPUs instead of output_device.

XiankangTang avatar Nov 14 '23 14:11 XiankangTang

I get the same error. Did you manage to fix it?

nikitrian avatar Dec 06 '23 21:12 nikitrian

No, I gave up. The error I was getting is that a lazy tensor spread across multiple GPUs could not be gathered onto the output device. The approach I used later was to split the image into pieces, run Gaussian process regression on each piece separately, and finally assemble the results on the output device.

XiankangTang avatar Dec 07 '23 08:12 XiankangTang
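
For readers who want to try the split-and-assemble workaround described above, here is a minimal sketch of its structure. It is not XiankangTang's code and does not use GPyTorch. It assumes 1-D inputs rather than an image, an RBF kernel with fixed hyperparameters, and a zero prior mean. All function names (`rbf`, `solve`, `gp_posterior_mean`, `chunked`, `piecewise_gp`) are hypothetical and exist only for illustration. In practice each chunk would presumably be its own GPyTorch model placed on a separate GPU.

```python
import math


def rbf(a, b, lengthscale=1.0):
    """Squared-exponential (RBF) kernel between two scalars."""
    return math.exp(-0.5 * ((a - b) / lengthscale) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def gp_posterior_mean(train_x, train_y, test_x, noise=1e-4):
    """Exact GP posterior mean, zero prior mean: k_*^T (K + noise*I)^-1 y."""
    n = len(train_x)
    K = [[rbf(train_x[i], train_x[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, train_y)
    return [sum(rbf(xs, train_x[i]) * alpha[i] for i in range(n))
            for xs in test_x]


def chunked(seq, size):
    """Split a list into contiguous pieces of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def piecewise_gp(train_x, train_y, chunk_size):
    """Fit one independent GP per chunk, then concatenate the predictions."""
    out = []
    for cx, cy in zip(chunked(train_x, chunk_size),
                      chunked(train_y, chunk_size)):
        # Each chunk only ever builds a chunk_size x chunk_size kernel matrix.
        out.extend(gp_posterior_mean(cx, cy, cx))
    return out


# Toy data (assumed for illustration): a sine curve sampled at 40 points.
xs = [0.1 * i for i in range(40)]
ys = [math.sin(x) for x in xs]
pred = piecewise_gp(xs, ys, chunk_size=10)
```

Because each piece is fitted independently, correlations across chunk boundaries are ignored, so this is an approximation to a single exact GP over all the data, not a replacement for multi-GPU kernel evaluation.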

Are there any updates on why the tutorial notebook fails?

JoachimSchaeffer avatar Feb 20 '24 18:02 JoachimSchaeffer

No, I have not made any progress.

XiankangTang avatar Feb 21 '24 09:02 XiankangTang