Enable support for Intel XPU devices (AKA Intel GPUs)
What does this PR do?
This PR extends pytorch_lightning with support for Intel GPUs, as enabled by `intel_extension_for_pytorch`. With Intel's extension, PyTorch gains a `torch.xpu` module that is equivalent to `torch.cuda`.
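The snippet below is a minimal sketch of that equivalence, assuming the extension has been installed; the exact `torch.xpu` attribute surface follows the description above and should be treated as illustrative rather than guaranteed.

```python
import torch
import intel_extension_for_pytorch  # noqa: F401  (registers the torch.xpu namespace)

# Guard with hasattr because torch.xpu only exists once the extension is imported.
if hasattr(torch, "xpu") and torch.xpu.is_available():
    print(torch.xpu.device_count())        # analogous to torch.cuda.device_count()
    x = torch.ones(4, 4, device="xpu")     # tensors can target the "xpu" device
else:
    x = torch.ones(4, 4)                   # CPU fallback
```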
Throughout the pytorch_lightning repository, wherever `cuda` is explicitly mentioned I tried to include equivalent functionality for `xpu`. In some cases I declined to extend support to `xpu` where I was not sure it would work or be worth it: for example, `BitsAndBytes`, which I know very little about, so I decided not to add `xpu` support there. The main enablements are the `XPUAccelerator` and the logic to manage `xpu` devices in PyTorch DDP.
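For context, this is roughly how the new accelerator might be selected from a user's perspective; the `"xpu"` registration name and this exact invocation are assumptions based on how existing accelerators (e.g. `"cuda"`) are chosen, not a confirmed API of this PR.

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="xpu",   # assumed registration name for XPUAccelerator
    devices=2,
    strategy="ddp",      # DDP over the ccl backend, as described below
)
```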
In the distributed case, Intel provides the `ccl` backend for collective communications instead of `nccl`. There is a known bug that I encountered while testing: calling `torch.distributed.broadcast` with a list of strings induces a hang. I currently wrap that call with an explicit check against this case, which isn't ideal, but it does enable DDP on XPUs.
Both `xpu` and `ccl` are currently extensions to PyTorch and must be loaded dynamically: `torch.xpu` becomes available with `import intel_extension_for_pytorch`, and the `ccl` backend for `torch.distributed` becomes available with `import oneccl_bindings_for_pytorch`. Because of this, I have in most cases done one of the following (sketched after the list):
- In locations where I am reasonably sure `xpu` is initialized, I use it freely.
- When calling `torch.distributed.init_process_group`, since the target backend is known at that point, I intercept and ensure the oneCCL bindings are loaded.
- If I want to use `torch.xpu` and can't be sure it's available, I include logic analogous to the CUDA checks: instead of `if torch.cuda.is_available(): ...` I do `if hasattr(torch, "xpu") and torch.xpu.is_available(): ...`
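Illustrative sketches of the last two patterns; the helper names are hypothetical and only meant to show the shape of the logic.

```python
import importlib
import torch
import torch.distributed as dist

def xpu_available() -> bool:
    # Mirrors the CUDA check: torch.xpu only exists after the Intel
    # extension has been imported, so guard with hasattr first.
    return hasattr(torch, "xpu") and torch.xpu.is_available()

def init_ccl_process_group(**kwargs) -> None:
    # Ensure the oneCCL bindings are imported so the "ccl" backend is
    # registered with torch.distributed before the process group is created.
    importlib.import_module("oneccl_bindings_for_pytorch")
    dist.init_process_group(backend="ccl", **kwargs)
```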
This PR was not intended to introduce any breaking changes.
I think this PR needs some discussion before we even ask "should it be merged":
- I don't have any XPU tests included. I don't know whether you have hardware available to test on, and while I'm happy to run tests case by case, I can't offer access to XPU hardware myself.
- I am not sure what tests DO run. I'm expecting this PR to trigger your automatic test suite and I'll find out what, if anything, I've broken :).
- I haven't updated anything in the CHANGELOG. I'd like to understand where the tests stand before doing so.
📚 Documentation preview 📚: https://pytorch-lightning--19443.org.readthedocs.build/en/19443/
Hi @coreyjadams, there is a long-standing PR for XPU support from us (https://github.com/Lightning-AI/pytorch-lightning/pull/17700) which we are planning to integrate soon. We are already in discussions about this and would appreciate it if you used that branch for the time being, until it gets merged. Please also feel free to set up an offline discussion with us (I work with Venkat/Sam and others on LLMs from Intel).