
Enable support for Intel XPU devices (AKA Intel GPUs)

Open · coreyjadams opened this pull request 6 months ago · 1 comment

What does this PR do?

This PR extends pytorch_lightning with support for Intel GPUs, enabled via intel_extension_for_pytorch. With Intel's extension loaded, PyTorch gains a torch.xpu module that is equivalent to torch.cuda.
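For illustration, here is a minimal sketch (not taken from the PR) of the torch.cuda-style surface the extension registers, assuming intel_extension_for_pytorch is installed:

```python
import torch
import intel_extension_for_pytorch  # noqa: F401  -- registers torch.xpu as a side effect

print(torch.xpu.is_available())   # mirrors torch.cuda.is_available()
print(torch.xpu.device_count())   # mirrors torch.cuda.device_count()

# Tensors and modules move to XPUs with the familiar device strings:
model = torch.nn.Linear(8, 8).to("xpu")
x = torch.randn(4, 8, device="xpu")
y = model(x)
```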

Throughout the pytorch_lightning repository, in places where cuda is explicitly mentioned I tried to include equivalent functionality for xpu. In some cases, I declined to extend support to xpu where I was not sure it would work / be worth it: for example, there is BitsAndBytes which I know very little about, and I decided not to add xpu. The main enablements are XPUAccelerator and including logic to manage xpus in pytorch DDP.

In the distributed case, Intel provides the ccl backend for collective communications in place of nccl. There is a known bug I encountered while testing: calling torch.distributed.broadcast with a list of strings induces a hang. For now I have wrapped that call with an explicit check against this case, which isn't ideal, but it does enable DDP on XPUs.
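The PR's exact workaround isn't shown here, but one shape such a guard could take (safe_broadcast_object_list and the byte-tensor fallback are my own hypothetical names and details, not the PR's code):

```python
import pickle

import torch
import torch.distributed as dist


def safe_broadcast_object_list(objects: list, src: int = 0) -> list:
    """Broadcast Python objects; route strings around the ccl hang."""
    if not (dist.get_backend() == "ccl" and any(isinstance(o, str) for o in objects)):
        dist.broadcast_object_list(objects, src=src)
        return objects

    # Fallback: pickle the payload and broadcast it as a plain uint8 tensor.
    # Rank `src` serializes; every rank first learns the payload length...
    if dist.get_rank() == src:
        payload = torch.frombuffer(bytearray(pickle.dumps(objects)), dtype=torch.uint8)
        length = torch.tensor([payload.numel()], dtype=torch.long)
    else:
        length = torch.zeros(1, dtype=torch.long)
    dist.broadcast(length, src=src)

    # ...then receives the bytes themselves and unpickles them in place.
    if dist.get_rank() != src:
        payload = torch.empty(int(length.item()), dtype=torch.uint8)
    dist.broadcast(payload, src=src)
    objects[:] = pickle.loads(bytes(payload.tolist()))
    return objects
```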

Both xpu and ccl are currently extensions to PyTorch and must be loaded dynamically: torch.xpu becomes available with import intel_extension_for_pytorch, and the ccl backend for torch.distributed becomes available with import oneccl_bindings_for_pytorch. Because of this, I have done one of the following in each case (see the sketch after this list):

  • In locations where I'm mostly sure xpu is initialized, I use it freely.
  • When calling torch.distributed.init_process_group, since the target backend is known at that point, I intercept the call and ensure the oneccl bindings are loaded.
  • If I want to use torch.xpu but can't be sure it's available, I include logic analogous to the cuda checks: instead of if torch.cuda.is_available(): ... I do if hasattr(torch, "xpu") and torch.xpu.is_available(): ...
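Here is the promised sketch of the last two patterns (xpu_available and init_distributed are hypothetical helper names, not functions from the PR):

```python
import torch


def xpu_available() -> bool:
    # torch.xpu only exists after `import intel_extension_for_pytorch`,
    # so guard with hasattr before querying it, analogous to torch.cuda.
    return hasattr(torch, "xpu") and torch.xpu.is_available()


def init_distributed(backend: str, **kwargs) -> None:
    # Process-group creation names the backend, so this is a natural place
    # to ensure the ccl backend has been registered with torch.distributed.
    if backend == "ccl":
        import oneccl_bindings_for_pytorch  # noqa: F401
    torch.distributed.init_process_group(backend=backend, **kwargs)
```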

This PR was not intended to introduce any breaking changes.

I think this PR needs some discussion before we even ask "should it be merged":

  • I haven't included any XPU tests. I don't know whether you have hardware available for testing; I'm happy to run tests case by case, but I can't offer open access to XPU hardware myself.
  • I am not sure what tests DO run. I expect this PR to trigger your automatic test suite, and I'll find out what, if anything, I've broken :).
  • I haven't updated anything in the CHANGELOG. I'd like to understand where the tests stand before doing so.

📚 Documentation preview 📚: https://pytorch-lightning--19443.org.readthedocs.build/en/19443/

coreyjadams · Feb 09 '24 22:02

Hi @coreyjadams, there is a long-standing PR for XPU support from us, https://github.com/Lightning-AI/pytorch-lightning/pull/17700, which we are planning to integrate soon. We are already in discussions regarding this, and in the meantime we would appreciate it if you used that branch until it gets merged. Please also feel free to set up an offline discussion with us (I work with Venkat/Sam and others on LLMs from Intel).

abhilash1910 · Feb 15 '24 03:02