Enable Intel GPU
This PR is migrated from gpt-fast #79. It adds initial support for Intel GPU in torch-ao via the device option "xpu" (i.e., --device "xpu"). Currently, both BF16 and INT8 are functionally supported under eager mode and compile mode. INT4 support and further performance improvements are WIP.
Here are the steps to run Llama2-7b and Llama3-8b generation on Intel GPU with torch-ao; a minimal Python sketch of the quantization options follows the launch commands below. We will update the tutorial later as performance improves.
Launch
- command for BF16: `python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --write_result benchmark_results.txt --device xpu --precision torch.bfloat16`
- command for INT8 dynamic quantization: `python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --write_result benchmark_results.txt --device xpu --quantization int8dq`
- command for INT8 weight-only quantization: `python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --write_result benchmark_results.txt --device xpu --quantization int8wo`
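For reference, here is a minimal Python sketch of the two INT8 options mapped to the `--quantization` flags above. It is illustrative only: it assumes a PyTorch build with XPU support (`torch.xpu`), a torchao version that exposes `quantize_`, `int8_weight_only`, and `int8_dynamic_activation_int8_weight`, and (for compile mode) a Triton backend for XPU; the toy model and sizes are placeholders, not the Llama model.

```python
import torch
import torch.nn as nn
from torchao.quantization import (
    quantize_,
    int8_weight_only,
    int8_dynamic_activation_int8_weight,
)

# Fall back to CPU so the sketch still runs on machines without an Intel GPU.
device = "xpu" if torch.xpu.is_available() else "cpu"

def make_model() -> nn.Module:
    # Stand-in for the Llama transformer; only the nn.Linear layers get quantized.
    return nn.Sequential(
        nn.Linear(4096, 11008),
        nn.SiLU(),
        nn.Linear(11008, 4096),
    ).to(device=device, dtype=torch.bfloat16)

# --quantization int8wo: INT8 weight-only quantization, applied in place.
m_wo = make_model()
quantize_(m_wo, int8_weight_only())

# --quantization int8dq: INT8 dynamic activation + INT8 weight quantization.
m_dq = make_model()
quantize_(m_dq, int8_dynamic_activation_int8_weight())

# Compile mode (on XPU this relies on intel/intel-xpu-backend-for-triton).
m_wo = torch.compile(m_wo)

x = torch.randn(1, 4096, device=device, dtype=torch.bfloat16)
with torch.no_grad():
    print(m_wo(x).shape, m_dq(x).shape)
```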
Thanks for your PR @dbyoung18. My preference here would be to land generic accelerator memory APIs in core and then use those. That way we wouldn't need to ask people who are trying to use Intel GPUs to change their code; it would be something like torch.get_accelerator().max_memory_reserved() or torch.accelerator.max_memory_reserved()
@guangyey is doing some work on this at Intel and can share more information on the current plan of record
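To make that concrete, below is a hedged sketch of the per-backend dispatch that such a device-agnostic API would replace. It assumes a PyTorch build where the `torch.xpu` memory-stats APIs have landed (see the PRs Intel mentions below); the helper name is hypothetical, and the unified `torch.accelerator` spelling above is the proposal, not something this sketch relies on.

```python
import torch

def max_memory_reserved(device: torch.device) -> int:
    # Hypothetical helper: pick the backend-specific caching-allocator statistic.
    # A device-agnostic torch.accelerator memory API would make this dispatch unnecessary.
    if device.type == "cuda":
        return torch.cuda.max_memory_reserved(device)
    if device.type == "xpu":
        return torch.xpu.max_memory_reserved(device)
    return 0  # CPU and other backends: no caching-allocator stats to report
```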
Hi @msaroufim and @dbyoung18, let me explain the plan. In the long term, we have a proposal to provide device-agnostic APIs for each accelerator. We would like to start with the runtime device and stream components, and then gradually cover the allocator memory APIs. Our RFC is "[RFC] A device-agnostic Python runtime API design for stream-based accelerators". In the short term, for XPU, we plan to provide those memory APIs first so that customer usage is not blocked. We have prepared a series of PRs to implement them; see #129919, which will land soon if everything goes well.
Converting to draft for now, pending #129919 being ready.
@dbyoung18 , may I know why the change is torchao/_models/llama/generate.py only?
Hi @EikanWang. We have a plan to gradually support torch-ao on Intel GPU across different models (Llama2, Llama3, SAM, etc.) and different features (BF16/INT8/INT4/FP8, etc.). As the first step, we chose Llama2 and Llama3 BF16 as the starting point. With this PR, Llama2-7b and Llama3-8b can run BF16 on Intel GPU under both eager mode and compile mode by passing --device xpu to the launch commands, and INT8 can be launched under compile mode with intel/intel-xpu-backend-for-triton. We are also working to upstream INT8/INT4/FP8 support on Intel GPU with oneDNN to PyTorch core. Once that upstreaming lands in stock PyTorch, we will continue our contributions to torch-ao to make the library more available and powerful across platforms.
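As a quick smoke test of the eager vs. compile paths described here (not the full generate.py flow; it assumes a PyTorch build with XPU support and, for the compiled path, a Triton backend for XPU):

```python
import torch

# Fall back to CPU so the check still runs without an Intel GPU.
device = "xpu" if torch.xpu.is_available() else "cpu"

def mlp(x, w1, w2):
    # A tiny BF16 MLP standing in for a transformer feed-forward block.
    return torch.nn.functional.silu(x @ w1) @ w2

x  = torch.randn(8, 4096, device=device, dtype=torch.bfloat16)
w1 = torch.randn(4096, 11008, device=device, dtype=torch.bfloat16)
w2 = torch.randn(11008, 4096, device=device, dtype=torch.bfloat16)

eager_out = mlp(x, w1, w2)                    # eager mode
compiled_out = torch.compile(mlp)(x, w1, w2)  # compile mode (Triton on XPU)

print(torch.allclose(eager_out, compiled_out, atol=1e-2, rtol=1e-2))
```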
I'm already planning to write an RFC on how we'll support more hardware architectures. Right now ao is very NVIDIA-centric, but a lot of recent issues have been about supporting more hardware architectures on more operating systems. We need to think carefully about generalizing devices, CI/testing, and performance.
We are working on a device-agnostic runtime API for accelerators. It may help ao support more hardware architectures.
@malfet , @msaroufim FYI - https://dev-discuss.pytorch.org/t/python-c-api-rules-for-device-generic-apis/2511
@EikanWang are there any GitHub runners for Intel GPUs to ensure our test suite works? We don't have to run the code per commit, but at least a nightly check so we understand what works and what doesn't would be helpful.
@msaroufim , may I know if two runners for Intel GPUs are good enough now for the ao nightly?
Yup that should be fine! We won't be running on Intel runners per commit for now. cc @atalman @seemethere as well
Sounds good! We will add two runners to Intel GPU CI/CD resource pool and reserve the two runners dedicated to ao nightly.
cc @riverliuintel, @chuanqi129
Currently, we have 16 PyTorch organization-level XPU runners with the label "linux.idc.xpu" used for PyTorch CI/CD. I think the torchao repo can use them directly.
@chuanqi129 , @riverliuintel, any update?
They'd also need to be hooked up to the Nova workflows; see #999, which ran into some issues.
@msaroufim, may I know what "Nova workflows" means? Is it an ao-specific workflow?
@EikanWang we leverage some reusable GitHub workflows (https://github.com/pytorch/ao/blob/main/.github/workflows/regression_test.yml#L68) produced by pytorch/test-infra; this lets us easily build and test ao on multiple architectures and devices.
We could potentially do a one-off run of our test suite to see what works in ao out of the box today, but it will be hard to track progress without the CI integration.
As for how to integrate with the Nova workflows, your best bet is to reach out to @seemethere and @atalman on the Intel Slack channel. Feel free to tag me there as well so we can move faster.
@dbyoung18 does this one support int4 woq ?
Currently, it doesn't support INT4 weight-only quantization on Intel GPU. We are in the process of upstreaming the INT4 XPU backend to PyTorch (targeting v2.5). Once that upstream is ready, we will continue adding support on the ao side.
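For context, a hedged sketch of the torchao INT4 weight-only call pattern that an XPU path would eventually plug into; it is shown on CUDA purely for illustration since, as noted above, the XPU backend isn't wired up yet, and the API names are from current torchao and may change.

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, int4_weight_only

# Not yet supported on "xpu"; CUDA is used here only to illustrate the call pattern.
model = nn.Linear(4096, 4096).to(device="cuda", dtype=torch.bfloat16)
quantize_(model, int4_weight_only(group_size=128))
```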
Closing as a duplicate of PR ao#1259. Thanks for the review comments above.