Enable Intel GPU
This PR is migrated from gpt-fast #79. It adds initial support for Intel GPU in torch-ao via the device option "xpu" (i.e., --device "xpu"). Currently, both BF16 and INT8 are functionally supported under eager mode and compile mode. INT4 support and further performance improvements are WIP.
Here are the steps to run Llama2-7b and Llama3-8b generation on Intel GPU with torch-ao; a minimal Python sketch of the quantization options follows the launch commands below. We will update the tutorial later as performance improves.
Launch
- command for BF16: `python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --write_result benchmark_results.txt --device xpu --precision torch.bfloat16`
- command for INT8 dynamic quantization: `python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --write_result benchmark_results.txt --device xpu --quantization int8dq`
- command for INT8 weight-only quantization: `python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --write_result benchmark_results.txt --device xpu --quantization int8wo`
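For reference, here is a minimal Python sketch of the two INT8 options mapped to the `--quantization` flags above. It is illustrative only: it assumes a PyTorch build with XPU support (`torch.xpu`), a torchao version that exposes `quantize_`, `int8_weight_only`, and `int8_dynamic_activation_int8_weight`, and (for compile mode) a Triton backend for XPU; the toy model and sizes are placeholders, not the Llama model.

```python
import torch
import torch.nn as nn
from torchao.quantization import (
    quantize_,
    int8_weight_only,
    int8_dynamic_activation_int8_weight,
)

# Fall back to CPU so the sketch still runs on machines without an Intel GPU.
device = "xpu" if torch.xpu.is_available() else "cpu"

def make_model() -> nn.Module:
    # Stand-in for the Llama transformer; only the nn.Linear layers get quantized.
    return nn.Sequential(
        nn.Linear(4096, 11008),
        nn.SiLU(),
        nn.Linear(11008, 4096),
    ).to(device=device, dtype=torch.bfloat16)

# --quantization int8wo: INT8 weight-only quantization, applied in place.
m_wo = make_model()
quantize_(m_wo, int8_weight_only())

# --quantization int8dq: INT8 dynamic activation + INT8 weight quantization.
m_dq = make_model()
quantize_(m_dq, int8_dynamic_activation_int8_weight())

# Compile mode (on XPU this relies on intel/intel-xpu-backend-for-triton).
m_wo = torch.compile(m_wo)

x = torch.randn(1, 4096, device=device, dtype=torch.bfloat16)
with torch.no_grad():
    print(m_wo(x).shape, m_dq(x).shape)
```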
Thanks for your PR @dbyoung18. My preference here would be to land generic accelerator memory APIs in core and then use those. That way we wouldn't need to ask people who are trying to use Intel GPUs to change their code; it would be something like torch.get_accelerator().max_memory_reserved() or torch.accelerator.max_memory_reserved()
@guangyey is doing some work on this at Intel and can share more information on the current plan of record
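To make that concrete, below is a hedged sketch of the per-backend dispatch that such a device-agnostic API would replace. It assumes a PyTorch build where the `torch.xpu` memory-stats APIs have landed (see the PRs Intel mentions below); the helper name is hypothetical, and the unified `torch.accelerator` spelling above is the proposal, not something this sketch relies on.

```python
import torch

def max_memory_reserved(device: torch.device) -> int:
    # Hypothetical helper: pick the backend-specific caching-allocator statistic.
    # A device-agnostic torch.accelerator memory API would make this dispatch unnecessary.
    if device.type == "cuda":
        return torch.cuda.max_memory_reserved(device)
    if device.type == "xpu":
        return torch.xpu.max_memory_reserved(device)
    return 0  # CPU and other backends: no caching-allocator stats to report
```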
Hi @msaroufim and @dbyoung18, let me explain the plan. In the long term, we have a proposal to provide device-agnostic APIs for each accelerator. We would like to start with the runtime device and stream components, and then gradually cover the allocator memory APIs. Our RFC is "[RFC] A device-agnostic Python runtime API design for stream-based accelerators". In the short term, for XPU, we plan to provide those memory APIs first so that customer usage is not blocked. We have prepared a series of PRs to implement them; see #129919, which will land soon if everything goes well.
Converting to draft for now, pending #129919 being ready.
@dbyoung18 , may I know why the change is torchao/_models/llama/generate.py only?
Hi @EikanWang. We have a plan to gradually support torch-ao on Intel GPU across different models (Llama2, Llama3, SAM, etc.) and different features (BF16/INT8/INT4/FP8, etc.). As the first step, we chose Llama2 and Llama3 BF16 as the starting point. With this PR, Llama2-7b and Llama3-8b can run BF16 on Intel GPU under both eager mode and compile mode by passing --device xpu to the launch commands, and INT8 can be launched under compile mode with intel/intel-xpu-backend-for-triton. We are also working to upstream INT8/INT4/FP8 support on Intel GPU with oneDNN to PyTorch core. Once that upstreaming lands in stock PyTorch, we will continue our contributions to torch-ao to make the library more available and powerful across platforms.
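As a quick smoke test of the eager vs. compile paths described here (not the full generate.py flow; it assumes a PyTorch build with XPU support and, for the compiled path, a Triton backend for XPU):

```python
import torch

# Fall back to CPU so the check still runs without an Intel GPU.
device = "xpu" if torch.xpu.is_available() else "cpu"

def mlp(x, w1, w2):
    # A tiny BF16 MLP standing in for a transformer feed-forward block.
    return torch.nn.functional.silu(x @ w1) @ w2

x  = torch.randn(8, 4096, device=device, dtype=torch.bfloat16)
w1 = torch.randn(4096, 11008, device=device, dtype=torch.bfloat16)
w2 = torch.randn(11008, 4096, device=device, dtype=torch.bfloat16)

eager_out = mlp(x, w1, w2)                    # eager mode
compiled_out = torch.compile(mlp)(x, w1, w2)  # compile mode (Triton on XPU)

print(torch.allclose(eager_out, compiled_out, atol=1e-2, rtol=1e-2))
```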
I'm already planning to write an RFC on how we'll support more hardware architectures. Right now ao is very NVIDIA-centric, but a lot of recent issues have been about supporting more hardware architectures on more operating systems. We need to think carefully about generalizing devices, CI/testing, and performance.
We are working on a device-agnostic runtime API for accelerators. It may help ao support more hardware architectures.
@malfet , @msaroufim FYI - https://dev-discuss.pytorch.org/t/python-c-api-rules-for-device-generic-apis/2511
@EikanWang are there any GitHub runners for Intel GPUs to ensure our test suite works? We don't have to run the code per commit, but at least a nightly check so we understand what works and what doesn't would be helpful.
@msaroufim , may I know if two runners for Intel GPUs are good enough now for the ao nightly?
Yup that should be fine! We won't be running on Intel runners per commit for now. cc @atalman @seemethere as well
Sounds good! We will add two runners to Intel GPU CI/CD resource pool and reserve the two runners dedicated to ao nightly.
cc @riverliuintel, @chuanqi129
Currently, we have 16 PyTorch organization-level XPU runners with the label "linux.idc.xpu" used for PyTorch CI/CD. I think the torchao repo can use them directly.
@chuanqi129 , @riverliuintel, any update?
They'd also need to be hooked up to the Nova workflows; see #999, which ran into some issues.
@msaroufim, may I know what "Nova workflows" means? Is it an ao-specific workflow?
@EikanWang we leverage some reusable GitHub workflows (https://github.com/pytorch/ao/blob/main/.github/workflows/regression_test.yml#L68) produced by pytorch/test-infra; this lets us easily build and test ao on multiple architectures and devices.
We could potentially do a one-off run of our test suite to see what works in ao out of the box today, but it will be hard to track progress without the CI integration.
As for how to integrate with the Nova workflows, your best bet is to reach out to @seemethere and @atalman on the Intel Slack channel. Feel free to tag me there as well so we can move faster.
@dbyoung18 does this one support int4 woq ?
Currently, it doesn't support INT4 weight-only quantization on Intel GPU. We are in the process of upstreaming the INT4 XPU backend to PyTorch (targeting v2.5). Once that upstream is ready, we will continue adding support on the ao side.
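For context, a hedged sketch of the torchao INT4 weight-only call pattern that an XPU path would eventually plug into; it is shown on CUDA purely for illustration since, as noted above, the XPU backend isn't wired up yet, and the API names are from current torchao and may change.

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, int4_weight_only

# Not yet supported on "xpu"; CUDA is used here only to illustrate the call pattern.
model = nn.Linear(4096, 4096).to(device="cuda", dtype=torch.bfloat16)
quantize_(model, int4_weight_only(group_size=128))
```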
Closing as a duplicate of PR ao#1259. Thanks for the review comments above.