
[MPS] Add support for Int4 groupwise quantization

DenisVieriu97 opened this pull request • 5 comments

Add support for MPS Int4 per-channel group-wise quantization through MPSGraph.
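To make the scheme concrete, here is a minimal NumPy sketch of symmetric Int4 group-wise quantization: each contiguous group of `group_size` weights shares one floating-point scale, and values are mapped into the int4 range [-8, 7]. This is an illustration of the general technique only, with hypothetical helper names — it is not the ExecuTorch or MPSGraph API.

```python
import numpy as np

def quantize_int4_groupwise(w, group_size=32):
    """Hedged sketch of symmetric int4 group-wise quantization.

    `w` is a 1-D float array whose length is a multiple of group_size.
    Returns int8-stored int4 codes of shape (n_groups, group_size) and
    one scale per group of shape (n_groups, 1).
    """
    assert w.size % group_size == 0
    g = w.reshape(-1, group_size)
    # One scale per group: map the group's max magnitude to 7.
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0
    scale = np.maximum(scale, 1e-9)  # guard against all-zero groups
    q = np.clip(np.round(g / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4_groupwise(q, scale):
    """Reconstruct approximate float weights from codes and scales."""
    return q.astype(np.float32) * scale
```

The `-G 32` flag in the export command below selects the group size; smaller groups give finer-grained scales (better accuracy) at the cost of more scale metadata.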


Testing: AOT export

python -m examples.models.llama2.export_llama --checkpoint /Volumes/Source/weights/llama2/llama2-7b/llama-2-7b/consolidated.00.pth --params /Volumes/Source/weights/llama2/llama2-7b/llama-2-7b/params.json -kv --use_sdpa_with_kv_cache --mps -d fp32 --disable_dynamic_shape -qmode 8da4w -G 32

Runtime (note that macOS 15.0 (Sequoia) or iOS/iPadOS 18 is required for Int4 quantization):

~/tools/buck2_old2 run examples/models/llama2:main -- --model_path=mps_llama2_q.pte --tokenizer_path=tokenizer_llama2.bin --prompt="What is the best place to visit in New York?"  --temperature=0

Answer:

What is the best place to visit in New York?
New York is a city that has something for everyone. Whether you’re looking for a place to relax and enjoy the sights, or you’re looking for a place to party and have a good time, New York has it all.
There are so many different places to visit in New York, it can be hard to decide where to go. But don’t worry, we’ve got you covered. We’ve compiled a list of the best

Note: this PR depends on https://github.com/pytorch/executorch/pull/4574 being merged first!

cc: @cccclai, @larryliu0820, @kimishpatel

DenisVieriu97 avatar Aug 09 '24 02:08 DenisVieriu97

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4623

Note: Links to docs will display an error until the docs builds have been completed.

:x: 2 New Failures, 3 Unrelated Failures

As of commit 033c56254fdf2f15322bc3eee7e16e6109c3fd72 with merge base 6efc2225ccd74f5a589c821a6a4d1138806b89ab:

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Aug 09 '24 02:08 pytorch-bot[bot]

@larryliu0820 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Aug 09 '24 18:08 facebook-github-bot

This is awesome! It seems this PR includes all the changes in https://github.com/pytorch/executorch/pull/4574?

larryliu0820 avatar Aug 09 '24 18:08 larryliu0820

This PR needs to be landed after the 4GB serialization PR.

cccclai avatar Aug 09 '24 18:08 cccclai

Thanks for adding the PR. Really glad to have it enable llama models.

A separate question: it looks like we're using the source transform from -qmode 8da4w. If we apply this PR to stories, are we using the GPU or ANE?
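For context on the -qmode 8da4w name: it denotes 8-bit dynamically quantized activations combined with 4-bit group-wise weights. The "8da" half can be sketched as per-row (per-token) symmetric int8 quantization whose scales are computed at runtime from each row's observed range. This is a hedged NumPy illustration of the general idea with made-up helper names, not the actual source transform used by the exporter.

```python
import numpy as np

def quantize_activations_int8_dynamic(x):
    """Sketch of dynamic per-row symmetric int8 activation quantization.

    `x` has shape (tokens, features); each row gets its own scale,
    computed on the fly from that row's max magnitude.
    """
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-9)  # avoid division by zero
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale
```

Because the activation scales are computed per token at inference time, no activation calibration data is needed ahead of export; only the 4-bit weight quantization happens ahead of time.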

cccclai avatar Aug 09 '24 19:08 cccclai

@larryliu0820 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Aug 12 '24 19:08 facebook-github-bot

@shoumikhin has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Aug 13 '24 21:08 facebook-github-bot
