
[MPS] Add support for Int4 groupwise quantization

DenisVieriu97 opened this pull request • 5 comments

Add support for MPS Int4 per-channel group-wise quantization through MPSGraph.
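To make the scheme concrete, here is a minimal NumPy sketch of symmetric Int4 group-wise quantization: each contiguous group of `group_size` weights shares one floating-point scale, and values are mapped into the int4 range [-8, 7]. This is an illustration of the general technique only, with hypothetical helper names — it is not the ExecuTorch or MPSGraph API.

```python
import numpy as np

def quantize_int4_groupwise(w, group_size=32):
    """Hedged sketch of symmetric int4 group-wise quantization.

    `w` is a 1-D float array whose length is a multiple of group_size.
    Returns int8-stored int4 codes of shape (n_groups, group_size) and
    one scale per group of shape (n_groups, 1).
    """
    assert w.size % group_size == 0
    g = w.reshape(-1, group_size)
    # One scale per group: map the group's max magnitude to 7.
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0
    scale = np.maximum(scale, 1e-9)  # guard against all-zero groups
    q = np.clip(np.round(g / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4_groupwise(q, scale):
    """Reconstruct approximate float weights from codes and scales."""
    return q.astype(np.float32) * scale
```

The `-G 32` flag in the export command below selects the group size; smaller groups give finer-grained scales (better accuracy) at the cost of more scale metadata.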


Testing: AOT export

python -m examples.models.llama2.export_llama --checkpoint /Volumes/Source/weights/llama2/llama2-7b/llama-2-7b/consolidated.00.pth --params /Volumes/Source/weights/llama2/llama2-7b/llama-2-7b/params.json -kv --use_sdpa_with_kv_cache --mps -d fp32 --disable_dynamic_shape -qmode 8da4w -G 32

Runtime (note that macOS 15.0 (Sequoia) or iOS/iPadOS 18 is required for Int4 quantization):

~/tools/buck2_old2 run examples/models/llama2:main -- --model_path=mps_llama2_q.pte --tokenizer_path=tokenizer_llama2.bin --prompt="What is the best place to visit in New York?"  --temperature=0

Answer:

What is the best place to visit in New York?
New York is a city that has something for everyone. Whether you’re looking for a place to relax and enjoy the sights, or you’re looking for a place to party and have a good time, New York has it all.
There are so many different places to visit in New York, it can be hard to decide where to go. But don’t worry, we’ve got you covered. We’ve compiled a list of the best

Note: this PR depends on https://github.com/pytorch/executorch/pull/4574 being merged first!

cc: @cccclai, @larryliu0820, @kimishpatel

DenisVieriu97 avatar Aug 09 '24 02:08 DenisVieriu97

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4623

Note: Links to docs will display an error until the docs builds have been completed.

:x: 2 New Failures, 3 Unrelated Failures

As of commit 033c56254fdf2f15322bc3eee7e16e6109c3fd72 with merge base 6efc2225ccd74f5a589c821a6a4d1138806b89ab:

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Aug 09 '24 02:08 pytorch-bot[bot]

@larryliu0820 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Aug 09 '24 18:08 facebook-github-bot

This is awesome! It seems this PR includes all the changes in https://github.com/pytorch/executorch/pull/4574?

larryliu0820 avatar Aug 09 '24 18:08 larryliu0820

This PR needs to be landed after the 4GB serialization PR.

cccclai avatar Aug 09 '24 18:08 cccclai

Thanks for adding the PR. Really glad to have it enable llama models.

A separate question: it looks like we're using the source transform from -qmode 8da4w. If we apply this PR to stories, are we using the GPU or ANE?
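For context on the -qmode 8da4w name: it denotes 8-bit dynamically quantized activations combined with 4-bit group-wise weights. The "8da" half can be sketched as per-row (per-token) symmetric int8 quantization whose scales are computed at runtime from each row's observed range. This is a hedged NumPy illustration of the general idea with made-up helper names, not the actual source transform used by the exporter.

```python
import numpy as np

def quantize_activations_int8_dynamic(x):
    """Sketch of dynamic per-row symmetric int8 activation quantization.

    `x` has shape (tokens, features); each row gets its own scale,
    computed on the fly from that row's max magnitude.
    """
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-9)  # avoid division by zero
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale
```

Because the activation scales are computed per token at inference time, no activation calibration data is needed ahead of export; only the 4-bit weight quantization happens ahead of time.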

cccclai avatar Aug 09 '24 19:08 cccclai

@larryliu0820 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Aug 12 '24 19:08 facebook-github-bot

@shoumikhin has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Aug 13 '24 21:08 facebook-github-bot
