[MPS] Add support for Int4 groupwise quantization
Add support for MPS Int4 per-channel group-wise quantization through MPSGraph.
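For context, here is a minimal sketch of the arithmetic that per-channel group-wise Int4 quantization performs (a generic asymmetric uint4 variant with hypothetical helper names; the scheme actually lowered through MPSGraph may differ, e.g. 8da4w uses symmetric int4 weights):

```python
import torch

def quantize_int4_groupwise(w: torch.Tensor, group_size: int = 32):
    """Quantize a 2D weight [out_ch, in_ch] to 4-bit codes with one
    (scale, zero_point) pair per group of `group_size` input elements."""
    out_ch, in_ch = w.shape
    g = w.reshape(out_ch, in_ch // group_size, group_size)
    w_min = g.min(dim=-1, keepdim=True).values
    w_max = g.max(dim=-1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-6) / 15.0  # 4 bits -> 16 levels
    zero_point = (-w_min / scale).round()
    q = (g / scale + zero_point).round().clamp(0, 15).to(torch.uint8)
    return q.reshape(out_ch, in_ch), scale.squeeze(-1), zero_point.squeeze(-1)

def dequantize_int4_groupwise(q, scale, zero_point, group_size: int = 32):
    out_ch, in_ch = q.shape
    g = q.reshape(out_ch, in_ch // group_size, group_size).float()
    return ((g - zero_point.unsqueeze(-1)) * scale.unsqueeze(-1)).reshape(out_ch, in_ch)
```

With `-G 32`, a `[4096, 4096]` weight gets one scale/zero-point pair per 32 input elements per output channel, i.e. 128 pairs per row, which is the usual accuracy/size trade-off for group-wise 4-bit schemes.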
Testing: AOT export

```bash
python -m examples.models.llama2.export_llama \
  --checkpoint /Volumes/Source/weights/llama2/llama2-7b/llama-2-7b/consolidated.00.pth \
  --params /Volumes/Source/weights/llama2/llama2-7b/llama-2-7b/params.json \
  -kv --use_sdpa_with_kv_cache --mps -d fp32 --disable_dynamic_shape \
  -qmode 8da4w -G 32
```
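As a rough guide to what `-qmode 8da4w -G 32` selects, the export path applies a source transform that swaps linear layers for ones with 8-bit dynamically quantized activations and 4-bit group-wise weights. A sketch under the assumption that the torchao `Int8DynActInt4WeightQuantizer` is the quantizer wired up by `export_llama` at this point in time; verify the import path against your checkout:

```python
import torch
# Assumed import path for the torchao version vendored with ExecuTorch;
# the class location may differ in your checkout.
from torchao.quantization.quant_api import Int8DynActInt4WeightQuantizer

def apply_8da4w(model: torch.nn.Module, group_size: int = 32) -> torch.nn.Module:
    # 8-bit dynamic activation quantization + 4-bit group-wise weight
    # quantization, matching `-qmode 8da4w -G 32` above.
    quantizer = Int8DynActInt4WeightQuantizer(
        groupsize=group_size, precision=torch.float32
    )
    return quantizer.quantize(model)
```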
Runtime (note that macOS 15.0 (Sequoia) or iOS/iPadOS 18 is required for Int4 quantization):

```bash
~/tools/buck2_old2 run examples/models/llama2:main -- \
  --model_path=mps_llama2_q.pte \
  --tokenizer_path=tokenizer_llama2.bin \
  --prompt="What is the best place to visit in New York?" \
  --temperature=0
```
Answer:
What is the best place to visit in New York?
New York is a city that has something for everyone. Whether you’re looking for a place to relax and enjoy the sights, or you’re looking for a place to party and have a good time, New York has it all.
There are so many different places to visit in New York, it can be hard to decide where to go. But don’t worry, we’ve got you covered. We’ve compiled a list of the best
Note: this depends on https://github.com/pytorch/executorch/pull/4574 being merged first!
cc: @cccclai, @larryliu0820, @kimishpatel
:link: Helpful Links
:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4623
:x: 2 New Failures, 3 Unrelated Failures
As of commit 033c56254fdf2f15322bc3eee7e16e6109c3fd72 with merge base 6efc2225ccd74f5a589c821a6a4d1138806b89ab:

NEW FAILURES - The following jobs have failed:
- Apple / test-demo-ios / macos-job (gh): RuntimeError: Command bash /Users/runner/work/_temp/exec_script failed with exit code 65
- Apple / upload-frameworks-ios (gh): Credentials could not be loaded, please check your action inputs: Could not load credentials from any providers

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
- trunk / test-models-macos (cmake, add_mul, xnnpack-quantization-delegation, macos-m1-stable, 90) / macos-job (gh) (matched macos rule in flaky-rules.json): File doesn't exist
- trunk / test-models-macos (cmake, mv2, portable, macos-m1-stable, 90) / macos-job (gh) (matched macos rule in flaky-rules.json): File doesn't exist
- trunk / test-models-macos (cmake, vit, xnnpack-delegation, macos-m1-stable, 90) / macos-job (gh) (matched macos rule in flaky-rules.json): File doesn't exist
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@larryliu0820 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
This is awesome! It seems this PR includes all the changes in https://github.com/pytorch/executorch/pull/4574?
This PR needs to be landed after the 4GB serialization PR.
Thanks for adding the PR. Really glad to have it enable Llama models.
A separate question: it looks like we're using the source transform from -qmode 8da4w. If we apply this PR to stories, are we using the GPU or the ANE?
@shoumikhin has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.