Qualcomm AI Engine Direct - Quantization Recipe for LLM
Summary
- Adds a fine-grained quantization annotation mechanism: the quantization recipe.
- Applies it to the existing LLM models with fine-grained quantization configs (a rough sketch of the flow follows).
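For context, this is roughly what fine-grained annotation looks like on top of the existing PT2E flow. It is a minimal sketch, not this PR's exact recipe API: `set_default_quant_config`, `add_custom_quant_annotations`, and the `annotate_matmul_16a8w` annotator mirror existing helpers under backends/qualcomm/quantizer, but treat the exact names and signatures as assumptions.

```python
import torch
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e

# Import paths/names below are assumptions based on the current
# backends/qualcomm/quantizer layout, not the API added by this PR.
from executorch.backends.qualcomm.quantizer.custom_annotation import annotate_matmul_16a8w
from executorch.backends.qualcomm.quantizer.quantizer import QnnQuantizer, QuantDtype


class TinyAttentionBlock(torch.nn.Module):
    """Stand-in for an LLM sub-block: one linear projection and one matmul."""

    def __init__(self):
        super().__init__()
        self.q_proj = torch.nn.Linear(64, 64)

    def forward(self, x):
        q = self.q_proj(x)
        return torch.matmul(q, x.transpose(-1, -2))


model = TinyAttentionBlock().eval()
example_inputs = (torch.randn(1, 8, 64),)

quantizer = QnnQuantizer()
# Coarse default: 16-bit activations / 4-bit weights, per-channel where supported.
quantizer.set_default_quant_config(
    QuantDtype.use_16a4w,
    is_qat=False,
    is_conv_per_channel=True,
    is_linear_per_channel=True,
)
# Fine-grained override: keep matmuls at 16a8w via a custom annotator.
quantizer.add_custom_quant_annotations((annotate_matmul_16a8w,))

graph_module = torch.export.export(model, example_inputs).module()
prepared = prepare_pt2e(graph_module, quantizer)
prepared(*example_inputs)  # calibration pass
quantized = convert_pt2e(prepared)
```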
Test plan
All LLM CI under TestExampleLLMScript:
python -m backends.qualcomm.tests.test_qnn_delegate.TestExampleLLMScript -s ${device_id} -H ${host_id} -m ${soc} -b build-android
@pytorchbot label "release notes: qualcomm"
Hi @cccclai,
This PR includes the Quantization Recipe we went over in today's meeting. It introduces fine-grained quantization annotation for the LLM models we currently support. Please take a look. Thanks!
cc: @haowhsu-quic
@DannyYuyang-quic thanks for the PR. We have a native executorch.export infra and ExportRecipes (https://github.com/pytorch/executorch/blob/main/export/export.py#L38) that let users easily apply configurations such as these; for example, I added a recipe for QNN FP16 (https://github.com/pytorch/executorch/blob/main/backends/qualcomm/recipes/qnn_recipe_types.py#L24). It would be great if we could expose these quant configs as well for everyone to use, since that would significantly lower the friction of onboarding to QNN.
Also note that if you use ExportRecipes, you don't have to call to_edge_transform_and_lower_to_qnn, as the recipe infra takes care of transforms before lowering. Let me know if you have any questions. Thanks!
CC: @cccclai
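For reference, lowering through the recipe infra then looks roughly like this. A sketch only: `QNNRecipeType` is inferred from the qnn_recipe_types.py file linked above, and the `ExportRecipe.get_recipe` / session calls follow the export/export.py infra; exact names and signatures are assumptions.

```python
# Sketch of the recipe-based flow (names/signatures are assumptions; see above).
from executorch.export import ExportRecipe, export
from executorch.backends.qualcomm.recipes.qnn_recipe_types import QNNRecipeType

# model / example_inputs: your eager module and sample inputs,
# as in the earlier annotation sketch.
recipe = ExportRecipe.get_recipe(QNNRecipeType.FP16)  # QNN FP16 recipe
session = export(model=model, example_inputs=[example_inputs], export_recipe=recipe)
session.save_to_pte("llm_qnn_fp16")  # assumed session helper for writing the .pte
```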
@abhinaykukkadapu this PR is different from the export recipe you added. It's about how to add more customization when quantizing a model. The current recipes for different backends don't offer this level of customization, so we need to either expose some API or leave it for advanced users only.
Hi @abhinaykukkadapu, @cccclai, thanks for the feedback, and thanks Chen for clarifying! As Chen said, the goal of this PR is mainly to support more customization when quantizing a model.
@abhinaykukkadapu for now, this PR does not use ExportRecipes.
Regarding exposing these quant configs in ExportRecipes: we're currently refactoring qconfig.py and the QNNQuantizer, so we can discuss how to integrate this in a follow-up PR.
@DannyYuyang-quic
> Regarding exposing these quant configs in ExportRecipes: we're currently refactoring qconfig.py and the QNNQuantizer, so we can discuss how to integrate this in a follow-up PR.
Thanks for your work and for letting me know. Yes, this would be great: if we expose these complex configs as ExportRecipes in the future, users can lower a model with just a couple of lines of code.
@cccclai has imported this pull request. If you are a Meta employee, you can view this in D87349343.
It seems like this PR breaks a unit test (https://github.com/pytorch/executorch/actions/runs/19558624238/job/56006215617); can you fix it?
> It seems like this PR breaks a unit test (https://github.com/pytorch/executorch/actions/runs/19558624238/job/56006215617); can you fix it?
I'll look into it, thanks.