Qualcomm AI Engine Direct - Quantization Recipe for LLM
Summary
- Adds a fine-grained quantization annotation mechanism: the quantization recipe.
- Applies it to the existing LLM models with fine-grained quantization configs (a rough sketch of the flow follows).
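For context, this is roughly what fine-grained annotation looks like on top of the existing PT2E flow. It is a minimal sketch, not this PR's exact recipe API: `set_default_quant_config`, `add_custom_quant_annotations`, and the `annotate_matmul_16a8w` annotator mirror existing helpers under backends/qualcomm/quantizer, but treat the exact names and signatures as assumptions.

```python
import torch
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e

# Import paths/names below are assumptions based on the current
# backends/qualcomm/quantizer layout, not the API added by this PR.
from executorch.backends.qualcomm.quantizer.custom_annotation import annotate_matmul_16a8w
from executorch.backends.qualcomm.quantizer.quantizer import QnnQuantizer, QuantDtype


class TinyAttentionBlock(torch.nn.Module):
    """Stand-in for an LLM sub-block: one linear projection and one matmul."""

    def __init__(self):
        super().__init__()
        self.q_proj = torch.nn.Linear(64, 64)

    def forward(self, x):
        q = self.q_proj(x)
        return torch.matmul(q, x.transpose(-1, -2))


model = TinyAttentionBlock().eval()
example_inputs = (torch.randn(1, 8, 64),)

quantizer = QnnQuantizer()
# Coarse default: 16-bit activations / 4-bit weights, per-channel where supported.
quantizer.set_default_quant_config(
    QuantDtype.use_16a4w,
    is_qat=False,
    is_conv_per_channel=True,
    is_linear_per_channel=True,
)
# Fine-grained override: keep matmuls at 16a8w via a custom annotator.
quantizer.add_custom_quant_annotations((annotate_matmul_16a8w,))

graph_module = torch.export.export(model, example_inputs).module()
prepared = prepare_pt2e(graph_module, quantizer)
prepared(*example_inputs)  # calibration pass
quantized = convert_pt2e(prepared)
```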
Test plan
All LLM CI under TestExampleLLMScript:
python -m backends.qualcomm.tests.test_qnn_delegate.TestExampleLLMScript -s ${device_id} -H ${host_id} -m ${soc} -b build-android
@pytorchbot label "release notes: qualcomm"
Hi @cccclai,
This PR includes the Quantization Recipe we went over in today's meeting. It introduces fine-grained quantization annotation for the LLM models we currently support. Please take a look. Thanks!
cc: @haowhsu-quic
@DannyYuyang-quic thanks for the PR. We have a native executorch.export infra and ExportRecipes (https://github.com/pytorch/executorch/blob/main/export/export.py#L38) that let users easily apply configurations such as these; for example, I added a recipe for QNN FP16 (https://github.com/pytorch/executorch/blob/main/backends/qualcomm/recipes/qnn_recipe_types.py#L24). It would be great if we could expose these quant configs as well for everyone to use, since that would significantly lower the friction of onboarding to QNN.
Also note that if you use ExportRecipes, you don't have to call to_edge_transform_and_lower_to_qnn, as the recipe infra takes care of transforms before lowering. Let me know if you have any questions. Thanks!
CC: @cccclai
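For reference, lowering through the recipe infra then looks roughly like this. A sketch only: `QNNRecipeType` is inferred from the qnn_recipe_types.py file linked above, and the `ExportRecipe.get_recipe` / session calls follow the export/export.py infra; exact names and signatures are assumptions.

```python
# Sketch of the recipe-based flow (names/signatures are assumptions; see above).
from executorch.export import ExportRecipe, export
from executorch.backends.qualcomm.recipes.qnn_recipe_types import QNNRecipeType

# model / example_inputs: your eager module and sample inputs,
# as in the earlier annotation sketch.
recipe = ExportRecipe.get_recipe(QNNRecipeType.FP16)  # QNN FP16 recipe
session = export(model=model, example_inputs=[example_inputs], export_recipe=recipe)
session.save_to_pte("llm_qnn_fp16")  # assumed session helper for writing the .pte
```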
@abhinaykukkadapu this PR is different from the export recipe you added. It's about how to add more customization when quantizing a model. The current recipes for different backends don't offer this level of customization, so we need to either expose some API or leave it for advanced users only.
Hi @abhinaykukkadapu, @cccclai, thanks for the feedback, and thanks Chen for clarifying! As Chen said, the goal of this PR is mainly to support more customization when quantizing a model.
@abhinaykukkadapu for now, this PR does not use ExportRecipes.
Regarding exposing these quant configs in ExportRecipes: we're currently refactoring qconfig.py and the QNNQuantizer, so we can discuss how to integrate this in a follow-up PR.
@DannyYuyang-quic
> Regarding exposing these quant configs in ExportRecipes: we're currently refactoring qconfig.py and the QNNQuantizer, so we can discuss how to integrate this in a follow-up PR.
Thanks for your work and for letting me know. Yes, this would be great: if we expose these complex configs as ExportRecipes in the future, users can lower a model with just a couple of lines of code.
@cccclai has imported this pull request. If you are a Meta employee, you can view this in D87349343.
It seems like this PR breaks a unit test (https://github.com/pytorch/executorch/actions/runs/19558624238/job/56006215617); can you fix it?
> It seems like this PR breaks a unit test (https://github.com/pytorch/executorch/actions/runs/19558624238/job/56006215617); can you fix it?
I'll look into it, thanks.