[Draft] Qualcomm AI Engine Direct - [WIP] llama2...
`examples/qualcomm/llama2/llama.py` can be used as follows:

```bash
python examples/qualcomm/llama2/llama.py -a llama_only_quant \
  -b build_android \
  -m SM8650 \
  --ptq 16a4w \
  --tokenizer_model tokenizer.model \
  --checkpoint stories110M.pt \
  --params params.json \
  --tokenizer_bin tokenizer.bin \
  --prompt "Once"
```
Note that we do not yet have a runner for llama2 without splitting the model.
This is still far from a workable statically quantized llama2-7b.
stories110M may work with 16a4w on HTP, but please note that calibration() has not been tuned carefully yet.
Below is a reference command; it may change at any time:
```bash
python examples/qualcomm/llama2/composite_llama.py \
  -a storiesllama_16a4w \
  -b build_android \
  -s <device_id> \
  -H <host_connecting_device> \
  -m SM8650 \
  --ptq 16a4w \
  --tokenizer_model tokenizer.model \
  --checkpoint stories110M.pt \
  --params params.json \
  --tokenizer_bin tokenizer.bin \
  --prompt "Once" \
  --temperature 0
```
The following optimizations were applied for HTP performance:
- The multi-head attention is decomposed into multiple single-head attentions.
- The KV cache is moved to graph I/O; the cache update is performed on CPU in qnn_llama_runner.cpp.
- llama2 is partitioned into 6 .pte files in examples/qualcomm/llama2/composite_llama.py.
- The embedding layer is quantized. This may need further investigation, e.g., whether it can be moved out of the model and run on CPU.
- u16 and u8 mixed-precision quantization is supported.
- The KV cache stays in quantized format at graph I/O.
- RMSNorm is tweaked slightly to reduce its quantization sensitivity.
- The HTP Spill-Fill buffer feature is shared among the .pte files.
- All Linear layers are converted to Conv2d.
- quant_min and quant_max in Observers are set properly, with offset=128, for symmetric quantization.
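To illustrate the first bullet, here is a minimal sketch (not the actual ExecuTorch implementation) showing that multi-head attention can be computed as several independent single-head attentions whose outputs are concatenated; the function names and shapes are assumptions for the demo:

```python
import torch

def mha_combined(q, k, v, n_heads):
    # q, k, v: (seq, n_heads * head_dim); standard batched multi-head attention.
    seq, dim = q.shape
    hd = dim // n_heads
    qh = q.view(seq, n_heads, hd).transpose(0, 1)  # (n_heads, seq, hd)
    kh = k.view(seq, n_heads, hd).transpose(0, 1)
    vh = v.view(seq, n_heads, hd).transpose(0, 1)
    attn = torch.softmax(qh @ kh.transpose(-1, -2) / hd**0.5, dim=-1)
    out = attn @ vh                                # (n_heads, seq, hd)
    return out.transpose(0, 1).reshape(seq, dim)

def mha_as_single_heads(q, k, v, n_heads):
    # Same math, but each head runs as an independent single-head attention.
    seq, dim = q.shape
    hd = dim // n_heads
    outs = []
    for h in range(n_heads):
        qh = q[:, h * hd:(h + 1) * hd]
        kh = k[:, h * hd:(h + 1) * hd]
        vh = v[:, h * hd:(h + 1) * hd]
        attn = torch.softmax(qh @ kh.T / hd**0.5, dim=-1)
        outs.append(attn @ vh)
    return torch.cat(outs, dim=-1)
```

Both variants produce the same result; the single-head form simply exposes smaller ops, which can be friendlier to the HTP compiler.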
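The Linear-to-Conv2d bullet can be sketched as follows (a hedged illustration, not the repository's converter): a `nn.Linear` is equivalent to a 1x1 `nn.Conv2d` once the input is reshaped to NCHW with spatial size 1x1:

```python
import torch
import torch.nn as nn

def linear_to_conv2d(linear: nn.Linear) -> nn.Conv2d:
    # A 1x1 Conv2d with the same weights computes the same affine map as Linear.
    conv = nn.Conv2d(linear.in_features, linear.out_features,
                     kernel_size=1, bias=linear.bias is not None)
    with torch.no_grad():
        conv.weight.copy_(
            linear.weight.view(linear.out_features, linear.in_features, 1, 1))
        if linear.bias is not None:
            conv.bias.copy_(linear.bias)
    return conv
```

Usage: reshape `(N, C)` activations to `(N, C, 1, 1)` before the conv and flatten afterwards; the outputs match the original Linear. Conv2d is typically better supported and better optimized on HTP than MatMul/FullyConnected paths.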
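For the last bullet, a minimal sketch of what "offset=128 in symmetric quantization" can mean for an unsigned 8-bit observer (the exact ExecuTorch observer logic may differ; the helper names here are assumptions):

```python
def symmetric_uint8_qparams(min_val, max_val, quant_max=255, offset=128):
    # Symmetric quantization maps real 0.0 exactly onto the zero point `offset`.
    # The scale is derived from the larger magnitude so the range stays symmetric.
    amax = max(abs(min_val), abs(max_val))
    scale = amax / (quant_max - offset)  # e.g. amax / 127 for uint8 with offset=128
    return scale, offset

def quantize(x, scale, zero_point, quant_min=0, quant_max=255):
    # Round-to-nearest affine quantization with clamping to the integer range.
    q = round(x / scale) + zero_point
    return max(quant_min, min(quant_max, q))
```

With `offset=128`, real zero lands exactly on an integer code, and positive/negative values get an equal number of steps, which avoids the asymmetry errors a naive `[0, 255]` range would introduce.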
Please see https://github.com/pytorch/executorch/pull/4142 instead.