[Draft] Qualcomm AI Engine Direct - [WIP] llama2...
`examples/qualcomm/llama2/llama.py` can be used as follows:

```bash
python examples/qualcomm/llama2/llama.py -a llama_only_quant \
  -b build_android \
  -m SM8650 \
  --ptq 16a4w \
  --tokenizer_model tokenizer.model \
  --checkpoint stories110M.pt \
  --params params.json \
  --tokenizer_bin tokenizer.bin \
  --prompt "Once"
```
Note that we do not yet have a runner for llama2 without splitting the model.
This is still far from a workable statically quantized llama2-7b.
stories110M may work with 16a4w on HTP, but please note that calibration() has not been tuned carefully yet.
Below is a reference command; it may change at any time:
```bash
python examples/qualcomm/llama2/composite_llama.py \
  -a storiesllama_16a4w \
  -b build_android \
  -s <device_id> \
  -H <host_connecting_device> \
  -m SM8650 \
  --ptq 16a4w \
  --tokenizer_model tokenizer.model \
  --checkpoint stories110M.pt \
  --params params.json \
  --tokenizer_bin tokenizer.bin \
  --prompt "Once" \
  --temperature 0
```
The following optimizations were applied for HTP performance:
- The multi-head attention is decomposed into multiple single-head attentions.
- The KV cache is moved to graph I/O; the cache update is performed on CPU in qnn_llama_runner.cpp.
- llama2 is partitioned into 6 .pte files in examples/qualcomm/llama2/composite_llama.py.
- The embedding layer is quantized. This may need further investigation, e.g., whether it can be moved out of the model and run on CPU.
- u16 and u8 mixed-precision quantization is supported.
- The KV cache stays in quantized format at graph I/O.
- RMSNorm is tweaked slightly to reduce its quantization sensitivity.
- The HTP Spill-Fill buffer feature is shared among the .pte files.
- All Linear layers are converted to Conv2d.
- quant_min and quant_max in Observers are set properly, with offset=128, for symmetric quantization.
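To illustrate the first bullet, here is a minimal sketch (not the actual ExecuTorch implementation) showing that multi-head attention can be computed as several independent single-head attentions whose outputs are concatenated; the function names and shapes are assumptions for the demo:

```python
import torch

def mha_combined(q, k, v, n_heads):
    # q, k, v: (seq, n_heads * head_dim); standard batched multi-head attention.
    seq, dim = q.shape
    hd = dim // n_heads
    qh = q.view(seq, n_heads, hd).transpose(0, 1)  # (n_heads, seq, hd)
    kh = k.view(seq, n_heads, hd).transpose(0, 1)
    vh = v.view(seq, n_heads, hd).transpose(0, 1)
    attn = torch.softmax(qh @ kh.transpose(-1, -2) / hd**0.5, dim=-1)
    out = attn @ vh                                # (n_heads, seq, hd)
    return out.transpose(0, 1).reshape(seq, dim)

def mha_as_single_heads(q, k, v, n_heads):
    # Same math, but each head runs as an independent single-head attention.
    seq, dim = q.shape
    hd = dim // n_heads
    outs = []
    for h in range(n_heads):
        qh = q[:, h * hd:(h + 1) * hd]
        kh = k[:, h * hd:(h + 1) * hd]
        vh = v[:, h * hd:(h + 1) * hd]
        attn = torch.softmax(qh @ kh.T / hd**0.5, dim=-1)
        outs.append(attn @ vh)
    return torch.cat(outs, dim=-1)
```

Both variants produce the same result; the single-head form simply exposes smaller ops, which can be friendlier to the HTP compiler.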
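The Linear-to-Conv2d bullet can be sketched as follows (a hedged illustration, not the repository's converter): a `nn.Linear` is equivalent to a 1x1 `nn.Conv2d` once the input is reshaped to NCHW with spatial size 1x1:

```python
import torch
import torch.nn as nn

def linear_to_conv2d(linear: nn.Linear) -> nn.Conv2d:
    # A 1x1 Conv2d with the same weights computes the same affine map as Linear.
    conv = nn.Conv2d(linear.in_features, linear.out_features,
                     kernel_size=1, bias=linear.bias is not None)
    with torch.no_grad():
        conv.weight.copy_(
            linear.weight.view(linear.out_features, linear.in_features, 1, 1))
        if linear.bias is not None:
            conv.bias.copy_(linear.bias)
    return conv
```

Usage: reshape `(N, C)` activations to `(N, C, 1, 1)` before the conv and flatten afterwards; the outputs match the original Linear. Conv2d is typically better supported and better optimized on HTP than MatMul/FullyConnected paths.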
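For the last bullet, a minimal sketch of what "offset=128 in symmetric quantization" can mean for an unsigned 8-bit observer (the exact ExecuTorch observer logic may differ; the helper names here are assumptions):

```python
def symmetric_uint8_qparams(min_val, max_val, quant_max=255, offset=128):
    # Symmetric quantization maps real 0.0 exactly onto the zero point `offset`.
    # The scale is derived from the larger magnitude so the range stays symmetric.
    amax = max(abs(min_val), abs(max_val))
    scale = amax / (quant_max - offset)  # e.g. amax / 127 for uint8 with offset=128
    return scale, offset

def quantize(x, scale, zero_point, quant_min=0, quant_max=255):
    # Round-to-nearest affine quantization with clamping to the integer range.
    q = round(x / scale) + zero_point
    return max(quant_min, min(quant_max, q))
```

With `offset=128`, real zero lands exactly on an integer code, and positive/negative values get an equal number of steps, which avoids the asymmetry errors a naive `[0, 255]` range would introduce.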
Please see https://github.com/pytorch/executorch/pull/4142 instead.