
[Draft] Qualcomm AI Engine Direct - [WIP] llama2...

Open chiwwang opened this issue 1 year ago • 1 comment

examples/qualcomm/llama2/llama.py can be used as follows:

python examples/qualcomm/llama2/llama.py -a llama_only_quant \
-b build_android \
-m SM8650 \
--ptq 16a4w \
--tokenizer_model tokenizer.model \
--checkpoint stories110M.pt \
--params params.json \
--tokenizer_bin tokenizer.bin \
--prompt Once

Note that we don't have a runner for llama2 without splitting the model.

It's still FAR AWAY from a workable, statically quantized llama2-7b. stories110M might work on 16a4w HTP, but please note that calibration() has not been tuned well yet. Below is a reference command, but it can change at any time.

python examples/qualcomm/llama2/composite_llama.py \
-a storiesllama_16a4w \
-b build_android \
-s <device_id> \
-H <host_connecting_device> \
-m SM8650 \
--ptq 16a4w \
--tokenizer_model tokenizer.model \
--checkpoint stories110M.pt \
--params params.json \
--tokenizer_bin tokenizer.bin \
--prompt Once \
--temperature 0

The optimizations we applied for HTP performance are listed below:

  1. Multi-head attention is transformed into multiple single-head attentions.
  2. KV-cache is changed to graph I/O. The update is performed in qnn_llama_runner.cpp on CPU.
  3. llama2 is partitioned into 6 pte files in examples/qualcomm/llama2/composite_llama.py.
  4. Embedding is quantized. This might need further investigation, e.g., whether we can move it out of the model and run it on CPU.
  5. Support u16 and u8 mixed-precision quantization.
  6. KV-cache is left as quantized format in graph I/O.
  7. RMSNorm is tweaked a bit to reduce the quantization sensitivity.
  8. HTP Spill-Fill buffer feature is used among pte files.
  9. Convert all Linear layers to Conv2d.
  10. Properly set quant_min and quant_max in Observers to an offset of 128 in symmetric quantization.
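Item 1 can be illustrated with a minimal NumPy sketch (shapes are hypothetical, chosen only for illustration): splitting one batched multi-head score computation into independent per-head matmuls produces identical results, while giving the compiler several small graphs instead of one large one.

```python
import numpy as np

# Hypothetical shapes for illustration only.
heads, seq, head_dim = 4, 3, 2
rng = np.random.default_rng(0)
q = rng.random((heads, seq, head_dim))
k = rng.random((heads, seq, head_dim))

# One fused multi-head score computation...
fused = np.einsum('hqd,hkd->hqk', q, k)

# ...versus `heads` independent single-head computations.
per_head = np.stack([q[h] @ k[h].T for h in range(heads)])

assert np.allclose(fused, per_head)
```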
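Items 2 and 6 describe keeping the KV cache as (quantized) graph inputs/outputs and updating it on the host. A purely illustrative sketch of such a host-side sliding-window update follows; the real logic lives in qnn_llama_runner.cpp and is more involved, and the function name here is hypothetical.

```python
def update_kv_cache(cache, new_entry):
    """Hypothetical host-side cache update: drop the oldest slot and
    append the newest key/value produced by the last inference step."""
    return cache[1:] + [new_entry]

# The cache is fed back into the next graph invocation as a plain input,
# so the accelerator graph itself never mutates it.
cache = [0, 0, 0, 0]
for token_kv in [1, 2]:
    cache = update_kv_cache(cache, token_kv)
```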
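Item 9 rests on the fact that a Linear layer (matmul plus bias) and a 1x1 Conv2d compute the same thing, and accelerator backends often prefer the Conv2d form. A minimal NumPy sketch of that equivalence, with made-up sizes:

```python
import numpy as np

# Hypothetical sizes for illustration only.
out_f, in_f, seq = 5, 3, 4
rng = np.random.default_rng(1)
w = rng.random((out_f, in_f))   # Linear weight
b = rng.random(out_f)           # Linear bias
x = rng.random((seq, in_f))     # (positions, features)

# Linear: y = x @ W^T + b
y_linear = x @ w.T + b

# 1x1 Conv2d over the input viewed as (channels, height, width):
# every output "pixel" is w @ x[:, s, 0] + b, i.e. the same matmul.
x_chw = x.T[:, :, None]                              # (in_f, seq, 1)
y_conv = np.einsum('oi,ihw->ohw', w, x_chw) + b[:, None, None]

assert np.allclose(y_linear, y_conv[:, :, 0].T)
```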
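Item 10 can be sketched as follows, under the assumption that "offset=128" means storing symmetrically quantized values as unsigned 8-bit with a fixed shift of 128: the signed range is clamped to [-127, 127] so it stays symmetric around zero, then shifted into [1, 255]. The function names are hypothetical, not the actual Observer API.

```python
def quantize_symmetric_u8(values, offset=128):
    """Hypothetical sketch: symmetric quantization stored as unsigned
    8-bit, clamping the signed code to [-127, 127] before shifting."""
    scale = max(abs(v) for v in values) / 127.0
    quantized = []
    for v in values:
        q = round(v / scale)
        q = max(-127, min(127, q))  # symmetric range around zero
        quantized.append(q + offset)
    return quantized, scale

def dequantize_u8(quantized, scale, offset=128):
    """Inverse of the sketch above."""
    return [(q - offset) * scale for q in quantized]
```

With this convention, 0.0 always maps to code 128, and the extremes map to 1 and 255.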

chiwwang avatar May 17 '24 10:05 chiwwang

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/3656

Note: Links to docs will display an error until the docs builds have been completed.

:x: 3 New Failures

As of commit aaada7f0c2422928efc4b6be3428f7b995a5578e with merge base 400860066c9806af0361fdd634a7cd5bffb6cda0:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar May 17 '24 10:05 pytorch-bot[bot]

Please see https://github.com/pytorch/executorch/pull/4142 instead.

chiwwang avatar Jul 03 '24 08:07 chiwwang