[Draft] Qualcomm AI Engine Direct - Support kv_cached llama2 model
Summary
- Support static kv_cached llama2 model
- We referenced the AIMET Jupyter notebooks and implemented a static LLAMA
- Add qnn_llama_runner to run static LLAMA
- Add an e2e example script, verified with stories110M
Notes
- In fp16 mode, the model can be compiled and executed on the device with accurate results. However, its performance still needs improvement, which will be addressed after the quantized llama2 is completed.
- In 8a8w quantized mode, the model can also be compiled, and the compiled graph is similar to the fp16 one. However, when executed on the device, the results are not as expected.
- For now, we are moving on to 16-bit quantization.
- The main difference between static LLAMA and the existing examples/models/llama2 is that we treat the KV cache as graph I/O (see the sketch below).
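Below is a minimal sketch (hypothetical module and tensor names, single head, no attention mask) of what treating the KV cache as graph I/O means: the caches enter and leave the graph as plain tensors with static shapes, and the runner feeds the updated caches back in on the next decode step.
# Minimal sketch with hypothetical names: a single-token decode step where the
# KV cache is an explicit graph input/output instead of internal mutable state.
import torch

class StaticDecodeStep(torch.nn.Module):
    def forward(self, q, k, v, k_cache, v_cache):
        # q, k, v: [batch, 1, dim]; k_cache, v_cache: [batch, cache_len, dim]
        # Illustrative shift-and-append update; the actual model may update by position.
        new_k_cache = torch.cat([k_cache[:, 1:], k], dim=1)
        new_v_cache = torch.cat([v_cache[:, 1:], v], dim=1)
        attn = torch.softmax(q @ new_k_cache.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        out = attn @ new_v_cache
        # Updated caches are returned as outputs so the runner can pass them back in.
        return out, new_k_cache, new_v_cache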
Compiled graph
For now, we fall back the following ops, which read and update the attention mask (see the sketch after this list):
- aten_index_tensor
- aten_index_put_default
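As an illustration only (the names and values here are made up, not the actual model code), the kind of tensor-index read and write on the attention mask that typically lowers to these two ops looks like:
# Illustrative only: tensor-index reads/writes on the attention mask typically
# lower to aten.index.Tensor and aten.index_put.default and stay on CPU.
import torch

max_seq_len = 128
atten_mask = torch.full((1, max_seq_len), float("-inf"))
pos = torch.tensor([5])

current_mask = atten_mask[:, pos]    # read  -> aten_index_tensor
atten_mask[:, pos] = torch.zeros(1)  # write -> aten_index_put_default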
Prepare model
Download and prepare the stories110M model
# tokenizer.model & stories110M.pt:
wget "https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.pt"
wget "https://raw.githubusercontent.com/karpathy/llama2.c/master/tokenizer.model"
# tokenizer.bin:
python -m examples.models.llama2.tokenizer.tokenizer -t tokenizer.model -o tokenizer.bin
# params.json:
echo '{"dim": 768, "multiple_of": 32, "n_heads": 12, "n_layers": 12, "norm_eps": 1e-05, "vocab_size": 32000}' > params.json
Run e2e example script
# fp16:
python examples/qualcomm/llama2/llama.py -a xxx -b build_android -s xxx -m SM8650 -F --checkpoint stories110M --params params.json --tokenizer_bin tokenizer.bin --prompt Once
# quant:
python examples/qualcomm/llama2/llama.py -a xxx -b build_android -s xxx -m SM8650 --ptq 8a8w --tokenizer_model tokenizer.model --checkpoint stories110M --params params.json --tokenizer_bin tokenizer.bin --prompt Once
Also curious what visualization tool you're using
I think we use FxGraphDrawer to visualize the graph module https://github.com/pytorch/executorch/blob/d761f99f6fc952975835f4807a12319c121c4b90/backends/qualcomm/utils/utils.py#L136
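A rough standalone usage sketch of FxGraphDrawer (not the exact call site linked above; requires pydot/graphviz to be installed):
# Rough sketch: dump a traced graph module to an SVG for inspection.
import torch
from torch.fx.passes.graph_drawer import FxGraphDrawer

gm = torch.fx.symbolic_trace(torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.ReLU()))
drawer = FxGraphDrawer(gm, "llama_graph")
drawer.get_dot_graph().write_svg("llama_graph.svg")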
I'm trying to repro on my side. What QNN library version did you use? The error message on my side is
[ERROR] [Qnn ExecuTorch]: initial_sequencer_dp.cc:160:ERROR:A single op, "q::Concat" (Op ID: 315c00000086e), requires 0xa00000 bytes of TCM, which is greater than the TCM size of 0x800000!
[ERROR] [Qnn ExecuTorch]: initial_sequencer_dp.cc:167:ERROR:The name of the failing op before optimization is: "q::QNN_Reshape" (Op ID: 86e).
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> "aten_view_copy_default_423" generated: Requires 0xa00000 bytes of TCM, which is greater than the TCM size of 0x800000!
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> RouterX86 graph prepare failed 13
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to finalize graph (id: 2) with err 1002
[ERROR] [Qnn ExecuTorch]: Failed to finalize Qnn Graph with error: 1002
I use QNN 2.20 and can reproduce on SM8475 from my side.
I was using QNN 2.19 and just switched to 2.20. I'm using SM8450 on my side.
I was able to repro the fp version on my side, but for the 8a8w version, I hit a model loading error:
[ERROR] [Qnn ExecuTorch]: <E> Skel failed to process context binary.
[ERROR] [Qnn ExecuTorch]: <E> Context create from binary failed for deviceId 0 coreId 0 pdId 0 err 5005
[ERROR] [Qnn ExecuTorch]: <E> Fail to create context from binary with err 5005
[WARNING] [Qnn ExecuTorch]: <W> sg_stubPtr is not null, skip loadRemoteSymbols
[ERROR] [Qnn ExecuTorch]: <E> Failed to create context from binary with err 0x138d
[ERROR] [Qnn ExecuTorch]: Can't create context from binary. Error 5005.
Is it the same issue you observe on your side?
No. For 8a8w, we get a compiled graph that is the same as the fp16 one. We can run it, but it produces meaningless results, such as "Once upon metropolII pisткаDS fünf área blablabla".
Turns out I forgot the --ptq flag... I can repro both fp and 8a8w now.
What does the performance look like on your side? From the log output, it seems like 1-2 toks/s for fp and 0.6 toks/s for 8a8w. Did I miss something?
Great! We can start to align our results. Our performance on SM8650 is 2~3 toks/s for both 8a8w and fp16; we will try to enhance it after completing the quantized llama2.
Results
For FP16: Once upon a time, there was a little boy named Timmy. Timmy loved to play outside and explore the world around him. One day, he went on an adventure in the forest and found a mysterious cave. He was scared at first, but he decided to go inside and see what was there. As he walked deeper into the cave, he saw a big rock. Timmy climbed on top of the rock and looked around. Suddenly, he heard a voice say, "Hello there!" It was a friendly bear who lived in the cave. The bear showed Timmy around and they had
For 8a8w: Once upon Ell une captain walked Споcompleteämestionsĕ SrABLEпри gobiernoátAppDataIntervalере equipÌ Naturalтикkw recallkt Neder выпол musicaсковtaient msgAccessor prem conflrecopherPH sans regards Hartslug classe thereby atomÄwrapperộ interactiveдовentre anncios tecn⋅ podczas的 Monsieur್clud vid若 ру suf MRстыGridyll вос integrateałyóg Capeція PragachsenOPT ствоPMiro visibility mij津 proprioziłicutiwersдом Bayindust двухgenericinnerHTMLdisplaystyle percent altreț Tem estateModelswendungȚzeug станPTческихdg omittedъ absolv premiers Monsieurљу Verd arquitectвид exterior lleguousSeconds absolvreduallas denotedServletHOSTlassen
2~3 toks/s for 8a8w still seems really slow - do we know which part is causing the perf regression? Does the delegated part run reasonably fast while the CPU part is too slow?
Hi @cccclai, we fixed the 16a4w accuracy issue; it is resolved by PR 3196.
Results[0]: Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, fluffy cloud in the sky. It looked like a giant cotton candy! Lily ran inside to tell her mommy about the cloud. "Mommy, mommy, look at the big cloud in the sky! It looks like a giant cotton candy!" she said.
Dear @shewu-quic @cccclai,
does PR 3196 resolve the issue https://github.com/pytorch/executorch/issues/2590? If so, I will close the issue. Thank you in advance!
Thanks for the update and for sending the fix! Feel free to mark it as resolved and re-open it if anyone runs into the same issue again.
Rebased as https://github.com/pytorch/executorch/pull/3656/
Please see https://github.com/pytorch/executorch/pull/4142 instead.