[Draft] Qualcomm AI Engine Direct - Support kv_cached llama2 model
Summary
- Support static kv_cached llama2 model
- We referenced the AIMET Jupyter notebooks and implemented a static LLAMA
- Add qnn_llama_runner to run static LLAMA
- Add an e2e example script, verified with stories110M
Notes
- In fp16 mode, the model can be compiled and executed on the device with accurate results. However, its performance still needs improvement, which will be addressed after the quantized llama2 is completed.
- In 8a8w quantized mode, the model can also be compiled, and the compiled graph is similar to the fp16 one. However, when executed on the device, the results are not as expected.
- For now, we are moving on to 16-bit quantization.
- The main difference between static LLAMA and the existing examples/models/llama2 is that we treat the KV cache as graph I/O (see the sketch below).
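Below is a minimal sketch (hypothetical module and tensor names, single head, no attention mask) of what treating the KV cache as graph I/O means: the caches enter and leave the graph as plain tensors with static shapes, and the runner feeds the updated caches back in on the next decode step.
# Minimal sketch with hypothetical names: a single-token decode step where the
# KV cache is an explicit graph input/output instead of internal mutable state.
import torch

class StaticDecodeStep(torch.nn.Module):
    def forward(self, q, k, v, k_cache, v_cache):
        # q, k, v: [batch, 1, dim]; k_cache, v_cache: [batch, cache_len, dim]
        # Illustrative shift-and-append update; the actual model may update by position.
        new_k_cache = torch.cat([k_cache[:, 1:], k], dim=1)
        new_v_cache = torch.cat([v_cache[:, 1:], v], dim=1)
        attn = torch.softmax(q @ new_k_cache.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        out = attn @ new_v_cache
        # Updated caches are returned as outputs so the runner can pass them back in.
        return out, new_k_cache, new_v_cache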
Compiled graph
For now, we fall back the following ops, which read and update the attention mask (see the sketch after this list):
- aten_index_tensor
- aten_index_put_default
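As an illustration only (the names and values here are made up, not the actual model code), the kind of tensor-index read and write on the attention mask that typically lowers to these two ops looks like:
# Illustrative only: tensor-index reads/writes on the attention mask typically
# lower to aten.index.Tensor and aten.index_put.default and stay on CPU.
import torch

max_seq_len = 128
atten_mask = torch.full((1, max_seq_len), float("-inf"))
pos = torch.tensor([5])

current_mask = atten_mask[:, pos]    # read  -> aten_index_tensor
atten_mask[:, pos] = torch.zeros(1)  # write -> aten_index_put_default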
Prepare model
Download and prepare the stories110M model
# tokenizer.model & stories110M.pt:
wget "https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.pt"
wget "https://raw.githubusercontent.com/karpathy/llama2.c/master/tokenizer.model"
# tokenizer.bin:
python -m examples.models.llama2.tokenizer.tokenizer -t tokenizer.model -o tokenizer.bin
# params.json:
echo '{"dim": 768, "multiple_of": 32, "n_heads": 12, "n_layers": 12, "norm_eps": 1e-05, "vocab_size": 32000}' > params.json
Run e2e example script
# fp16:
python examples/qualcomm/llama2/llama.py -a xxx -b build_android -s xxx -m SM8650 -F --checkpoint stories110M --params params.json --tokenizer_bin tokenizer.bin --prompt Once
# quant:
python examples/qualcomm/llama2/llama.py -a xxx -b build_android -s xxx -m SM8650 --ptq 8a8w --tokenizer_model tokenizer.model --checkpoint stories110M --params params.json --tokenizer_bin tokenizer.bin --prompt Once
Also curious what visualization tool you're using
I think we use FxGraphDrawer to visualize the graph module https://github.com/pytorch/executorch/blob/d761f99f6fc952975835f4807a12319c121c4b90/backends/qualcomm/utils/utils.py#L136
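A rough standalone usage sketch of FxGraphDrawer (not the exact call site linked above; requires pydot/graphviz to be installed):
# Rough sketch: dump a traced graph module to an SVG for inspection.
import torch
from torch.fx.passes.graph_drawer import FxGraphDrawer

gm = torch.fx.symbolic_trace(torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.ReLU()))
drawer = FxGraphDrawer(gm, "llama_graph")
drawer.get_dot_graph().write_svg("llama_graph.svg")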
I'm trying to repro on my side. What QNN library version did you use? The error message on my side is
[ERROR] [Qnn ExecuTorch]: initial_sequencer_dp.cc:160:ERROR:A single op, "q::Concat" (Op ID: 315c00000086e), requires 0xa00000 bytes of TCM, which is greater than the TCM size of 0x800000!
[ERROR] [Qnn ExecuTorch]: initial_sequencer_dp.cc:167:ERROR:The name of the failing op before optimization is: "q::QNN_Reshape" (Op ID: 86e).
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> "aten_view_copy_default_423" generated: Requires 0xa00000 bytes of TCM, which is greater than the TCM size of 0x800000!
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> RouterX86 graph prepare failed 13
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to finalize graph (id: 2) with err 1002
[ERROR] [Qnn ExecuTorch]: Failed to finalize Qnn Graph with error: 1002
I use QNN 2.20 and can reproduce on SM8475 from my side.
I was using QNN 2.19 and just switched to 2.20. I'm using SM8450 on my side.
I was able to repro the fp version on my side, but for the 8a8w version, I hit a model loading error:
[ERROR] [Qnn ExecuTorch]: <E> Skel failed to process context binary.
[ERROR] [Qnn ExecuTorch]: <E> Context create from binary failed for deviceId 0 coreId 0 pdId 0 err 5005
[ERROR] [Qnn ExecuTorch]: <E> Fail to create context from binary with err 5005
[WARNING] [Qnn ExecuTorch]: <W> sg_stubPtr is not null, skip loadRemoteSymbols
[ERROR] [Qnn ExecuTorch]: <E> Failed to create context from binary with err 0x138d
[ERROR] [Qnn ExecuTorch]: Can't create context from binary. Error 5005.
Is it the same issue you observe on your side?
No. For 8a8w, we get a compiled graph that is the same as the fp16 one. We can run it, but it produces meaningless results, such as "Once upon metropolII pisткаDS fünf área blablabla".
Turns out I forgot the --ptq flag... I can repro both fp and 8a8w now.
What does the performance look like on your side? From the log output, it seems like 1-2 toks/s for fp and 0.6 toks/s for 8a8w. Did I miss something?
Great! We can start to align our results. Our performance on SM8650 is 2~3 toks/s for both 8a8w and fp16; we will try to enhance it after completing the quantized llama2.
Results
For FP16: Once upon a time, there was a little boy named Timmy. Timmy loved to play outside and explore the world around him. One day, he went on an adventure in the forest and found a mysterious cave. He was scared at first, but he decided to go inside and see what was there. As he walked deeper into the cave, he saw a big rock. Timmy climbed on top of the rock and looked around. Suddenly, he heard a voice say, "Hello there!" It was a friendly bear who lived in the cave. The bear showed Timmy around and they had
For 8a8w: Once upon Ell une captain walked Споcompleteämestionsĕ SrABLEпри gobiernoátAppDataIntervalере equipÌ Naturalтикkw recallkt Neder выпол musicaсковtaient msgAccessor prem conflrecopherPH sans regards Hartslug classe thereby atomÄwrapperộ interactiveдовentre anncios tecn⋅ podczas的 Monsieur್clud vid若 ру suf MRстыGridyll вос integrateałyóg Capeція PragachsenOPT ствоPMiro visibility mij津 proprioziłicutiwersдом Bayindust двухgenericinnerHTMLdisplaystyle percent altreț Tem estateModelswendungȚzeug станPTческихdg omittedъ absolv premiers Monsieurљу Verd arquitectвид exterior lleguousSeconds absolvreduallas denotedServletHOSTlassen
2~3 toks/s for 8a8w still seems really slow - do we know which part is causing the perf regression? Does the delegated part run reasonably fast while the CPU part is too slow?
Hi @cccclai, we fixed the 16a4w accuracy issue; it is resolved by PR 3196.
Results[0]: Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, fluffy cloud in the sky. It looked like a giant cotton candy! Lily ran inside to tell her mommy about the cloud. "Mommy, mommy, look at the big cloud in the sky! It looks like a giant cotton candy!" she said.
Dear @shewu-quic @cccclai,
does PR 3196 resolve the issue https://github.com/pytorch/executorch/issues/2590? If so, I will close the issue. Thank you in advance!
Thanks for the update and for sending the fix! Feel free to mark it as resolved and re-open it if anyone runs into the same issue again.
Rebased as https://github.com/pytorch/executorch/pull/3656/
Please see https://github.com/pytorch/executorch/pull/4142 instead.