
local change to export llama to qnn

cccclai opened this issue 1 year ago · 8 comments

  1. AOT, generate the QNN-delegated model: `python -m examples.models.llama2.export_llama --qnn --use_kv_cache -p /home/chenlai/models/stories110M/params.json -c /home/chenlai/models/stories110M/stories110M.pt`

  2. Runtime: follow build_llama_android.sh with the QNN config on, then run: `./llama_main --model_path=./stories_qnn_SM8450.pte --tokenizer_path=./tokenizer.bin --prompt="Once"`

cccclai avatar Apr 11 '24 04:04 cccclai

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/2985

Note: Links to docs will display an error until the docs builds have been completed.

:x: 5 New Failures

As of commit 796ae1cef53cee0ff3968b3de25cd9bfa06c399c with merge base d3326a2073dee7baf78044fb3afd0772edbc616a (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Apr 11 '24 04:04 pytorch-bot[bot]

Hi Chen, thanks for sharing. I tried to reproduce this but hit the error below. May I ask what I might have missed?

cmake-android-out/examples/models/llama2/llama_main: 1 file pushed. 36.5 MB/s (542730752 bytes in 14.174s)
llama2.pte: 1 file pushed. 66.5 MB/s (196377840 bytes in 2.816s)
tokenizer.bin: 1 file pushed. 17.4 MB/s (433869 bytes in 0.024s)
cmake-android-out/lib/libqnn_executorch_backend.so: 1 file pushed. 25.2 MB/s (1025160 bytes in 0.039s)
/opt/qcom/aistack/qnn/2.21.0.240326/lib/aarch64-android/libQnnHtp.so: 1 file pushed. 24.8 MB/s (1573896 bytes in 0.061s)
/opt/qcom/aistack/qnn/2.21.0.240326/lib/aarch64-android/libQnnHtpV75Stub.so: 1 file pushed. 20.3 MB/s (291992 bytes in 0.014s)
/opt/qcom/aistack/qnn/2.21.0.240326/lib/aarch64-android/libQnnSystem.so: 1 file pushed. 24.0 MB/s (230864 bytes in 0.009s)
/opt/qcom/aistack/qnn/2.21.0.240326/lib/hexagon-v75/unsigned/libQnnHtpV75Skel.so: 1 file pushed. 53.0 MB/s (12046348 bytes in 0.217s)
2024-04-12T11:13:36+08:00  - Running...
2024-04-12T11:13:36+08:00  - export LD_LIBRARY_PATH=/data/local/tmp/llama2_cc:/opt/qcom/aistack/qnn/2.21.0.240326/lib/x86_64-linux-clang && export ADSP_LIBRARY_PATH=/data/local/tmp/llama2_cc && cd /data/local/tmp/llama2_cc && ./llama_main --model_path=./llama2.pte --tokenizer_path=./tokenizer.bin --prompt='Once'
E 00:00:00.000208 executorch:operator_registry.cpp:75] Re-registering aten::sym_size.int, from NOT_SUPPORTED
E 00:00:00.000392 executorch:operator_registry.cpp:76] key: (null), is_fallback: true
F 00:00:00.000432 executorch:operator_registry.cpp:33] In function register_kernels(), assert failed (false): Kernel registration failed with error 18, see error log for details.
Aborted
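For context, the abort appears to come from the kernel registry refusing to re-register `aten::sym_size.int` after it was linked in twice. A minimal Python sketch of that failure mode, purely illustrative (the class and function names here are hypothetical, not ExecuTorch APIs):

```python
# Illustrative sketch of a kernel registry that rejects duplicate
# registrations, mirroring the "Re-registering aten::sym_size.int"
# abort in the log above. Names are hypothetical.

class KernelRegistrationError(Exception):
    pass

class KernelRegistry:
    def __init__(self):
        self._kernels = {}

    def register(self, op_name, fn):
        if op_name in self._kernels:
            # ExecuTorch's C++ registry aborts at this point (error 18);
            # we raise instead so the failure is observable in Python.
            raise KernelRegistrationError(f"Re-registering {op_name}")
        self._kernels[op_name] = fn

registry = KernelRegistry()
registry.register("aten::sym_size.int", lambda t, dim: t.shape[dim])
try:
    # A second library linking the same kernel triggers the failure.
    registry.register("aten::sym_size.int", lambda t, dim: t.shape[dim])
except KernelRegistrationError as e:
    print(e)
```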

shewu-quic avatar Apr 12 '24 03:04 shewu-quic

> sym_size

Oh, you may need this change: https://github.com/pytorch/executorch/pull/2934

In the meantime, this line probably needs to be updated, because there is a bug in the constant prop pass:

m = convert_pt2e(m, fold_quantize=False)

I've submitted a change here to fix the constant prop pass: https://github.com/pytorch/pytorch/pull/123909
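For readers unfamiliar with the flag: when folding is enabled, constant propagation precomputes `quantize(weight)` at export time and embeds the int8 result as a constant, which is the step the buggy pass mishandles. A toy illustration of that folding, not the actual PyTorch/ExecuTorch pass:

```python
# Toy illustration of folding a quantize op over a constant weight
# (the behavior fold_quantize=False disables while the constant-prop
# bug is being fixed). Graph representation here is made up.

def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    # Affine per-tensor quantization to int8.
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]

# Graph as (op, input, scale, zero_point) tuples; "weight" is a
# compile-time constant, so the quantize node is foldable.
constants = {"weight": [0.5, -1.25, 2.0]}
graph = [("quantize", "weight", 0.02, 0)]

# Constant prop: every input is constant, so precompute the result
# and store it as a new int8 constant instead of running at runtime.
folded = {}
for op, name, scale, zp in graph:
    if op == "quantize" and name in constants:
        folded[name + "_q"] = quantize(constants[name], scale, zp)

print(folded)
```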

cccclai avatar Apr 12 '24 04:04 cccclai

Also, ideally qnn_executorch_backend doesn't need to depend on the whole executorch library, just these targets: https://github.com/pytorch/executorch/blob/main/runtime/backend/targets.bzl#L13-L32

cccclai avatar Apr 12 '24 04:04 cccclai

> sym_size
>
> Oh, you may need this change: #2934
>
> In the meantime, this line probably needs to be updated, because there is a bug in the constant prop pass:
>
> m = convert_pt2e(m, fold_quantize=False)
>
> I've submitted a change here to fix the constant prop pass: pytorch/pytorch#123909

Thanks for your reply. I will try it.

shewu-quic avatar Apr 12 '24 04:04 shewu-quic

> Also ideally qnn_executorch_backend doesn't necessarily need to depend on the whole executorch library, just these targets: https://github.com/pytorch/executorch/blob/main/runtime/backend/targets.bzl#L13-L32

That's great, we will try to refine our dependency. For now, qnn_executorch_backend depends on the executorch_no_prim_ops target: https://github.com/pytorch/executorch/blob/6acc86ff5d869025cc874afba8051146b1daf112/backends/qualcomm/CMakeLists.txt#L251 May I know which target you would recommend?

shewu-quic avatar Apr 12 '24 04:04 shewu-quic

> Also ideally qnn_executorch_backend doesn't necessarily need to depend on the whole executorch library, just these targets: https://github.com/pytorch/executorch/blob/main/runtime/backend/targets.bzl#L13-L32
>
> That's great. We will try to refine our dependency. For now, qnn_executorch_backend depends on the executorch_no_prim_ops target.
>
> https://github.com/pytorch/executorch/blob/6acc86ff5d869025cc874afba8051146b1daf112/backends/qualcomm/CMakeLists.txt#L251
>
> May I know which target do you recommend?

You'd probably need to check the corresponding CMake target. In Buck, it's runtime/backend:interface, which should already include "//runtime/core:core", "//runtime/core:evalue", "//runtime/core:event_tracer", and "//runtime/core:memory_allocator".

cccclai avatar Apr 12 '24 06:04 cccclai

I can run it now. May I check the results with you?

I get 37 partitions, and the accuracy is not good, e.g. "Once nieíoVA аas blablabla".

We have investigated this on our side. The cause seems related to RMS norm: we observe a large scale (about 10~30) for the mul op in RMS norm. When I fall back the RMS norm ops (25 partitions), I get better results, such as "Once upon a time, there was a mommy and a daddy blablalalb". But as you can see, there is still a gap from the expected result, "Once upon a time, there was a little girl named Lily. She loved to play outside". We are trying to fix it.
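The accuracy gap is consistent with those large scales: for per-tensor int8 affine quantization, the worst-case round-trip error is scale/2 per element, so a mul quantized with scale 10~30 wipes out small activation values entirely. A quick back-of-the-envelope check (illustrative only, not the QNN quantizer):

```python
# Rough check of why a large quantization scale on the RMS-norm mul
# hurts accuracy: int8 round-trip error grows linearly with scale.

def quant_dequant(x, scale, zp=0, qmin=-128, qmax=127):
    # Quantize a scalar to int8, then dequantize it back to float.
    q = max(qmin, min(qmax, round(x / scale) + zp))
    return (q - zp) * scale

x = 3.7  # a typical activation magnitude
for scale in (0.05, 10.0, 30.0):
    err = abs(quant_dequant(x, scale) - x)
    print(f"scale={scale:>5}: round-trip error {err:.3f}")

# With scale 10 or 30, x quantizes to 0 and the value is lost
# completely, which matches the garbled output when RMS norm runs
# quantized on HTP.
```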

shewu-quic avatar Apr 12 '24 08:04 shewu-quic