ExecuTorch with QNN AI Engine Backend

Open nambn007 opened this issue 2 months ago • 8 comments

๐Ÿ› Describe the bug

I'm using the native C++ code from examples/qualcomm/oss_scripts/runner to load and run the model. However, when I integrate it into my Android app using JNI, I encounter the following issues:

2025-10-25 13:16:45.866 12535-12561 NativeStdout com.google.ai.edge.samples.rag I I tokenizers:hf_tokenizer.cpp:327] normalized input: '' -> ''
2025-10-25 13:16:45.866 12535-12535 [Qnn ExecuTorch] com.google.ai.edge.samples.rag I Deserializing processed data using QnnContextCustomProtocol
2025-10-25 13:16:45.866 12535-12561 NativeStdout com.google.ai.edge.samples.rag I [INFO] [Qnn ExecuTorch]: Deserializing processed data using QnnContextCustomProtocol
2025-10-25 13:16:45.867 12535-12535 [Qnn ExecuTorch] com.google.ai.edge.samples.rag I create QNN Logger with log_level 1
2025-10-25 13:16:45.867 12535-12561 NativeStdout com.google.ai.edge.samples.rag I [INFO] [Qnn ExecuTorch]: create QNN Logger with log_level 1
2025-10-25 13:16:45.867 12535-12535 [Qnn ExecuTorch] com.google.ai.edge.samples.rag I Initialize Qnn backend parameters for Qnn executorch backend type 2
2025-10-25 13:16:45.867 12535-12561 NativeStdout com.google.ai.edge.samples.rag I [INFO] [Qnn ExecuTorch]: Initialize Qnn backend parameters for Qnn executorch backend type 2
2025-10-25 13:16:45.865 12535-12535 dge.samples.rag com.google.ai.edge.samples.rag I type=1400 audit(0.0:588): avc: denied { read } for name="sku" dev="sysfs" ino=84439 scontext=u:r:untrusted_app:s0:c248,c256,c512,c768 tcontext=u:object_r:vendor_sysfs_soc:s0 tclass=file permissive=1 app=com.google.ai.edge.samples.rag
2025-10-25 13:16:45.869 12535-12535 [Qnn ExecuTorch] com.google.ai.edge.samples.rag I Caching: Caching is in RESTORE MODE.
2025-10-25 13:16:45.869 12535-12535 [Qnn ExecuTorch] com.google.ai.edge.samples.rag I QnnContextCustomProtocol expected magic number: 0x5678abcd but get: 0x2000000
2025-10-25 13:16:45.869 12535-12561 NativeStdout com.google.ai.edge.samples.rag I [INFO] [Qnn ExecuTorch]: Caching: Caching is in RESTORE MODE.
[INFO] [Qnn ExecuTorch]: QnnContextCustomProtocol expected magic number: 0x5678abcd but get: 0x2000000
2025-10-25 13:16:45.894 12535-12535 audit com.google.ai.edge.samples.rag W audit_lost=310 audit_rate_limit=5 audit_backlog_limit=64
2025-10-25 13:16:45.894 12535-12535 audit com.google.ai.edge.samples.rag E rate limit exceeded
2025-10-25 13:16:45.871 12535-12535 DMABUFHEAPS com.google.ai.edge.samples.rag I Using DMA-BUF heap named: system
2025-10-25 13:16:45.871 12535-12535 com.google...amples.rag com.google.ai.edge.samples.rag I vendor/qcom/proprietary/adsprpc/src/rpcmem_android.c:182: set up allocator 0xb400007613f648d0 for DMA buf heap system, ION heap system, heap mask 0x2000000, flags 0x1, legacy flags 0x1
2025-10-25 13:16:45.871 12535-12535 com.google...amples.rag com.google.ai.edge.samples.rag I vendor/qcom/proprietary/adsprpc/src/fastrpc_config.c:304: Reading configuration file: com.google.ai.edge.samples.rag.debugconfig
2025-10-25 13:16:45.871 12535-12535 com.google...amples.rag com.google.ai.edge.samples.rag I vendor/qcom/proprietary/adsprpc/src/fastrpc_config.c:335: fastrpc_config_init: Couldn't find file com.google.ai.edge.samples.rag.debugconfig, errno (No such file or directory) at /vendor/lib/rfsa/adsp, /vendor/dsp/cdsp, /vendor/dsp,
2025-10-25 13:16:45.871 12535-12535 com.google...amples.rag com.google.ai.edge.samples.rag I vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:3803: fastrpc_apps_user_init done with default domain:3 and &fastrpc_trace:0x7449dedffc
2025-10-25 13:16:45.871 12535-12535 com.google...amples.rag com.google.ai.edge.samples.rag I vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:3917: multidsplib_env_init: libcdsprpc.so loaded
2025-10-25 13:16:45.872 12535-12535 com.google...amples.rag com.google.ai.edge.samples.rag I vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:2542: remote_session_control Unsigned PD enable 1 request for domain 3
2025-10-25 13:16:45.872 12535-12535 com.google...amples.rag com.google.ai.edge.samples.rag I vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:2525: remote_session_control DSP info request for domain 3, thread priority -1, stack size 17408
2025-10-25 13:16:45.872 12535-12535 com.google...amples.rag com.google.ai.edge.samples.rag I vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:3218: Successfully opened /vendor/dsp/cdsp/fastrpc_shell_unsigned_3, domain 3
2025-10-25 13:16:45.893 12535-12535 com.google...amples.rag com.google.ai.edge.samples.rag I vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:3321: Created user PD on domain 3, debug_trace 0x0, enabled attr=> RPC timeout:0, Debug Mode:N, CRC check:N, Unsigned:Y, Signed:N, Adapt QOS:N, PD dump: (Config:N, Debug:N), Perf: (Kernel:N, DSP:N), Iregion:N, QTF Tracing:N, UAF heap:N userPD initmem length:0x5
2025-10-25 13:16:45.897 12535-12563 com.google...amples.rag com.google.ai.edge.samples.rag I vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:617: Successfully set remote user thread priority to 192 and stack size to 17408 for domain 3
2025-10-25 13:16:45.897 12535-12563 com.google...amples.rag com.google.ai.edge.samples.rag I vendor/qcom/proprietary/adsprpc/src/listener_android.c:116: listener thread starting
2025-10-25 13:16:45.898 12535-12535 com.google...amples.rag com.google.ai.edge.samples.rag I vendor/qcom/proprietary/adsprpc/src/fastrpc_perf.c:273: fastrpc_perf_init: enabled systrace 0x0 and RPC traces (kernel 0, dsp 0) with frequency 1000
2025-10-25 13:16:45.898 12535-12564 com.google...amples.rag com.google.ai.edge.samples.rag I vendor/qcom/proprietary/adsprpc/src/log_config.c:348: file_watcher_thread starting for domain 3
2025-10-25 13:16:45.898 12535-12564 com.google...amples.rag com.google.ai.edge.samples.rag W vendor/qcom/proprietary/adsprpc/src/log_config.c:358:file_watcher_thread: Couldn't find file com.google.ai.edge.samples.rag.farf, errno (No such file or directory)
2025-10-25 13:16:45.893 12535-12535 dge.samples.rag com.google.ai.edge.samples.rag I type=1400 audit(0.0:601): avc: denied { watch } for path="/vendor/lib/rfsa/adsp" dev="overlay" ino=16 scontext=u:r:untrusted_app:s0:c248,c256,c512,c768 tcontext=u:object_r:vendor_file:s0 tclass=dir permissive=1 app=com.google.ai.edge.samples.rag
2025-10-25 13:16:45.893 12535-12535 dge.samples.rag com.google.ai.edge.samples.rag I type=1400 audit(0.0:602): avc: denied { read } for name="cdsp" dev="sde11" ino=70 scontext=u:r:untrusted_app:s0:c248,c256,c512,c768 tcontext=u:object_r:adsprpcd_file:s0 tclass=dir permissive=1 app=com.google.ai.edge.samples.rag
2025-10-25 13:16:45.893 12535-12535 dge.samples.rag com.google.ai.edge.samples.rag I type=1400 audit(0.0:603): avc: denied { watch } for path="/vendor/dsp/cdsp" dev="sde11" ino=70 scontext=u:r:untrusted_app:s0:c248,c256,c512,c768 tcontext=u:object_r:adsprpcd_file:s0 tclass=dir permissive=1 app=com.google.ai.edge.samples.rag
2025-10-25 13:16:45.899 12535-12563 com.google...amples.rag com.google.ai.edge.samples.rag I vendor/qcom/proprietary/adsprpc/src/mod_table.c:703: open_mod_table_open_from_static: reverse module apps_std opened with handle 0x49df0cc0 (idx 0)
2025-10-25 13:16:45.897 12535-12535 dge.samples.rag com.google.ai.edge.samples.rag I type=1400 audit(0.0:604): avc: denied { read } for name="libQnnHtpV73Skel.so" dev="dm-5" ino=2743 scontext=u:r:untrusted_app:s0:c248,c256,c512,c768 tcontext=u:object_r:vendor_file:s0 tclass=file permissive=1 app=com.google.ai.edge.samples.rag
2025-10-25 13:16:45.897 12535-12535 dge.samples.rag com.google.ai.edge.samples.rag I type=1400 audit(0.0:605): avc: denied { open } for path="/vendor/lib/rfsa/adsp/libQnnHtpV73Skel.so" dev="overlay" ino=2743 scontext=u:r:untrusted_app:s0:c248,c256,c512,c768 tcontext=u:object_r:vendor_file:s0 tclass=file permissive=1 app=com.google.ai.edge.samples.rag
2025-10-25 13:16:45.900 12535-12563 com.google...amples.rag com.google.ai.edge.samples.rag I vendor/qcom/proprietary/adsprpc/src/apps_std_imp.c:1002: Successfully opened file libQnnHtpV73Skel.so
2025-10-25 13:16:45.900 12535-12563 com.google...amples.rag com.google.ai.edge.samples.rag I vendor/qcom/proprietary/adsprpc/src/mod_table.c:703: open_mod_table_open_from_static: reverse module apps_mem opened with handle 0x49df0dc0 (idx 1)
2025-10-25 13:16:45.915 12535-12563 com.google...amples.rag com.google.ai.edge.samples.rag I vendor/qcom/proprietary/adsprpc/src/apps_std_imp.c:982: Successfully opened file /vendor/dsp/cdsp/libc++.so.1
2025-10-25 13:16:45.922 12535-12563 com.google...amples.rag com.google.ai.edge.samples.rag I vendor/qcom/proprietary/adsprpc/src/apps_std_imp.c:982: Successfully opened file /vendor/dsp/cdsp/libc++abi.so.1
2025-10-25 13:16:45.987 12535-12535 com.google...amples.rag com.google.ai.edge.samples.rag I vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:1576: remote_handle64_open: Successfully opened handle 0xb400007573fa9210 (remote handle 0xed3b00) for file:///libQnnHtpV73Skel.so?qnn_skel_handle_invoke&_modver=1.0&_dom=cdsp on domain 3 (spawn time 0 us, load time 0 us), num of open handles: 1
2025-10-25 13:16:45.988 12535-12535 [Qnn ExecuTorch] com.google.ai.edge.samples.rag E QnnDsp <E> Failed to retrieve skel build id: err: 10010
2025-10-25 13:16:45.988 12535-12561 NativeStdout com.google.ai.edge.samples.rag I [ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to retrieve skel build id: err: 10010
2025-10-25 13:16:45.988 12535-12535 [Qnn ExecuTorch] com.google.ai.edge.samples.rag E QnnDsp <E> Error in verify skel version
2025-10-25 13:16:45.988 12535-12561 NativeStdout com.google.ai.edge.samples.rag I [ERROR] [Qnn ExecuTorch]: QnnDsp <E> Error in verify skel version
2025-10-25 13:16:45.999 12535-12535 com.google...amples.rag com.google.ai.edge.samples.rag I vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:1602: remote_handle_close: closed handle 0xed3b00
2025-10-25 13:16:45.999 12535-12535 com.google...amples.rag com.google.ai.edge.samples.rag I vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:2867: domain_deinit for domain 3: dev 90
2025-10-25 13:16:45.999 12535-12563 com.google...amples.rag com.google.ai.edge.samples.rag E vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:3732: exit_thread current process1 thread exit nErr = 0x8000040d domain 3, handle 0xb400007573fa9cd0
2025-10-25 13:16:45.999 12535-12564 com.google...amples.rag com.google.ai.edge.samples.rag W vendor/qcom/proprietary/adsprpc/src/log_config.c:368:Warning: file_watcher_thread received exit for domain 3, file com.google.ai.edge.samples.rag.farf
2025-10-25 13:16:45.999 12535-12564 com.google...amples.rag com.google.ai.edge.samples.rag I vendor/qcom/proprietary/adsprpc/src/log_config.c:415: file_watcher_thread exiting for domain 3
2025-10-25 13:16:46.000 12535-12535 com.google...amples.rag com.google.ai.edge.samples.rag I vendor/qcom/proprietary/adsprpc/src/mod_table.c:789: open_mod_table_close: closed reverse module apps_mem with handle 0x49df0dc0
2025-10-25 13:16:46.000 12535-12535 com.google...amples.rag com.google.ai.edge.samples.rag I vendor/qcom/proprietary/adsprpc/src/mod_table.c:789: open_mod_table_close: closed reverse module apps_std with handle 0x49df0cc0
2025-10-25 13:16:46.031 12535-12535 com.google...amples.rag com.google.ai.edge.samples.rag I vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:1645: remote_handle64_close: closed handle 0xb400007573fa9210 remote handle 0xed3b00, num of open handles: 0
2025-10-25 13:16:46.031 12535-12535 [Qnn ExecuTorch] com.google.ai.edge.samples.rag E QnnDsp <E> Failed to create transport for device, error: 4000
2025-10-25 13:16:46.031 12535-12535 [Qnn ExecuTorch] com.google.ai.edge.samples.rag E QnnDsp <E> Failed to load skel, error: 4000
2025-10-25 13:16:46.031 12535-12535 [Qnn ExecuTorch] com.google.ai.edge.samples.rag E QnnDsp <E> Transport layer setup failed: 14001

Versions

OS: Ubuntu 22.04.5 LTS (x86_64)

cc @cccclai @winskuo-quic @shewu-quic @haowhsu-quic @DannyYuyang-quic @cbilgin

nambn007 · Oct 25 '25

Are you using your own JNI layer? We actually have an example Android demo app that works with the runner: https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android/LlamaDemo

cccclai · Oct 28 '25

@cccclai I followed the LlamaDemo tutorial you linked to build the QNN backend demo. I managed to build the AAR library and the Android APK, but the app failed to load the model with Error Code 18 (https://github.com/pytorch/executorch/issues/6095).

Because my PC only has 64 GB of RAM, building Meta-Llama-3-8B-Instruct failed (it requires 100+ GB of RAM), so I built Llama-3.2-1B and Llama-Guard-3-1B to run the Android demo app. Both failed with Error Code 18.

Do you have any ideas?

Thanks,

luffy-yu · Nov 07 '25

Maybe let's try it step by step:

  1. Are you able to run a simple model on your phone? https://github.com/pytorch/executorch/tree/main/examples/qualcomm#simple-examples-to-verify-the-backend-is-working
  2. Are you able to run the llama model via adb? https://github.com/pytorch/executorch/blob/main/examples/qualcomm/oss_scripts/llama/qnn_llama_runner.cpp
  3. If you are using the llm app, I do have a small PR that helps fix the issue (https://github.com/pytorch/executorch/pull/15258), and I was able to see the result with the app.

This will help us narrow down the issue. There is an uploaded Qwen3 0.6B model compiled for SM8750 (https://huggingface.co/flyingchen/Qwen3-0.6B). If you are using SM8750, maybe try it out; for other SoCs, this model won't work.
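
If you're not sure which SoC a device actually has, you can query it over adb before trying a precompiled model; a minimal sketch (ro.soc.model is only populated on newer Android releases, so it may come back empty on older builds):

# Query the SoC model, e.g. SM8450 or SM8750 (populated on Android 12+).
adb shell getprop ro.soc.model
# Fallback for older builds: the platform codename, e.g. taro or kalama.
adb shell getprop ro.board.platform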

cccclai · Nov 08 '25

@cccclai Thank you for your reply.

I am trying to make it work on Quest 3, not a regular Android phone.

Since Quest 3 uses the Snapdragon® XR2 Gen 2 (SXR2230P), and SXR2230P is listed as a supported SoC, it should work.

For Step 1, it failed. Here is what I did.

# Export a simple model
python export_example.py -m add -g --soc SXR2230P -q ptq

# Push to device (sketch: the runner binary, QNN libraries, and add.pte all go to ${DEVICE_DIR})
adb push add.pte ${DEVICE_DIR}

# Run

adb shell "cd ${DEVICE_DIR} \
           && export LD_LIBRARY_PATH=${DEVICE_DIR} \
           && export ADSP_LIBRARY_PATH=${DEVICE_DIR} \
           && ./qnn_executor_runner --model_path ./add.pte"

I 00:00:00.000554 executorch:qnn_executor_runner.cpp:232] Model file ./add.pte is loaded.
I 00:00:00.000573 executorch:qnn_executor_runner.cpp:242] Using method forward
I 00:00:00.000577 executorch:qnn_executor_runner.cpp:289] Setting up planned buffer 0, size 48.
[INFO] [Qnn ExecuTorch]: Deserializing processed data using QnnContextCustomProtocol
[INFO] [Qnn ExecuTorch]: create QNN Logger with log_level 1
[INFO] [Qnn ExecuTorch]: Initialize Qnn backend parameters for Qnn executorch backend type 2
[INFO] [Qnn ExecuTorch]: Caching: Caching is in RESTORE MODE.
[INFO] [Qnn ExecuTorch]: QnnContextCustomProtocol expected magic number: 0x5678abcd but get: 0x2000000
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> createUnsignedPD unsigned PD or DSPRPC_GET_DSP_INFO not supported by HTP

[ERROR] [Qnn ExecuTorch]: QnnDsp <E> DspTransport.createUnsignedPD failed, 0x00000003

[ERROR] [Qnn ExecuTorch]: QnnDsp <E> IDspTransport: Unknown rpc status 0x00000003

[ERROR] [Qnn ExecuTorch]: QnnDsp <E> DspTransport.getHandle failed, error 0xffffffff

[ERROR] [Qnn ExecuTorch]: QnnDsp <E> createDspTransportInstance failed to config transport object

[ERROR] [Qnn ExecuTorch]: QnnDsp <E> error in creation of transport instance

[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to create transport for device, error: 1002

[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to load skel, error: 1002

[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Transport layer setup failed: 14001

[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to parse default platform info: 14001

[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to load default platform info: 14001

[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to parse platform config: 14001

[ERROR] [Qnn ExecuTorch]: Failed to create device_handle for Backend ID 6, error=14001
E 00:00:00.009905 executorch:QnnManager.cpp:340] Fail to configure Qnn device
E 00:00:00.009913 executorch:QnnExecuTorchBackend.cpp:95] Fail to initialize Qnn Manager
E 00:00:00.009918 executorch:method.cpp:114] Init failed for backend QnnBackend: 0x1
F 00:00:00.009935 executorch:qnn_executor_runner.cpp:312] In function main(), assert failed (method.ok()): Loading of method forward failed with status 0x1
Aborted 

luffy-yu · Nov 10 '25

Hi everyone,

I found the root cause of the issue. I was using a different version of the QNN library than the one available on the device. To fix this, I set both ADSP_LIBRARY_PATH and LD_LIBRARY_PATH to point to the QNN library that I used for model conversion and execution.
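
For anyone hitting the same mismatch, a minimal sketch of that setup (assumptions: ${QNN_SDK_ROOT} points at the same QNN/QAIRT SDK that was used for export, ${DEVICE_DIR} is the working directory on the device, hexagon-v73/V73 matches your SoC's HTP architecture, and model.pte is a placeholder):

# Push the QNN libraries from the same SDK that exported the model.
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtp.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV73Stub.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/hexagon-v73/unsigned/libQnnHtpV73Skel.so ${DEVICE_DIR}

# Make both the dynamic linker and the DSP loader resolve these copies first.
adb shell "cd ${DEVICE_DIR} \
           && export LD_LIBRARY_PATH=${DEVICE_DIR} \
           && export ADSP_LIBRARY_PATH=${DEVICE_DIR} \
           && ./qnn_executor_runner --model_path ./model.pte"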

nambn007 · Nov 11 '25

@cccclai Thank you for your hints. I got an Android phone with an SM8450 SoC and tested it.

1 - Works

~/Documents/Code/executorch-examples/llm/android$ adb shell "cd ${DEVICE_DIR} \
           && export LD_LIBRARY_PATH=${DEVICE_DIR} \
           && export ADSP_LIBRARY_PATH=${DEVICE_DIR} \
           && ./qnn_executor_runner --model_path ./add.pte"
I 00:00:00.002972 executorch:qnn_executor_runner.cpp:232] Model file ./add.pte is loaded.
I 00:00:00.003013 executorch:qnn_executor_runner.cpp:242] Using method forward
I 00:00:00.003018 executorch:qnn_executor_runner.cpp:289] Setting up planned buffer 0, size 48.
[INFO] [Qnn ExecuTorch]: Deserializing processed data using QnnContextCustomProtocol
[INFO] [Qnn ExecuTorch]: create QNN Logger with log_level 1
[INFO] [Qnn ExecuTorch]: Initialize Qnn backend parameters for Qnn executorch backend type 2
[INFO] [Qnn ExecuTorch]: Caching: Caching is in RESTORE MODE.
[INFO] [Qnn ExecuTorch]: QnnContextCustomProtocol expected magic number: 0x5678abcd but get: 0x2000000
[INFO] [Qnn ExecuTorch]: Running level=1 optimization.
I 00:00:00.276934 executorch:qnn_executor_runner.cpp:313] Method loaded.
E 00:00:00.277097 executorch:method.cpp:1274] Output 0 is memory planned, or is a constant. Cannot override the existing data pointer.
I 00:00:00.277271 executorch:qnn_executor_runner.cpp:373] ignoring error from set_output_data_ptr(): 0x2
I 00:00:00.277410 executorch:qnn_executor_runner.cpp:376] Inputs prepared.
I 00:00:00.277735 executorch:qnn_executor_runner.cpp:570] Input list not provided. Inputs prepared with default values set.
I 00:00:00.278482 executorch:qnn_executor_runner.cpp:579] Model executed successfully.
I 00:00:00.278604 executorch:qnn_executor_runner.cpp:582] Perform 0 inferences for warming up
I 00:00:00.279030 executorch:qnn_executor_runner.cpp:604] 1 inferences took 0.249000 ms, avg 0.249000 ms
I 00:00:00.279238 executorch:qnn_executor_runner.cpp:615] Write etdump to etdump.etdp, Size = 576
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend

2.1 - LLAMA2 Works

~/Documents/Code/executorch$ python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint stories110M.pt --params params.json --tokenizer_model tokenizer.model --tokenizer_bin tokenizer.bin --decoder_model stories110m --model_mode hybrid --prefill_ar_len 32 --max_seq_len 128 --prompt "Once upon a time"
I tokenizers:regex.cpp:27] Registering override fallback regex
NOTE: Using slow Hadamard transform for SpinQuant. For better performance on GPU, install `fast_hadamard_transform`: `pip install git+https://github.com/Dao-AILab/fast-hadamard-transform.git`
QNN_SDK_ROOT=/home/n10288/Android/qairt/2.37.0.250724
INFO:root:*** stories110m ***
+-----------------------------+---------------------------------------------------------------------------+
| Config                      | Value                                                                     |
+=============================+===========================================================================+
| custom_annotation           | (annotate_kv_8bit, annotate_output_16a8w, partial(annotate_qkv_proj_sha)) |
+-----------------------------+---------------------------------------------------------------------------+
| decoder_model_version       | llama2                                                                    |
+-----------------------------+---------------------------------------------------------------------------+
| get_kv_io_bit_width         | 8                                                                         |
+-----------------------------+---------------------------------------------------------------------------+
| get_logits_output_bit_width | 16                                                                        |
+-----------------------------+---------------------------------------------------------------------------+
| instruct_model              | False                                                                     |
+-----------------------------+---------------------------------------------------------------------------+
| masked_softmax              | False                                                                     |
+-----------------------------+---------------------------------------------------------------------------+
| num_sharding                | 1                                                                         |
+-----------------------------+---------------------------------------------------------------------------+
| ptq                         | QuantDtype.use_16a4w                                                      |
+-----------------------------+---------------------------------------------------------------------------+
| r1                          | False                                                                     |
+-----------------------------+---------------------------------------------------------------------------+
| r2                          | False                                                                     |
+-----------------------------+---------------------------------------------------------------------------+
| r3                          | False                                                                     |
+-----------------------------+---------------------------------------------------------------------------+
| seq_mse_candidates          | 0                                                                         |
+-----------------------------+---------------------------------------------------------------------------+
| transform_weight            | True                                                                      |
+-----------------------------+---------------------------------------------------------------------------+
INFO:root:#words: 32000 - BOS ID: 1 - EOS ID: 2
INFO:root:Time for loading checkpoint: 0.162306547164917
INFO:root:Quantizing the model...
INFO:root:kv inference result:
Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, red apple on a tree. She wanted to eat it, but it was too high up.
Lily asked her friend, a little bird, "Can you help me get the apple?"
The bird said, "Sure, I can fly up and get it for you."
The bird flew up to the apple and pecked it off the tree. Lily was so happy and took a big bite. But then, she saw a
INFO:root:Quantizing the model...
INFO:root:kv inference result:
Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, red apple on a tree. She wanted to eat it, but it was too high up.
Lily asked her friend, a little bird, "Can you help me get the apple?"
The bird said, "Sure, I can fly up and get it for you."
The bird flew up to the apple and pecked it off the tree. Lily was so happy and took a big bite. But then, she saw a
INFO:root:Time for quantizing: 161.5448431968689
WARNING:root:Op aten.unbind.int was requested for preservation by partitioner.  This request is ignored because it is in a blocklist.
WARNING:root:Op aten.unbind.int was requested for preservation by partitioner.  This request is ignored because it is in a blocklist.
WARNING:root:Op aten.unbind.int was requested for preservation by partitioner.  This request is ignored because it is in a blocklist.
WARNING:root:Op aten.unbind.int was requested for preservation by partitioner.  This request is ignored because it is in a blocklist.

2.2 - LLAMA3.2 1B Instruct (Should) Work

The run was terminated manually with Ctrl + C.

~/Documents/Code/executorch$ python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint ${MODEL_DIR}/consolidated.00.pth --params ${MODEL_DIR}/params.json --tokenizer_model ${MODEL_DIR}/tokenizer.model --decoder_model llama3_2-1b_instruct --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1
I tokenizers:regex.cpp:27] Registering override fallback regex
NOTE: Using slow Hadamard transform for SpinQuant. For better performance on GPU, install `fast_hadamard_transform`: `pip install git+https://github.com/Dao-AILab/fast-hadamard-transform.git`
QNN_SDK_ROOT=/home/n10288/Android/qairt/2.37.0.250724
INFO:root:*** llama3_2-1b_instruct ***
+-----------------------------+------------------------------------------------------------------------+
| Config                      | Value                                                                  |
+=============================+========================================================================+
| custom_annotation           | (annotate_kv_8bit, annotate_output_16a8w, partial(annotate_down_proj)) |
+-----------------------------+------------------------------------------------------------------------+
| decoder_model_version       | llama3                                                                 |
+-----------------------------+------------------------------------------------------------------------+
| get_kv_io_bit_width         | 8                                                                      |
+-----------------------------+------------------------------------------------------------------------+
| get_logits_output_bit_width | 16                                                                     |
+-----------------------------+------------------------------------------------------------------------+
| group_size                  | 32                                                                     |
+-----------------------------+------------------------------------------------------------------------+
| instruct_model              | False                                                                  |
+-----------------------------+------------------------------------------------------------------------+
| masked_softmax              | False                                                                  |
+-----------------------------+------------------------------------------------------------------------+
| num_sharding                | 1                                                                      |
+-----------------------------+------------------------------------------------------------------------+
| ptq                         | QuantDtype.use_16a4w_block                                             |
+-----------------------------+------------------------------------------------------------------------+
| r1                          | False                                                                  |
+-----------------------------+------------------------------------------------------------------------+
| r2                          | False                                                                  |
+-----------------------------+------------------------------------------------------------------------+
| r3                          | False                                                                  |
+-----------------------------+------------------------------------------------------------------------+
| seq_mse_candidates          | 1000                                                                   |
+-----------------------------+------------------------------------------------------------------------+
| transform_weight            | True                                                                   |
+-----------------------------+------------------------------------------------------------------------+
Using Tiktokenizer
INFO:root:Time for loading checkpoint: 0.027213096618652344
INFO:root:Quantizing the model...
INFO:lm-eval:Using device 'cpu'
config.json: 100%|██████████| 665/665 [00:00<00:00, 6.77MB/s]
INFO:lm-eval:Using model type 'default'
tokenizer_config.json: 100%|██████████| 26.0/26.0 [00:00<00:00, 262kB/s]
vocab.json: 100%|██████████| 1.04M/1.04M [00:00<00:00, 34.7MB/s]
merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 91.5MB/s]
tokenizer.json: 100%|██████████| 1.36M/1.36M [00:00<00:00, 36.6MB/s]
INFO:lm-eval:Model parallel was set to False, max memory was not set, and device map was set to {'': 'cpu'}
`torch_dtype` is deprecated! Use `dtype` instead!
model.safetensors: 100%|██████████| 548M/548M [00:04<00:00, 125MB/s]
generation_config.json: 100%|██████████| 124/124 [00:00<00:00, 1.47MB/s]
INFO:lm-eval:Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
INFO:lm-eval:Using pre-initialized model
INFO:lm-eval:The tag 'arc_ca' is already registered as a group, this tag will not be registered. This may affect tasks you want to call.
INFO:lm-eval:The tag 'arc_ca' is already registered as a group, this tag will not be registered. This may affect tasks you want to call.
WARNING:lm-eval:[Task: wikitext] metric word_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
WARNING:lm-eval:[Task: wikitext] metric word_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
WARNING:lm-eval:[Task: wikitext] metric byte_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
WARNING:lm-eval:[Task: wikitext] metric byte_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
WARNING:lm-eval:[Task: wikitext] metric bits_per_byte is defined, but aggregation is not. using default aggregation=bits_per_byte
WARNING:lm-eval:[Task: wikitext] metric bits_per_byte is defined, but higher_is_better is not. using default higher_is_better=False
README.md: 8.76kB [00:00, 51.4MB/s]
wikitext-2-raw-v1/wikitext-2-raw-v1-trai(…): 100%|██████████| 6.18M/6.18M [00:00<00:00, 26.7MB/s]
wikitext-2-raw-v1/wikitext-2-raw-v1-vali(…): 100%|██████████| 641k/641k [00:00<00:00, 2.68MB/s]
wikitext-2-raw-v1/wikitext-2-raw-v1-test(…): 100%|██████████| 715k/715k [00:00<00:00, 6.17MB/s]
Generating train split: 100%|██████████| 629/629 [00:00<00:00, 16437.39 examples/s]
Generating validation split: 100%|██████████| 60/60 [00:00<00:00, 19178.34 examples/s]
Generating test split: 100%|██████████| 62/62 [00:00<00:00, 21003.70 examples/s]
INFO:lm-eval:Building contexts for wikitext on rank 0...
100%|██████████| 1/1 [00:00<00:00, 3398.95it/s]

3 - Still Get Error Code 18

I integrated PR #15258 and built the Android app. It still failed to load the QNN-backend model (tested with Llama-3.2-1B and Llama-Guard-3-1B). These models were exported with the following command.

python -m extension.llm.export.export_llm base.checkpoint="${MODEL_DIR}/consolidated.00.pth" base.params="${MODEL_DIR}/params.json" model.use_kv_cache=True model.enable_dynamic_shape=False backend.qnn.enabled=True quantization.pt2e_quantize="qnn_16a4w" model.dtype_override="fp32" base.metadata='"{\"get_bos_id\":128000, \"get_eos_ids\":[128009, 128001]}"' export.output_name="test.pte"
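
One way to separate an app/JNI-layer failure from a problem in the exported .pte itself is to first load the file with the generic runner over adb; a minimal sketch, assuming qnn_executor_runner and the QNN libraries are already in ${DEVICE_DIR} (this only exercises program load and QNN delegate init, not a meaningful decode):

adb push test.pte ${DEVICE_DIR}
adb shell "cd ${DEVICE_DIR} \
           && export LD_LIBRARY_PATH=${DEVICE_DIR} \
           && export ADSP_LIBRARY_PATH=${DEVICE_DIR} \
           && ./qnn_executor_runner --model_path ./test.pte"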

luffy-yu · Nov 14 '25

Glad you got some models working! I think the second one is a stories model, not a Llama model. For the Llama 1B model, can you try the adb binary and share more of the error message? The command will be something like the following; make sure LD_LIBRARY_PATH and ADSP_LIBRARY_PATH are set up correctly:

 adb shell "cd /data/local/tmp/llama && ./qnn_llama_runner --decoder_model_version qwen3 --tokenizer_path tokenizer.json --model_path model_SM8750.pte --tokenizer_path tokenizer.json --prompt 'who are you' --seq_len 512 --kv_updater SmartMask --eval_mode 1 --temperature 0.8 && cat outputs.txt "

cccclai · Nov 15 '25

Oh wait, you were generating the model with this command:

python -m extension.llm.export.export_llm base.checkpoint="${MODEL_DIR}/consolidated.00.pth" base.params="${MODEL_DIR}/params.json" model.use_kv_cache=True model.enable_dynamic_shape=False backend.qnn.enabled=True quantization.pt2e_quantize="qnn_16a4w" model.dtype_override="fp32" base.metadata='"{\"get_bos_id\":128000, \"get_eos_ids\":[128009, 128001]}"' export.output_name="test.pte"

I feel like this path is somewhat outdated and not performant; examples/qualcomm/oss_scripts/llama/llama.py should provide better results.

cccclai · Nov 15 '25

@cccclai Thank you for your help.

  • I tried qnn_llama_runner. It didn't work as expected because of a SoC mismatch (the model was exported for SM8750, but the device is an SM8450).

Here is the command.

~/Documents/Code/Qwen3-0.6B$ adb shell "cd ${DEVICE_DIR} && ./qnn_llama_runner --decoder_model_version qwen3 --tokenizer_path tokenizer.json --model_path model_sm8750.pte --tokenizer_path tokenizer.json --prompt 'who are you' --seq_len 512 --kv_updater SmartMask --eval_mode 1 --temperature 0.8 && cat outputs.txt "
I tokenizers:regex.cpp:27] Registering override fallback regex
I 00:00:00.000635 executorch:runner.cpp:146] creating module: model_path=model_sm8750.pte
I 00:00:00.000682 executorch:runner.cpp:147] creating runner: tokenizer_path=tokenizer.json
I 00:00:00.000708 executorch:runner.cpp:148] eval mode=1
I 00:00:00.000737 executorch:runner.cpp:149] kv updater=SmartMask
I tokenizers:hf_tokenizer.cpp:142] Setting up normalizer...
I tokenizers:normalizer.cpp:89] Using NFC normalizer. Please notice that our implementation may not handle all edge cases.
I tokenizers:hf_tokenizer.cpp:146] Normalizer set up
I tokenizers:hf_tokenizer.cpp:160] Setting up pretokenizer...
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1763482869.307762   25134 re2.cc:237] Error parsing '((?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s...': invalid perl operator: (?!
I tokenizers:re2_regex.cpp:27] Re2 failed to compile regex: ((?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+), error: invalid perl operator: (?!
This may be ok if a fallback regex is used.
I tokenizers:regex_lookahead.cpp:27] Creating PCRE2 regex
I tokenizers:hf_tokenizer.cpp:164] Pretokenizer set up
I tokenizers:hf_tokenizer.cpp:180] Loading BPE merges...
I tokenizers:hf_tokenizer.cpp:218] Loaded 151387 BPE merge rules
I tokenizers:hf_tokenizer.cpp:230] Built merge ranks map with 151387 entries
I 00:00:00.817653 executorch:llm_runner_helper.cpp:54] Loaded json tokenizer
I tokenizers:hf_tokenizer.cpp:393] normalized input: '' -> ''
[INFO] [Qnn ExecuTorch]: Deserializing processed data using QnnContextCustomProtocol
[INFO] [Qnn ExecuTorch]: create QNN Logger with log_level 1
[INFO] [Qnn ExecuTorch]: Initialize Qnn backend parameters for Qnn executorch backend type 2
[INFO] [Qnn ExecuTorch]: Caching: Caching is in RESTORE MODE.
[INFO] [Qnn ExecuTorch]: QnnContextCustomProtocol expected magic number: 0x5678abcd but get: 0x2000000
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Request feature arch with value 79 unsupported

[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to register context to device and backend

[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to create context from binary with err 0x138d

[ERROR] [Qnn ExecuTorch]: Can't create context from binary. Error 5005.
E 00:00:00.957879 executorch:QnnManager.cpp:344] Fail to configure Qnn context
E 00:00:00.957885 executorch:QnnExecuTorchBackend.cpp:95] Fail to initialize Qnn Manager
E 00:00:00.957889 executorch:method.cpp:114] Init failed for backend QnnBackend: 0x1
F 00:00:00.958077 executorch:result.h:170] In function CheckOk(), assert failed: hasValue_
Aborted 
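
As an aside, one quick way to confirm the device SoC before picking the -m target (a sketch using standard Android system properties; ro.soc.model requires Android 12 or newer):

adb shell getprop ro.soc.model        # e.g. SM8450
adb shell getprop ro.board.platform   # e.g. taro (SM8450), kalama (SM8550)
# Rough Hexagon arch mapping, consistent with the logs above:
#   SM8450 -> v69, SM8550 -> v73, SM8650 -> v75, SM8750 -> v79
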
~/Documents/Code/executorch$ python examples/qualcomm/oss_scripts/llama/llama.py \
    -b build-android \
    -s ${SERIAL_NUM} \
    -m ${SOC_MODEL} \
    --temperature 0 \
    --model_mode hybrid \
    --max_seq_len 1024 \
    --prefill_ar_len 128 \
    --decoder_model qwen3-0_6b \
    --prompt "I would like to learn python, could you teach me with a simple example?" \
    --tasks wikitext \
    --limit 1
I tokenizers:regex.cpp:27] Registering override fallback regex
NOTE: Using slow Hadamard transform for SpinQuant. For better performance on GPU, install `fast_hadamard_transform`: `pip install git+https://github.com/Dao-AILab/fast-hadamard-transform.git`
QNN_SDK_ROOT=/home/n10288/Android/qairt/2.37.0.250724
INFO:root:*** qwen3-0_6b ***
+-----------------------------+-------------------------------+
| Config                      | Value                         |
+=============================+===============================+
| custom_annotation           | (partial(annotate_down_proj)) |
+-----------------------------+-------------------------------+
| decoder_model_version       | qwen3                         |
+-----------------------------+-------------------------------+
| get_kv_io_bit_width         | 16                            |
+-----------------------------+-------------------------------+
| get_logits_output_bit_width | 16                            |
+-----------------------------+-------------------------------+
| group_size                  | 32                            |
+-----------------------------+-------------------------------+
| instruct_model              | True                          |
+-----------------------------+-------------------------------+
| masked_softmax              | True                          |
+-----------------------------+-------------------------------+
| num_sharding                | 1                             |
+-----------------------------+-------------------------------+
| ptq                         | QuantDtype.use_16a4w_block    |
+-----------------------------+-------------------------------+
| r1                          | False                         |
+-----------------------------+-------------------------------+
| r2                          | False                         |
+-----------------------------+-------------------------------+
| r3                          | False                         |
+-----------------------------+-------------------------------+
| repo_id                     | Qwen/Qwen3-0.6B               |
+-----------------------------+-------------------------------+
| seq_mse_candidates          | 1000                          |
+-----------------------------+-------------------------------+
| transform_weight            | False                         |
+-----------------------------+-------------------------------+
โœ” Using cached converted model: /home/n10288/.cache/meta_checkpoints/Qwen_Qwen3-0.6B.pth
INFO:root:Time for loading checkpoint: 0.09171867370605469

(After a long while)

[QNN Partitioner Op Support]: aten.view_copy.default | True
[ERROR] [Qnn ExecuTorch]:  <E> [4294967295] has incorrect Value 69, expected >= 73.

[ERROR] [Qnn ExecuTorch]:  <E> QnnBackend_validateOpConfig failed 3110

[ERROR] [Qnn ExecuTorch]:  <E> Failed to validate op aten_matmul_default_356 with error 0xc26

[WARNING] [Qnn ExecuTorch]: Qnn Backend op validation failed with error: 3110
[QNN Partitioner Op Support]: aten.matmul.default | False
[ERROR] [Qnn ExecuTorch]:  <E> [4294967295] has incorrect Value 69, expected >= 73.

[ERROR] [Qnn ExecuTorch]:  <E> QnnBackend_validateOpConfig failed 3110

[ERROR] [Qnn ExecuTorch]:  <E> Failed to validate op aten_matmul_default_355 with error 0xc26

[WARNING] [Qnn ExecuTorch]: Qnn Backend op validation failed with error: 3110
[QNN Partitioner Op Support]: aten.matmul.default | False

From this log, the failing ops require at least an SM8550 SoC (Hexagon Tensor Processor v73), while my SM8450 is v69. Reference

I think SM8450 should be supported according to backends-qualcomm.md.

luffy-yu avatar Nov 18 '25 18:11 luffy-yu

Oh, I think SM8450 lacks support for weight sharing and block-wise quantization (cc @haowhsu-quic, correct me if I'm wrong). If you want to target SM8450, you may need to use a different quantization config, which might hurt performance or accuracy a bit.
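
For example, a sketch of switching the scheme, assuming llama.py exposes a --ptq option for the quantization config (check the script's --help for the exact flag name and accepted values; a plain 16a4w scheme would avoid the block-wise quantization that SM8450 rejects):

python examples/qualcomm/oss_scripts/llama/llama.py \
    -b build-android \
    -m SM8450 \
    --compile_only \
    --decoder_model qwen3-0_6b \
    --ptq 16a4w \
    --prompt "dummy" \
    --model_mode kv \
    --max_seq_len 1024 \
    --prefill_ar_len 128 \
    --artifact ./qwen3_06b_sm8450_pte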

cccclai avatar Nov 18 '25 19:11 cccclai

Oh, I think SM8450 lacks support for weight sharing and block-wise quantization (cc @haowhsu-quic, correct me if I'm wrong). If you want to target SM8450, you may need to use a different quantization config, which might hurt performance or accuracy a bit.

@cccclai Thank you. I would like to know how to apply other quantization configs.

The following is the command currently in use. It keeps running even though some validation errors appear.

python examples/qualcomm/oss_scripts/llama/llama.py \
    -b build-android \
    -m SM8450 \
    --compile_only \
    --decoder_model qwen3-0_6b \
    --prompt "dummy" \
    --model_mode kv \
    --max_seq_len 1024 \
    --prefill_ar_len 128 \
    --temperature 0 \
    --dtype-override fp32 \
    --range_setting minmax \
    --artifact ./qwen3_06b_sm8450_pte

[ERROR] [Qnn ExecuTorch]:  <E> [4294967295] has incorrect Value 69, expected >= 73.

[ERROR] [Qnn ExecuTorch]:  <E> QnnBackend_validateOpConfig failed 3110

[ERROR] [Qnn ExecuTorch]:  <E> Failed to validate op aten_matmul_default_658 with error 0xc26

[WARNING] [Qnn ExecuTorch]: Qnn Backend op validation failed with error: 3110
[QNN Partitioner Op Support]: aten.matmul.default | False
[ERROR] [Qnn ExecuTorch]:  <E> [4294967295] has incorrect Value 69, expected >= 73.

[ERROR] [Qnn ExecuTorch]:  <E> QnnBackend_validateOpConfig failed 3110

[ERROR] [Qnn ExecuTorch]:  <E> Failed to validate op aten_matmul_default_657 with error 0xc26

[WARNING] [Qnn ExecuTorch]: Qnn Backend op validation failed with error: 3110

luffy-yu avatar Nov 18 '25 21:11 luffy-yu

Maybe follow the instructions in https://github.com/pytorch/executorch/issues/15410

cccclai avatar Nov 18 '25 21:11 cccclai

@cccclai Thank you. After switching to an SM8550 Android phone, I can now run Qwen3-0.6B via qnn_llama_runner.

BTW, I checked #15410 but didn't follow it, and it still worked.

# Export
python examples/qualcomm/oss_scripts/llama/llama.py \
    -b build-android \
    -m SM8550 \
    --compile_only \
    --decoder_model qwen3-0_6b \
    --prompt "dummy" \
    --model_mode hybrid \
    --max_seq_len 1024 \
    --prefill_ar_len 128 \
    --temperature 0 \
    --dtype-override fp32 \
    --range_setting minmax \
    --artifact ./qwen3_06b_sm8550_hybrid
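
Between the export and the run there is an implicit push step; a sketch under assumed paths (the runner location inside build-android and the QNN SDK library layout vary by build configuration and SDK version):

# Hypothetical paths; adjust to your build tree and QNN SDK
adb shell mkdir -p ${DEVICE_DIR}
adb push qwen3_06b_sm8550_hybrid/model_SM8550.pte ${DEVICE_DIR}
adb push qwen3_06b_sm8550_hybrid/tokenizer.json ${DEVICE_DIR}
adb push build-android/examples/qualcomm/oss_scripts/llama/qnn_llama_runner ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtp.so ${DEVICE_DIR}
# ...plus libQnnSystem.so and the HTP stub/skel libraries for your Hexagon arch (v73 for SM8550)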

# Run
(executorch) n10288@N10288:~/Documents/Code/executorch/qwen3_06b_sm8550_hybrid$ adb shell "cd ${DEVICE_DIR} && ./qnn_llama_runner --decoder_model_version qwen3 --tokenizer_path tokenizer.json --model_path model_SM8550.pte --prompt 'who are you' --seq_len 512 --kv_updater SmartMask --eval_mode 1 --temperature 0.8 && cat outputs.txt "
I tokenizers:regex.cpp:27] Registering override fallback regex
I 00:00:00.002984 executorch:runner.cpp:146] creating module: model_path=model_SM8550.pte
I 00:00:00.003114 executorch:runner.cpp:147] creating runner: tokenizer_path=tokenizer.json
I 00:00:00.003177 executorch:runner.cpp:148] eval mode=1
I 00:00:00.003228 executorch:runner.cpp:149] kv updater=SmartMask
I tokenizers:hf_tokenizer.cpp:142] Setting up normalizer...
I tokenizers:normalizer.cpp:89] Using NFC normalizer. Please notice that our implementation may not handle all edge cases.
I tokenizers:hf_tokenizer.cpp:146] Normalizer set up
I tokenizers:hf_tokenizer.cpp:160] Setting up pretokenizer...
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1763919887.536196   25786 re2.cc:237] Error parsing '((?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s...': invalid perl operator: (?!
I tokenizers:re2_regex.cpp:27] Re2 failed to compile regex: ((?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+), error: invalid perl operator: (?!
This may be ok if a fallback regex is used.
I tokenizers:regex_lookahead.cpp:27] Creating PCRE2 regex
I tokenizers:hf_tokenizer.cpp:164] Pretokenizer set up
I tokenizers:hf_tokenizer.cpp:180] Loading BPE merges...
I tokenizers:hf_tokenizer.cpp:218] Loaded 151387 BPE merge rules
I tokenizers:hf_tokenizer.cpp:230] Built merge ranks map with 151387 entries
I 00:00:00.628132 executorch:llm_runner_helper.cpp:54] Loaded json tokenizer
I tokenizers:hf_tokenizer.cpp:393] normalized input: '' -> ''
[INFO] [Qnn ExecuTorch]: Deserializing processed data using QnnContextCustomProtocol
[INFO] [Qnn ExecuTorch]: create QNN Logger with log_level 1
[INFO] [Qnn ExecuTorch]: Initialize Qnn backend parameters for Qnn executorch backend type 2
[INFO] [Qnn ExecuTorch]: Caching: Caching is in RESTORE MODE.
[INFO] [Qnn ExecuTorch]: QnnContextCustomProtocol expected magic number: 0x5678abcd but get: 0x2000000
[INFO] [Qnn ExecuTorch]: Running level=1 optimization.
[INFO] [Qnn ExecuTorch]: Running level=1 optimization.
[INFO] [Qnn ExecuTorch]: Deserializing processed data using QnnContextCustomProtocol
[INFO] [Qnn ExecuTorch]: Use cached delegate handle for current method: kv_forward
I 00:00:01.909229 executorch:runner.cpp:226] Reading metadata from model
I 00:00:01.999148 executorch:runner.cpp:342] creating io_memory
I tokenizers:hf_tokenizer.cpp:393] normalized input: '' -> ''
I tokenizers:hf_tokenizer.cpp:393] normalized input: 'user
who are you' -> 'user
who are you'
I tokenizers:hf_tokenizer.cpp:393] normalized input: '
' -> '
'
I tokenizers:hf_tokenizer.cpp:393] normalized input: 'assistant' -> 'assistant'
I 00:00:02.003661 executorch:prompt_processor.cpp:265] Prompt Processor: total 11 prompt tokens (AR-128 * 1 iters)
I 00:00:02.124552 executorch:runner.cpp:438] RSS after prompt prefill: 714.371094 MiB (0 if unsupported)
I 00:00:05.648229 executorch:token_generator.cpp:322] 
Reached to the end of generation
I 00:00:05.648417 executorch:runner.cpp:448] RSS after finishing text generation: 714.371094 MiB (0 if unsupported)
I 00:00:05.648480 executorch:stats.h:108] 	Prompt Tokens: 11    Generated Tokens: 139
I 00:00:05.648508 executorch:stats.h:114] 	Model Load Time:		1.997000 (seconds)
I 00:00:05.648528 executorch:stats.h:124] 	Total inference time:		3.647000 (seconds)		 Rate: 	38.113518 (tokens/second)
I 00:00:05.648547 executorch:stats.h:132] 		Prompt evaluation:	0.123000 (seconds)		 Rate: 	89.430894 (tokens/second)
I 00:00:05.648565 executorch:stats.h:143] 		Generated 139 tokens:	3.524000 (seconds)		 Rate: 	39.443814 (tokens/second)
I 00:00:05.648584 executorch:stats.h:151] 	Time to first generated token:	0.123000 (seconds)
I 00:00:05.648601 executorch:stats.h:158] 	Sampling time over 150 tokens:	0.227000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend

PyTorchObserver {"prompt_tokens":11,"generated_tokens":139,"model_load_start_ms":1763919887194,"model_load_end_ms":1763919889191,"inference_start_ms":1763919889192,"inference_end_ms":1763919892839,"prompt_eval_end_ms":1763919889315,"first_token_ms":1763919889315,"aggregate_sampling_time_ms":227,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
<|im_start|>user
who are you<|im_end|>
<|im_start|>assistant:
<think>
Okay, the user is asking "who are you". I need to respond appropriately. Let me think about how to structure this.

First, I should acknowledge their question. It's important to be friendly and open. I can say something like, "Hi! I'm here to help you with your questions. How can I assist you today?"

Then, I can invite them to ask me anything. Maybe add a friendly emoji to keep the conversation going. Let me check the response again to make sure it's clear and friendly. Alright, that should cover it.
</think>

Hi! I'm here to help you with your questions. How can I assist you today? ๐Ÿ˜Š<|im_end|>

However, it did not work when exporting in kv mode: the model exports successfully but fails to run.

# Export
python examples/qualcomm/oss_scripts/llama/llama.py \
    -b build-android \
    -m SM8550 \
    --compile_only \
    --decoder_model qwen3-0_6b \
    --prompt "dummy" \
    --model_mode kv \
    --max_seq_len 1024 \
    --prefill_ar_len 128 \
    --temperature 0 \
    --dtype-override fp32 \
    --range_setting minmax \
    --artifact ./qwen3_06b_sm8550_pte

# Run
(executorch) n10288@N10288:~/Documents/Code/executorch/qwen3_06b_sm8550_pte$ adb shell "cd ${DEVICE_DIR} && ./qnn_llama_runner --decoder_model_version qwen3 --tokenizer_path tokenizer.json --model_path model_SM8550.pte --prompt 'who are you' --seq_len 512 --kv_updater SmartMask --eval_mode 1 --temperature 0.8 && cat outputs.txt "
I tokenizers:regex.cpp:27] Registering override fallback regex
I 00:00:00.004450 executorch:runner.cpp:146] creating module: model_path=model_SM8550.pte
I 00:00:00.004595 executorch:runner.cpp:147] creating runner: tokenizer_path=tokenizer.json
I 00:00:00.004621 executorch:runner.cpp:148] eval mode=1
I 00:00:00.004634 executorch:runner.cpp:149] kv updater=SmartMask
I tokenizers:hf_tokenizer.cpp:142] Setting up normalizer...
I tokenizers:normalizer.cpp:89] Using NFC normalizer. Please notice that our implementation may not handle all edge cases.
I tokenizers:hf_tokenizer.cpp:146] Normalizer set up
I tokenizers:hf_tokenizer.cpp:160] Setting up pretokenizer...
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1763912319.180404   22336 re2.cc:237] Error parsing '((?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s...': invalid perl operator: (?!
I tokenizers:re2_regex.cpp:27] Re2 failed to compile regex: ((?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+), error: invalid perl operator: (?!
This may be ok if a fallback regex is used.
I tokenizers:regex_lookahead.cpp:27] Creating PCRE2 regex
I tokenizers:hf_tokenizer.cpp:164] Pretokenizer set up
I tokenizers:hf_tokenizer.cpp:180] Loading BPE merges...
I tokenizers:hf_tokenizer.cpp:218] Loaded 151387 BPE merge rules
I tokenizers:hf_tokenizer.cpp:230] Built merge ranks map with 151387 entries
I 00:00:00.616125 executorch:llm_runner_helper.cpp:54] Loaded json tokenizer
I tokenizers:hf_tokenizer.cpp:393] normalized input: '' -> ''
E 00:00:00.616469 executorch:program.cpp:56] No method named 'kv_forward' in program
F 00:00:00.616482 executorch:result.h:170] In function CheckOk(), assert failed: hasValue_
Aborted 
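
A plausible cause (an assumption based on the method-name error above, not verified against the runner source): --eval_mode 1 selects the hybrid prefill+decode path and looks up the prefill_forward/kv_forward methods that only a hybrid export emits, while a kv-mode export names its method differently. If so, a kv-mode model would need the matching eval mode, e.g.:

# Hypothetical: assumes eval_mode 0 selects the KV-cache-only path; check qnn_llama_runner --help
adb shell "cd ${DEVICE_DIR} && ./qnn_llama_runner --decoder_model_version qwen3 --tokenizer_path tokenizer.json --model_path model_SM8550.pte --prompt 'who are you' --seq_len 512 --kv_updater SmartMask --eval_mode 0 --temperature 0.8 && cat outputs.txt"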

luffy-yu avatar Nov 23 '25 17:11 luffy-yu

@cccclai After some effort, I have managed to run the QNN-backend demo on an SM8550 Android device.

I have uploaded it to LlamaDemo-Executorch-QNN.

The updated jni_layer_llama.cpp can be found at jni_layer_llama.cpp.

The necessary changes are summarized in QNN_ANDROID_FIX_SUMMARY.

luffy-yu avatar Nov 23 '25 23:11 luffy-yu

Thank you for the detailed documentation and for getting it working on your own. I actually have some pending PRs that I didn't manage to land: https://github.com/pytorch/executorch/pull/15258. @haowhsu-quic @kirklandsign, can we check out the doc listed above and fix these issues either in our code base or in the docs?

cccclai avatar Nov 24 '25 19:11 cccclai

@cccclai You are welcome. It was not easy to debug; the outdated documentation made it a lot harder.

luffy-yu avatar Nov 24 '25 20:11 luffy-yu

Yeah, we should do a better job of updating the docs and providing a better user experience. @luffy-yu, would you be willing to create a PR to update our readme? I think a lot of the content in https://github.com/luffy-yu/LlamaDemo-Executorch-QNN/blob/master/QNN_ANDROID_FIX_SUMMARY.md can be copied into our docs already, especially the Verification Steps and the Common Issues and Solutions.

For the code changes, they can be submitted as a separate PR

cccclai avatar Nov 24 '25 20:11 cccclai

@cccclai Sure, I can do that later this week. Could you specify which document to update? They are scattered across two projects (this one and executorch-examples) and the website.

luffy-yu avatar Nov 24 '25 20:11 luffy-yu

I think having them on the website would be awesome: https://docs.pytorch.org/executorch/stable/backends-qualcomm.html

cccclai avatar Nov 24 '25 20:11 cccclai

I think having them on the website would be awesome: https://docs.pytorch.org/executorch/stable/backends-qualcomm.html

I agree, but how can I create a PR for the website?

luffy-yu avatar Nov 24 '25 20:11 luffy-yu

The website is a markdown file in the executorch code base: https://github.com/pytorch/executorch/blob/4d366239f473dce680fba477eafb693942d1600d/docs/source/backends-qualcomm.md?plain=1#L4

cccclai avatar Nov 24 '25 20:11 cccclai

The website is a markdown file in the executorch code base: docs/source/backends-qualcomm.md, line 4 at 4d36623 ("build ExecuTorch for Qualcomm AI Engine Direct and running a model on it.").

I see. Thank you for pointing it out. Then, I will create a PR for this file.

luffy-yu avatar Nov 24 '25 20:11 luffy-yu

Maybe follow the instructions in #15410

Tested. This also works for exporting the Qwen3-0.6B model for the SM8450 SoC.

luffy-yu avatar Nov 25 '25 18:11 luffy-yu

@cccclai PR https://github.com/pytorch/executorch/pull/16011 has been submitted.

luffy-yu avatar Dec 01 '25 03:12 luffy-yu