[bug] the NPU backend achieves around 1/3 of CPU performance
Hi @TerryT9, I encountered the same issue and I'm not sure how to resolve it. Could you share how you solved it?
Hi @Gianthard-cyh, would you mind saying which model you were using?
I'm using Llama-3.2-1B-Instruct-f16.gguf. By the way, I was able to run the model successfully after increasing the size limit of the MUL_MAT op and changing the NPU precision option (the Convert op fails to execute without this).
However, as stated in previous issues, the NPU backend achieves around 1/3 of CPU performance. I think more profiling and optimization work could be done; I'm happy to help with that.
My device is a OnePlus Ace 3 with a Snapdragon 8 Gen 2.
```diff
--- a/ggml/src/ggml-qnn/graph.cpp
+++ b/ggml/src/ggml-qnn/graph.cpp
@@ -192,8 +192,15 @@ qnn_graph::qnn_graph(const std::string &graph_name, QNNBackend device, std::shar
         graph_vtcm_config.option = QNN_GRAPH_CONFIG_OPTION_CUSTOM;
         graph_vtcm_config.customConfig = &vtcm_config;

+        QnnHtpGraph_CustomConfig_t precision_config;
+        precision_config.option = QNN_HTP_GRAPH_CONFIG_OPTION_PRECISION;
+        precision_config.precision = QNN_PRECISION_FLOAT16;
+        QnnGraph_Config_t graph_precision_config;
+        graph_precision_config.option = QNN_GRAPH_CONFIG_OPTION_CUSTOM;
+        graph_precision_config.customConfig = &precision_config;
+
         const QnnGraph_Config_t *graph_configs[] = {&graph_hvx_config, &graph_dlbc_config, &graph_vtcm_config,
-                                                    &graph_opt_config, nullptr};
+                                                    &graph_opt_config, &graph_precision_config, nullptr};
         error = qnn_interface->qnn_graph_create(qnn_context, graph_name.c_str(), graph_configs, &graph_handle);
     } else {
         error = qnn_interface->qnn_graph_create(qnn_context, graph_name.c_str(), nullptr, &graph_handle);
```
Originally posted by @Gianthard-cyh in #20
Hi @TerryT9, I encountered the same issue and I'm not sure how to resolve it. Could you share how you solved it?
Hi @Gianthard-cyh, would you mind saying which model you were using?
I'm using Llama-3.2-1B-Instruct-f16.gguf. By the way, I was able to run the model successfully after increasing the size limit of the MUL_MAT op and changing the NPU precision option (the Convert op fails to execute without this). However, as stated in previous issues, the NPU backend achieves around 1/3 of CPU performance. I think more profiling and optimization work could be done; I'm happy to help with that. My device is a OnePlus Ace 3 with a Snapdragon 8 Gen 2.
nice one, do you think we can force the NPU to use `F16` even in `F32` mode? thought we can try!
I'm not sure what this option does. It just works after adding this option; otherwise, a strange error occurs when executing the Convert node in the graph. I’ve also tried calling the op validation API before adding the node to the graph — the op configuration passed the validation, but the execution still failed.
Originally posted by @Gianthard-cyh in #20
Hi @Gianthard-cyh, thought we could talk here.
https://github.com/chraac/llama.cpp/issues/20#issuecomment-2694325282 by @chraac
tbh, while these NPUs claim impressive XX Tflops performance, these are peak numbers limited by various conditions like memory bandwidth. The mul_mat operation is too simple to fully utilize the NPU's processing power.
I think writing some fused ops may help (e.g. FlashAttention); I'm researching this now. But I believe the first thing to do is set up profiling infrastructure for the NPU backend so we can identify the bottleneck. This is what I am working on. Once I have any progress, I'll open a PR.
I think writing some fused ops may help (e.g. FlashAttention); I'm researching this now. But I believe the first thing to do is set up profiling infrastructure for the NPU backend so we can identify the bottleneck. This is what I am working on. Once I have any progress, I'll open a PR.
Yeah, definitely welcome! Agree that we can run a profiler on it. Actually, I've been thinking there are 2 different levels:
- We can utilize the Linux profiler to profile our graph mapping (there's a ticket in the backlog)
- Use Qualcomm's profiler for profiling QNN internal usage (haven't fully figured out how to wire this up yet; rough sketch below)
If you have any thoughts anytime, feel free to post here!
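For the second bullet, here's roughly what I have in mind using the raw QNN C API (just a sketch; the handles are assumed to already exist, and the wrapper types/names in this repo look different):

```cpp
#include "QnnInterface.h"
#include "QnnProfile.h"

// Sketch: ask the backend for detailed per-op profiling, then pass the
// profile handle on every graph execute call so the SDK records events.
Qnn_ErrorHandle_t enable_detailed_profiling(const QNN_INTERFACE_VER_TYPE &iface,
                                            Qnn_BackendHandle_t backend_handle,
                                            Qnn_ProfileHandle_t *out_profile) {
    // QNN_PROFILE_LEVEL_DETAILED produces per-node sub-events;
    // QNN_PROFILE_LEVEL_BASIC only gives the top-level timings.
    return iface.profileCreate(backend_handle, QNN_PROFILE_LEVEL_DETAILED, out_profile);
}

// Later, on each execution:
//   iface.graphExecute(graph_handle, inputs, n_inputs, outputs, n_outputs,
//                      profile_handle, /*signal*/ nullptr);
```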
Regarding your change here:
+ precision_config.precision = QNN_PRECISION_FLOAT16;
did you test it on an F32 model? ahh, curious whether we can force the precision here to F16 always
Regarding your change here:
precision_config.precision = QNN_PRECISION_FLOAT16;
did you test it on an F32 model? ahh, curious whether we can force the precision here to F16 always
Failed....
Regarding your change here:
precision_config.precision = QNN_PRECISION_FLOAT16;
did you test it on an F32 model? ahh, curious whether we can force the precision here to F16 always
Failed....
Fine, I'll test this later.
Feel free to try my script for a quick prototype verification. It can run the QNN backend through Qualcomm's NPU emulator.
./llama-cpp-qnn-builder/docker/docker_compose_compile.sh -r -d --print-build-time --build-linux-x64 --run-tests
One thing I forgot to say yesterday: it looks like the convert op is horribly slow on the NPU:
Unfortunately we discovered that the conversion operations as implemented on the NPU were extremely slow, much slower than the main matrix multiplication in fact. You can see the results in the npu_quant_profile.csv file in this repository, with conversions taking over 75% of the time.
https://github.com/usefulsensors/qc_npu_benchmark
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [ROOT][RPC (finalize) time] BACKEND : 4905 US
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [ROOT][QNN accelerator (finalize) time] BACKEND : 4753 US
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [ROOT][Accelerator (finalize) time] BACKEND : 4559 US
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [ROOT][QNN (finalize) time] FINALIZE : 22743 US
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [ROOT][Number of HVX threads used] BACKEND : 4 COUNT
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [ROOT][RPC (execute) time] BACKEND : 3039 US
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [ROOT][QNN accelerator (execute) time] BACKEND : 2880 US
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [ROOT][Num times yield occured] BACKEND : 0 COUNT
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [ROOT][Time for initial VTCM acquire] BACKEND : 2327 US
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [ROOT][Time for HVX + HMX power on and acquire] BACKEND : 4781 US
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [ROOT][Accelerator (execute) time (cycles)] BACKEND : 4974277 CYCLES
[dump_profile_info, 461]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [SUB][Number of HVX threads used] BACKEND : 4 COUNT
[dump_profile_info, 461]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [SUB][RPC (execute) time] BACKEND : 3039 US
[dump_profile_info, 461]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [SUB][QNN accelerator (execute) time] BACKEND : 2880 US
[dump_profile_info, 461]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [SUB][Num times yield occured] BACKEND : 0 COUNT
[dump_profile_info, 461]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [SUB][Time for initial VTCM acquire] BACKEND : 2327 US
[dump_profile_info, 461]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [SUB][Time for HVX + HMX power on and acquire] BACKEND : 4781 US
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [ROOT][Accelerator (execute) time] BACKEND : 2264 US
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [ROOT][Accelerator (execute excluding wait) time] BACKEND : 1276 US
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [ROOT][QNN (execute) time] EXECUTE : 3539 US
I've set up a very simple profiler. It shows that graph finalization has expensive overhead: it takes ~6.42x the time of execution.
I've set up a very simple profiler. It shows that graph finalization has expensive overhead: it takes ~6.42x the time of execution.
Yeah, nice! In the current codebase, the graph finalization is a one-time process. We have a built-in graph cache (see: here) to store all finalized (or compiled) graphs, which can later be reused without this extra cost.
In other words, with the current QNN architecture we probably don't have many ways to reduce the finalize latency; we can only hide it by caching, or by adding more ops into one graph, unless Qualcomm provides additional assistance.
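To illustrate the caching idea (plain C++ sketch only; the type names here are made up and not the actual ones in the repo):

```cpp
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>

// Illustrative only: amortize the one-time finalize cost by keying finalized
// graphs on a signature built from op type + tensor shapes. The profiler
// output above already uses such strings as graph names, e.g.
// "MUL_MATf32_2048x8192f16_2048x5f32".
struct qnn_graph { /* stand-in for the backend's finalized graph wrapper */ };
using qnn_graph_ptr = std::shared_ptr<qnn_graph>;

class graph_cache {
public:
    qnn_graph_ptr get_or_build(const std::string &signature,
                               const std::function<qnn_graph_ptr()> &build_and_finalize) {
        auto it = _graphs.find(signature);
        if (it != _graphs.end()) {
            return it->second;              // reuse: no finalize cost
        }
        auto graph = build_and_finalize();  // pay the ~6x finalize cost once
        _graphs.emplace(signature, graph);
        return graph;
    }

private:
    std::unordered_map<std::string, qnn_graph_ptr> _graphs;
};
```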
Given there isn't much headroom to improve the finalization process itself, what about focusing on another aspect, like per-op optimization?
I've extracted some op-level profiling information.
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [ROOT][RPC (finalize) time] BACKEND : 5486 US
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [ROOT][QNN accelerator (finalize) time] BACKEND : 5344 US
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [ROOT][Accelerator (finalize) time] BACKEND : 5153 US
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [ROOT][QNN (finalize) time] FINALIZE : 21427 US
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [ROOT][Number of HVX threads used] BACKEND : 4 COUNT
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [ROOT][RPC (execute) time] BACKEND : 3011 US
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [ROOT][QNN accelerator (execute) time] BACKEND : 2846 US
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [ROOT][Num times yield occured] BACKEND : 0 COUNT
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [ROOT][Time for initial VTCM acquire] BACKEND : 2383 US
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [ROOT][Time for HVX + HMX power on and acquire] BACKEND : 4730 US
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [ROOT][Accelerator (execute) time (cycles)] BACKEND : 4853298 CYCLES
[dump_profile_info, 461]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [SUB][Input OpId_2 (cycles)] NODE : 130916 CYCLES
[dump_profile_info, 461]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [SUB][convert_src0:OpId_17 (cycles)] NODE : 0 CYCLES
[dump_profile_info, 461]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [SUB][Qcur-0:OpId_20 (cycles)] NODE : 2170200 CYCLES
[dump_profile_info, 461]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [SUB][Qcur-0_trans1:OpId_24 (cycles)] NODE : 2120137 CYCLES
[dump_profile_info, 461]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [SUB][Qcur-0 (reshaped):OpId_26 (cycles)] NODE : 12533 CYCLES
[dump_profile_info, 461]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [SUB][Output OpId_3 (cycles)] NODE : 419512 CYCLES
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [ROOT][Accelerator (execute) time] BACKEND : 2252 US
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x2048f16_2048x5f32#RESHAPE(MUL_MAT)f32_64x32x5f32]qnn_profile: [ROOT][Accelerator (execute excluding wait) time] BACKEND : 1259 US
Surprisingly, the convert operator takes 0 cycles. This might be related to the F16 setting I'm using; I'll try F32 later to check. I also noticed the transpose op takes about 50% of the time.
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x8192f16_2048x5f32]qnn_profile: [ROOT][Number of HVX threads used] BACKEND : 4 COUNT
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x8192f16_2048x5f32]qnn_profile: [ROOT][RPC (execute) time] BACKEND : 3895 US
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x8192f16_2048x5f32]qnn_profile: [ROOT][QNN accelerator (execute) time] BACKEND : 3859 US
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x8192f16_2048x5f32]qnn_profile: [ROOT][Num times yield occured] BACKEND : 0 COUNT
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x8192f16_2048x5f32]qnn_profile: [ROOT][Time for initial VTCM acquire] BACKEND : 1627 US
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x8192f16_2048x5f32]qnn_profile: [ROOT][Time for HVX + HMX power on and acquire] BACKEND : 2047 US
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x8192f16_2048x5f32]qnn_profile: [ROOT][Accelerator (execute) time (cycles)] BACKEND : 12914445 CYCLES
[dump_profile_info, 461]: [qnn-npu][MUL_MATf32_2048x8192f16_2048x5f32]qnn_profile: [SUB][Input OpId_2 (cycles)] NODE : 119521 CYCLES
[dump_profile_info, 461]: [qnn-npu][MUL_MATf32_2048x8192f16_2048x5f32]qnn_profile: [SUB][convert_src0:OpId_60 (cycles)] NODE : 0 CYCLES
[dump_profile_info, 461]: [qnn-npu][MUL_MATf32_2048x8192f16_2048x5f32]qnn_profile: [SUB][ffn_gate-0:OpId_63 (cycles)] NODE : 4397240 CYCLES
[dump_profile_info, 461]: [qnn-npu][MUL_MATf32_2048x8192f16_2048x5f32]qnn_profile: [SUB][ffn_gate-0_trans1:OpId_67 (cycles)] NODE : 8246209 CYCLES
[dump_profile_info, 461]: [qnn-npu][MUL_MATf32_2048x8192f16_2048x5f32]qnn_profile: [SUB][Output OpId_3 (cycles)] NODE : 151475 CYCLES
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x8192f16_2048x5f32]qnn_profile: [ROOT][Accelerator (execute) time] BACKEND : 3441 US
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x8192f16_2048x5f32]qnn_profile: [ROOT][Accelerator (execute excluding wait) time] BACKEND : 2649 US
[dump_profile_info, 444]: [qnn-npu][MUL_MATf32_2048x8192f16_2048x5f32]qnn_profile: [ROOT][QNN (execute) time] EXECUTE : 5645 US
Here is the FFN block.
Surprisingly, the convert operator takes 0 cycles. This might be related to the F16 setting I'm using; I'll try F32 later to check. I also noticed the transpose op takes about 50% of the time.
Excellent observation! Regarding the transpose operation consuming ~50% of the execution time: transposes involve complex memory access patterns rather than heavy computation, which often makes them expensive despite appearing simple. IMO, most hardware architectures struggle with the non-contiguous memory access patterns that transposes require.
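To make that concrete, here's what a naive transpose looks like in plain scalar C++ (illustrative only, not the NPU kernel):

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative only: naive row-major transpose of an f16 matrix.
// Every element is read once and written once (zero arithmetic), but each
// store strides by `rows` elements, so consecutive writes land on different
// cache/VTCM lines. For the ffn_gate case above (2048 x 8192 f16 ≈ 32 MiB)
// the working set is far larger than the ~8 MB TCM, so the op is dominated
// by DDR traffic rather than by compute.
void transpose_f16(const uint16_t *src, uint16_t *dst, size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; ++r) {
        for (size_t c = 0; c < cols; ++c) {
            dst[c * rows + r] = src[r * cols + c];  // non-contiguous store
        }
    }
}
```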
From the Hexagon block diagram here (found in this article: Qualcomm's Hexagon DSP, and now, NPU -- Chips and Cheese):
it appears that there's a TCM inside with only 8 MB capacity for the v73 arch, which I think is insufficient to cache relatively large tensors; this significantly impacts the performance of operations requiring non-contiguous memory access patterns (like transpose).
Do we have any memory access metrics available in the current profiler that could help us analyze this bottleneck better? Understanding the memory access patterns could help us optimize our operations to work within these hardware constraints.
Update here:
Integrated the QNN NPU event tracing function into my fork and ran a test on my 8 Gen 2 device. Results are listed below; we can take a further look.
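For reference, the dump below is basically produced by walking QNN's profile events, roughly like this (raw QNN C API sketch; the wrapper in my fork is different and error handling is trimmed):

```cpp
#include "QnnInterface.h"
#include "QnnProfile.h"

// Sketch: after graphExecute() with a profile handle attached, enumerate the
// top-level events and their sub-events (the sub_event lines in the dump).
static void dump_profile_events(const QNN_INTERFACE_VER_TYPE &iface,
                                Qnn_ProfileHandle_t profile) {
    const QnnProfile_EventId_t *events = nullptr;
    uint32_t num_events = 0;
    if (iface.profileGetEvents(profile, &events, &num_events) != QNN_SUCCESS) {
        return;
    }
    for (uint32_t i = 0; i < num_events; ++i) {
        QnnProfile_EventData_t event_data = {};
        iface.profileGetEventData(events[i], &event_data);
        // event_data carries the identifier/value/unit that become lines like
        // "RPC (execute) time, duration: 29.409 ms".

        const QnnProfile_EventId_t *sub_events = nullptr;
        uint32_t num_sub = 0;
        iface.profileGetSubEvents(events[i], &sub_events, &num_sub);
        for (uint32_t j = 0; j < num_sub; ++j) {
            QnnProfile_EventData_t sub_data = {};
            iface.profileGetEventData(sub_events[j], &sub_data);
            // per-node entries, e.g. "Vcur-2:OpId_45 (cycles), cycles: 35690645"
        }
    }
}
```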
event for graph MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16, 25.070 ms:
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]print_profile_events start ----------------
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[0]: Number of HVX threads used, count: 4
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[1]: RPC (execute) time, duration: 29.409 ms
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[2]: QNN accelerator (execute) time, duration: 25.280 ms
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[3]: Num times yield occured, count: 0
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[4]: Time for initial VTCM acquire, duration: 0.839 ms
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[5]: Time for HVX + HMX power on and acquire, duration: 1.121 ms
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[6]: Accelerator (execute) time (cycles), sub_count: 3, start -------------
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]sub_event[0]: Input OpId_2 (cycles), cycles: 340526
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]sub_event[1]: Vcur-2:OpId_45 (cycles), cycles: 35690645
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]sub_event[2]: Output OpId_3 (cycles), cycles: 10642
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[6]: Accelerator (execute) time (cycles), cycles: 36041813, end --------------
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[7]: Accelerator (execute) time, duration: 25.070 ms
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[8]: Accelerator (execute excluding wait) time, duration: 24.541 ms
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[9]: Unknown event type,
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[10]: QNN (execute) time, duration: 30.237 ms
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]print_profile_events end -----------------
graph MUL_MATf32_2048x2048q4_K_2048x2f32#ADD(MUL_MAT,ADD)f32_2048x2f32, 71.114 ms:
[profiler][MUL_MATf32_2048x2048q4_K_2048x2f32#ADD(MUL_MAT,ADD)f32_2048x2f32]print_profile_events start ----------------
[profiler][MUL_MATf32_2048x2048q4_K_2048x2f32#ADD(MUL_MAT,ADD)f32_2048x2f32]event[0]: Number of HVX threads used, count: 4
[profiler][MUL_MATf32_2048x2048q4_K_2048x2f32#ADD(MUL_MAT,ADD)f32_2048x2f32]event[1]: RPC (execute) time, duration: 75.887 ms
[profiler][MUL_MATf32_2048x2048q4_K_2048x2f32#ADD(MUL_MAT,ADD)f32_2048x2f32]event[2]: QNN accelerator (execute) time, duration: 71.470 ms
[profiler][MUL_MATf32_2048x2048q4_K_2048x2f32#ADD(MUL_MAT,ADD)f32_2048x2f32]event[3]: Num times yield occured, count: 0
[profiler][MUL_MATf32_2048x2048q4_K_2048x2f32#ADD(MUL_MAT,ADD)f32_2048x2f32]event[4]: Time for initial VTCM acquire, duration: 1.051 ms
[profiler][MUL_MATf32_2048x2048q4_K_2048x2f32#ADD(MUL_MAT,ADD)f32_2048x2f32]event[5]: Time for HVX + HMX power on and acquire, duration: 1.131 ms
[profiler][MUL_MATf32_2048x2048q4_K_2048x2f32#ADD(MUL_MAT,ADD)f32_2048x2f32]event[6]: Accelerator (execute) time (cycles), sub_count: 4, start -------------
[profiler][MUL_MATf32_2048x2048q4_K_2048x2f32#ADD(MUL_MAT,ADD)f32_2048x2f32]sub_event[0]: Input OpId_2 (cycles), cycles: 10321
[profiler][MUL_MATf32_2048x2048q4_K_2048x2f32#ADD(MUL_MAT,ADD)f32_2048x2f32]sub_event[1]: node_55:OpId_40 (cycles), cycles: 207758162
[profiler][MUL_MATf32_2048x2048q4_K_2048x2f32#ADD(MUL_MAT,ADD)f32_2048x2f32]sub_event[2]: ffn_inp-1:OpId_44 (cycles), cycles: 1054393
[profiler][MUL_MATf32_2048x2048q4_K_2048x2f32#ADD(MUL_MAT,ADD)f32_2048x2f32]sub_event[3]: Output OpId_3 (cycles), cycles: 10415
[profiler][MUL_MATf32_2048x2048q4_K_2048x2f32#ADD(MUL_MAT,ADD)f32_2048x2f32]event[6]: Accelerator (execute) time (cycles), cycles: 208833291, end --------------
[profiler][MUL_MATf32_2048x2048q4_K_2048x2f32#ADD(MUL_MAT,ADD)f32_2048x2f32]event[7]: Accelerator (execute) time, duration: 71.114 ms
[profiler][MUL_MATf32_2048x2048q4_K_2048x2f32#ADD(MUL_MAT,ADD)f32_2048x2f32]event[8]: Accelerator (execute excluding wait) time, duration: 70.559 ms
[profiler][MUL_MATf32_2048x2048q4_K_2048x2f32#ADD(MUL_MAT,ADD)f32_2048x2f32]event[9]: Unknown event type,
[profiler][MUL_MATf32_2048x2048q4_K_2048x2f32#ADD(MUL_MAT,ADD)f32_2048x2f32]event[10]: QNN (execute) time, duration: 77.248 ms
[profiler][MUL_MATf32_2048x2048q4_K_2048x2f32#ADD(MUL_MAT,ADD)f32_2048x2f32]print_profile_events end -----------------
graph MUL_MATf32_2048x8192q4_K_2048x2f32, 188.732 ms:
[profiler][MUL_MATf32_2048x8192q4_K_2048x2f32]print_profile_events start ----------------
[profiler][MUL_MATf32_2048x8192q4_K_2048x2f32]event[0]: Number of HVX threads used, count: 4
[profiler][MUL_MATf32_2048x8192q4_K_2048x2f32]event[1]: RPC (execute) time, duration: 193.734 ms
[profiler][MUL_MATf32_2048x8192q4_K_2048x2f32]event[2]: QNN accelerator (execute) time, duration: 189.383 ms
[profiler][MUL_MATf32_2048x8192q4_K_2048x2f32]event[3]: Num times yield occured, count: 0
[profiler][MUL_MATf32_2048x8192q4_K_2048x2f32]event[4]: Time for initial VTCM acquire, duration: 0.871 ms
[profiler][MUL_MATf32_2048x8192q4_K_2048x2f32]event[5]: Time for HVX + HMX power on and acquire, duration: 1.143 ms
[profiler][MUL_MATf32_2048x8192q4_K_2048x2f32]event[6]: Accelerator (execute) time (cycles), sub_count: 3, start -------------
[profiler][MUL_MATf32_2048x8192q4_K_2048x2f32]sub_event[0]: Input OpId_2 (cycles), cycles: 6347
[profiler][MUL_MATf32_2048x8192q4_K_2048x2f32]sub_event[1]: ffn_gate-0:OpId_31 (cycles), cycles: 740788319
[profiler][MUL_MATf32_2048x8192q4_K_2048x2f32]sub_event[2]: Output OpId_3 (cycles), cycles: 13242
[profiler][MUL_MATf32_2048x8192q4_K_2048x2f32]event[6]: Accelerator (execute) time (cycles), cycles: 740807908, end --------------
[profiler][MUL_MATf32_2048x8192q4_K_2048x2f32]event[7]: Accelerator (execute) time, duration: 188.732 ms
[profiler][MUL_MATf32_2048x8192q4_K_2048x2f32]event[8]: Accelerator (execute excluding wait) time, duration: 188.183 ms
[profiler][MUL_MATf32_2048x8192q4_K_2048x2f32]event[9]: Unknown event type,
[profiler][MUL_MATf32_2048x8192q4_K_2048x2f32]event[10]: QNN (execute) time, duration: 199.446 ms
[profiler][MUL_MATf32_2048x8192q4_K_2048x2f32]print_profile_events end -----------------
Hi @Gianthard-cyh, how are things going on your side? I've added this issue to the project backlog and will take a more detailed look when I have some time available.
I've discovered a repo that implements all the ops needed in LLMs: https://github.com/UbiquitousLearning/mllm/tree/main. As for memory profiling, I didn't find a corresponding API.
I've discovered a repo that implements all the ops needed in LLMs: https://github.com/UbiquitousLearning/mllm/tree/main. As for memory profiling, I didn't find a corresponding API.
Nice one! Did you test its performance on an 8 Gen 2 phone?
The prefill stage is performed by QNN & CPU, and the inference stage is performed by the CPU.
Interesting. During the prefill stage we typically deal with larger tensors, which gives us more opportunities to leverage GPU/NPU capabilities. I'll take a closer look at this.
Hi @Gianthard-cyh, we've recently added a new hexagon-npu backend that is totally independent from QNN; parts of its source code run inside QuRT and can manipulate the HVX registers directly, similar to what MLLM did. You could have a look if you're still interested.
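For anyone curious what "manipulating the HVX registers" means in practice, here's a minimal sketch using the Hexagon SDK intrinsics (illustrative only; the real kernels in the backend are more involved, with f16/qf32 math, VTCM staging, and multi-threading):

```cpp
#include <hexagon_types.h>
#include <hexagon_protos.h>
#include <cstdint>

// Illustrative only: element-wise int32 add using 128-byte HVX vectors.
// Assumes the buffers are 128-byte aligned and n is a multiple of 32;
// the shape is the same for the real kernels: load 32 lanes, one vector
// op, store.
void hvx_add_i32(const int32_t *a, const int32_t *b, int32_t *out, int n) {
    constexpr int kLanes = 128 / sizeof(int32_t);  // 32 int32 per HVX vector
    for (int i = 0; i + kLanes <= n; i += kLanes) {
        HVX_Vector va = *reinterpret_cast<const HVX_Vector *>(a + i);
        HVX_Vector vb = *reinterpret_cast<const HVX_Vector *>(b + i);
        HVX_Vector vs = Q6_Vw_vadd_VwVw(va, vb);   // 32-lane word add
        *reinterpret_cast<HVX_Vector *>(out + i) = vs;
    }
    // scalar tail for n not a multiple of 32 omitted for brevity
}
```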
@chraac @Gianthard-cyh I tested the QNN backend on Snapdragon 8 Gen 4 and found that bind_tensor accounts for 84% of the time (46500 ms per decode), while qnn_graph->execute uses 14% (7800 ms). This indicates a significant bottleneck in the tensor binding phase. Notably, Qualcomm AI Hub (https://aihub.qualcomm.com/mobile/models/qwen2_7b_instruct?searchTerm=qwen) achieves faster performance with a 13.7 tokens/s response rate, which suggests room for optimization in the current QNN implementation.
@chraac @Gianthard-cyh I tested the QNN backend on Snapdragon 8 Gen 4 and found that bind_tensor accounts for 84% of the time (46500 ms per decode), while qnn_graph->execute uses 14% (7800 ms). This indicates a significant bottleneck in the tensor binding phase. Notably, Qualcomm AI Hub (https://aihub.qualcomm.com/mobile/models/qwen2_7b_instruct?searchTerm=qwen) achieves faster performance with a 13.7 tokens/s response rate, which suggests room for optimization in the current QNN implementation.
Hi @Dantetang , thank you for the testing!
Are you using a quantized model? If so, I believe the time consumed in bind_tensor is caused by CPU-side dequantization, since there's no direct dequant in QNN.
I think we can gate it behind some predefined macro, but that will disable offloading some kinds of MUL_MAT ops to QNN. Gonna have a patch for it soon.
I'm also focusing on improving the hexagon-npu backend that does not rely on QNN, and gonna spend more time on it.
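Roughly what I have in mind for the gating (sketch only; the macro name and the helper below are hypothetical, not existing code in the repo):

```cpp
#include "ggml.h"

// Hypothetical sketch of gating quantized MUL_MAT offload behind a macro.
// GGML_QNN_OFFLOAD_QUANTIZED_MUL_MAT and is_quantized_type() are made-up
// names; the real supports-op hook in the backend differs.
static bool is_quantized_type(ggml_type type) {
    return type != GGML_TYPE_F32 && type != GGML_TYPE_F16;
}

static bool device_supports_mul_mat(const ggml_tensor *op) {
#ifndef GGML_QNN_OFFLOAD_QUANTIZED_MUL_MAT
    // Without the macro, reject quantized weights so they stay on the CPU
    // backend, instead of paying a CPU-side dequant + bind_tensor copy on
    // every call just to feed QNN an f32/f16 tensor.
    if (is_quantized_type(op->src[0]->type)) {
        return false;
    }
#endif
    return true;  // other shape/type checks omitted
}
```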
@chraac, thank you for your advice.
I'm using the qwen2-1_5b-instruct-fp16.gguf model, and according to the debug logs, the MUL_MAT operations are being offloaded to QNN. Here's the relevant debug log output. Notably, the debug logs indicate that the QNN backend requires converting the f16 type to f32 format.
graph name is MULf32_1536f32_1536f32#MUL_MAT(NONE,MUL)#ADD(MUL_MAT,NONE)#RESHAPE(ADD)f32_128x12f32
[profiler][qnn-npu][4]device_compute_graph(include get_qnn_graph_from_cache & qnn_graph->execute), duration time: 115.9010 ms
graph name is MUL_MATf32_1536x256f32_1536f32#ADD(MUL_MAT,NONE)#RESHAPE(ADD)f32_128x2f32
[profiler][qnn-npu][3]device_compute_graph(include get_qnn_graph_from_cache & qnn_graph->execute), duration time: 24.2560 ms
graph name is MUL_MATf32_1536x256f32_1536f32#ADD(MUL_MAT,NONE)#RESHAPE(ADD)f16_256f16
[profiler][qnn-npu][4]device_compute_graph(include get_qnn_graph_from_cache & qnn_graph->execute), duration time: 25.9500 ms
graph name is MUL_MATf32_1536x1536f32_1536f32#ADD(MUL_MAT,GET_ROWS)f32_1536f32
[profiler][qnn-npu][2]device_compute_graph(include get_qnn_graph_from_cache & qnn_graph->execute), duration time: 113.7300 ms
graph name is MULf32_1536f32_1536f32#MUL_MAT(NONE,MUL)f32_8960f32
[profiler][qnn-npu][2]device_compute_graph(include get_qnn_graph_from_cache & qnn_graph->execute), duration time: 484.1350 ms
graph name is MUL_MATf32_1536x8960f32_1536f32#MUL(SILU,MUL_MAT)#MUL_MAT(NONE,MUL)#ADD(MUL_MAT,ADD)f32_1536f32
[profiler][qnn-npu][4]device_compute_graph(include get_qnn_graph_from_cache & qnn_graph->execute), duration time: 867.9380 ms
graph name is MULf32_1536f32_1536f32#MUL_MAT(NONE,MUL)#ADD(MUL_MAT,NONE)#RESHAPE(ADD)f32_128x12f32
[profiler][qnn-npu][4]device_compute_graph(include get_qnn_graph_from_cache & qnn_graph->execute), duration time: 104.3500 ms
graph name is MUL_MATf32_1536x256f32_1536f32#ADD(MUL_MAT,NONE)#RESHAPE(ADD)f32_128x2f32
[profiler][qnn-npu][3]device_compute_graph(include get_qnn_graph_from_cache & qnn_graph->execute), duration time: 32.1500 ms
graph name is MUL_MATf32_1536x256f32_1536f32#ADD(MUL_MAT,NONE)#RESHAPE(ADD)f16_256f16
[profiler][qnn-npu][4]device_compute_graph(include get_qnn_graph_from_cache & qnn_graph->execute), duration time: 28.1820 ms
graph name is MUL_MATf32_1536x1536f32_1536f32#ADD(MUL_MAT,ADD)f32_1536f32
[profiler][qnn-npu][2]device_compute_graph(include get_qnn_graph_from_cache & qnn_graph->execute), duration time: 114.6940 ms
graph name is MULf32_1536f32_1536f32#MUL_MAT(NONE,MUL)f32_8960f32
[profiler][qnn-npu][2]device_compute_graph(include get_qnn_graph_from_cache & qnn_graph->execute), duration time: 502.1110 ms
graph name is MUL_MATf32_1536x8960f32_1536f32#MUL(SILU,MUL_MAT)#MUL_MAT(NONE,MUL)#ADD(MUL_MAT,ADD)f32_1536f32
[profiler][qnn-npu][4]device_compute_graph(include get_qnn_graph_from_cache & qnn_graph->execute), duration time: 819.2710 ms
graph name is MULf32_1536f32_1536f32#MUL_MAT(NONE,MUL)#ADD(MUL_MAT,NONE)#RESHAPE(ADD)f32_128x12f32
[profiler][qnn-npu][4]device_compute_graph(include get_qnn_graph_from_cache & qnn_graph->execute), duration time: 112.3110 ms
graph name is MUL_MATf32_1536x256f32_1536f32#ADD(MUL_MAT,NONE)#RESHAPE(ADD)f32_128x2f32
Hi @Dantetang @Gianthard-cyh @cm4ker, sorry for bothering you! We've applied some optimizations to the hexagon-npu backend so that it now utilizes HVX instructions. While it's still slower than the CPU backend, the performance is significantly improved compared to before. Feel free to give it a try if you're interested. For a technical overview, you can check it out here.
Hi @Dantetang @Gianthard-cyh @cm4ker, sorry for bothering you! We've applied some optimizations to the hexagon-npu backend so that it now utilizes HVX instructions. While it's still slower than the CPU backend, the performance is significantly improved compared to before. Feel free to give it a try if you're interested. For a technical overview, you can check it out here.
@chraac I'm glad to help with testing! But I have a question: My laptop CPU is a Snapdragon X Elite processor (with Hexagon NPU - Snapdragon(R) X Elite - X1E80100 - Qualcomm(R) Hexagon(TM) NPU).
I successfully built llama with QNN support, but when I try to compile Hexagon support, it seems the SDK I downloaded (HEXAGON_SDK_ROOT env) is wrong. Can you point me to where I can download it?
I successfully built llama with QNN support, but when I try to compile Hexagon support, it seems the SDK I downloaded (HEXAGON_SDK_ROOT env) is wrong. Can you point me to where I can download it?
Sure! To download the Hexagon SDK, you first need a Qualcomm account, then download it through their package manager, QPM; have a look at their website: https://docs.qualcomm.com/bundle/publicresource/topics/80-77512-1/hexagon-dsp-sdk-getting-started.html?product=1601111740010422 It did take me some time to set up this SDK, so if you encounter any problems, feel free to create a discussion and I'll help. Gonna add a page detailing the build instructions here soon...
Created a discussion here for the Windows build instructions of the hexagon-npu backend:
#44
Hi, I'm curious about the current implementation for matmul. I've noticed that implementations like ExecuTorch and PowerServe convert matmul into QNN convolution to achieve the desired performance. However, if the implementation solely relies on HVX and scalar instructions, it's unlikely to reach the expected performance. As far as I know, the vast majority of the computational power of the Hexagon NPU comes from its HMX units (capable of dozens of TFLOPS), but unfortunately, these units are not publicly accessible. My measurements indicate that HVX only offers computational power in the range of tens to hundreds of GFLOPS. Therefore, even if the overhead from the QNN framework and communication is completely eliminated, operators implemented using HVX are unlikely to be significantly faster than those implemented on the CPU.
Hi, I'm curious about the current implementation for matmul. I've noticed that implementations like ExecuTorch and PowerServe convert matmul into QNN convolution to achieve the desired performance. However, if the implementation solely relies on HVX and scalar instructions, it's unlikely to reach the expected performance. As far as I know, the vast majority of the computational power of the Hexagon NPU comes from its HMX units (capable of dozens of TFLOPS), but unfortunately, these units are not publicly accessible. My measurements indicate that HVX only offers computational power in the range of tens to hundreds of GFLOPS. Therefore, even if the overhead from the QNN framework and communication is completely eliminated, operators implemented using HVX are unlikely to be significantly faster than those implemented on the CPU.
Hi there,
Regarding the matrix multiplication discussion:
I've noticed that implementations like ExecuTorch and PowerServe convert matmul into QNN convolution to achieve the desired performance.
Could you share the specific links you're referring to? From my investigation of executorch, their mul_mat implementation appears to be a straightforward wrapper around QNN's QNN_OP_PACKAGE_NAME_QTI_AISW.MatMul (see op_matmul.py#L54 and qnn_constants.py#L307). I'd appreciate seeing examples of the conv-based approach you mentioned.
Regarding performance considerations:
the vast majority of the computational power of the Hexagon NPU comes from its HMX units (capable of dozens of TFLOPS), but unfortunately, these units are not publicly accessible. My measurements indicate that HVX only offers computational power in the range of tens to hundreds of GFLOPS.
While we cannot directly access the HMX instruction set, I suspect the performance gap may not be as significant as expected. Based on my experience, the HVX instructions we use are likely macro instructions that get decomposed into micro-ops and executed on similar underlying units as HMX. If instruction dispatch isn't the bottleneck, migrating from HVX to HMX might not yield dramatic improvements.
The real constraint in mobile inference is memory bandwidth. For instance, the latest Snapdragon 8 Elite has only 84.8 GB/s of memory bandwidth, which becomes the limiting factor, especially for single-operator graphs where you must read-compute-write sequentially. As I discussed here, memory bandwidth constraints typically limit performance to hundreds of GFLOPS—far below the NPU's theoretical peak.
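To put rough numbers on that, here's a back-of-the-envelope roofline for the ffn_gate matmul shape from the logs above (2048x8192 f16 weights, batch of 5), assuming that 84.8 GB/s figure:

```cpp
#include <cstdio>

// Back-of-the-envelope roofline: at decode-time batch sizes the weight
// matrix is read once per token and dwarfs the activations, so bandwidth
// sets a hard ceiling regardless of whether HVX or HMX does the math.
int main() {
    const double rows = 2048, cols = 8192, batch = 5;   // shape from the log
    const double weight_bytes = rows * cols * 2.0;      // f16 weights ≈ 33.5 MB
    const double bandwidth = 84.8e9;                    // bytes/s (quoted figure)
    const double flops = 2.0 * rows * cols * batch;     // ≈ 168 MFLOP

    const double min_time_s = weight_bytes / bandwidth; // ≈ 0.40 ms just to stream weights
    const double gflops_ceiling = flops / min_time_s / 1e9;

    printf("bandwidth-limited ceiling: %.0f GFLOPS\n", gflops_ceiling);  // ≈ 424 GFLOPS
    return 0;
}
```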
Could you share more details about your test methodology and target device? I suspect the performance gap between HVX and HMX may be smaller than anticipated in practice, though without HMX access, this remains a hypothesis worth validating...