llama-cli on Hexagon-NPU introduces a lot of extra time
I found something weird about Hexagon-NPU backend performance.
First, I ran test-backend-ops on an Android device with a Snapdragon 8 Gen 3, and the results look normal:
MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0): 744 runs - 22486.85 us/run - 134.48 MFLOP/run - 5.98 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1): 744 runs - 13501.82 us/run - 134.48 MFLOP/run - 9.96 GFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0): 852 runs - 9510.09 us/run - 117.44 MFLOP/run - 12.35 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0): 852 runs - 8131.15 us/run - 117.44 MFLOP/run - 14.44 GFLOPS
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0): 852 runs - 4886.67 us/run - 117.44 MFLOP/run - 24.03 GFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0): 852 runs - 19410.53 us/run - 117.44 MFLOP/run - 6.05 GFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0): 852 runs - 408029.35 us/run - 117.44 MFLOP/run - 287.82 MFLOPS
But when I run llama-cli, the tg (token generation) performance is very bad:
llama_perf_sampler_print: sampling time = 0.87 ms / 45 runs ( 0.02 ms per token, 51605.50 tokens per second)
llama_perf_context_print: load time = 5733.53 ms
llama_perf_context_print: prompt eval time = 3360.59 ms / 42 tokens ( 80.01 ms per token, 12.50 tokens per second)
llama_perf_context_print: eval time = 2368.24 ms / 2 runs ( 1184.12 ms per token, 0.84 tokens per second)
llama_perf_context_print: total time = 8105.73 ms / 44 tokens
Even if the ops on the NPU are not very fast, it should not be this bad.
So I measured the time cost of the subgraph (NPU+CPU) compute in llama-cli with ggml_time_us, and found something very weird:
[CPU]total: 102.017 ms // CPU ggml_backend_graph_compute_async time-cost
[hexagon-npu]total: 367.552 ms // NPU ggml_backend_graph_compute_async time-cost
graph_compute cost 504.811 ms // graph_compute total time; there is a small gap between 504.811 and (102.017+367.552), ignore it for now
The total is 504.811 ms, much lower than 1184.12 ms per token.
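For reference, the per-backend numbers above were collected roughly like this (a minimal sketch, not the exact patch; the backend-name check and call site are illustrative, the real call sites live inside ggml-backend's scheduler):

```cpp
// Sketch: wrap the per-backend compute call with ggml_time_us() and
// accumulate the elapsed time per backend.
#include <cstdint>
#include <cstring>
#include "ggml.h"
#include "ggml-backend.h"

static int64_t cpu_total_us = 0;
static int64_t npu_total_us = 0;

static enum ggml_status timed_graph_compute(ggml_backend_t backend, struct ggml_cgraph * graph) {
    const int64_t t_start = ggml_time_us();
    const enum ggml_status status = ggml_backend_graph_compute_async(backend, graph);
    const int64_t t_elapsed_us = ggml_time_us() - t_start;

    if (strcmp(ggml_backend_name(backend), "hexagon-npu") == 0) {
        npu_total_us += t_elapsed_us;   // printed as "[hexagon-npu]total"
    } else {
        cpu_total_us += t_elapsed_us;   // printed as "[CPU]total"
    }
    return status;
}
```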
Then I measured the other stages of llama-cli:
graph_init cost 0.005 ms
graph_build cost 0.648 ms
ggml_backend_sched_alloc_graph cost 695.747 ms
graph_compute cost 504.811 ms
ggml_backend_sched_alloc_graph costs almost 700 ms per token! Together with graph_compute that is 695.747 + 504.811 ≈ 1200 ms, which roughly matches the 1184 ms/token eval time, so the allocation stage is where most of the extra time goes.
Then inside ggml_backend_sched_alloc_graph, the following are some functions that take up a lot of time:
ggml_backend_sched_split_graph cost 144.407 ms
clear_tensors cost 269.318 ms // in ggml_backend_buffer_reset
ggml_gallocr_init_tensor[nodes without view_src] total cost 206.843 ms
ggml_gallocr_init_tensor[nodes with view_src] total cost 69.264 ms
It seems that ggml_backend_sched_split_graph and ggml_gallocr_init_tensor involve RPC memory allocation, but is this time cost normal? It seems far too long... And why does clear_tensors cost 269 ms? It looks simple:
```cpp
void host_buffer::clear_tensors() {
    _tensors.clear();
    LOG_DEBUG("clear host_buffer(%p) tensors\n", (void *) _data);
}
```
BTW, my device: Snapdragon 8 Gen 3 (Android); test model: qwen2.5-1.5b-instruct-q4_0.gguf
Nice catch! Actually, I haven't looked into the framework's implementation yet, but there's definitely a chance for improvement here.
> ggml_backend_sched_alloc_graph costs almost 700 ms per token!
I thought it shouldn't have this much latency here. I'll have a look; I've converted it into a ticket in my project.
> And why does clear_tensors cost 269 ms? It looks simple.
One thing that comes to my mind is the fastRPC overhead; I think we can do some optimization here. But one question: did you get those timings from the prefill stage or from the decoding stage?
> did you get those timings from the prefill stage or from the decoding stage?
decoding stage.
And just for clear_tensors: if _tensors is empty, it takes almost no time; otherwise it takes roughly the full ~270 ms. The evidence: if I run llama-cli with --no-warmup, the prefill's clear_tensors takes 0.001 ms, but it takes 298 ms when I run without --no-warmup.
About optimizing llama-cli performance, I have some ideas:
- Figure out the extra time in `ggml_backend_sched_split_graph`, `clear_tensors`, and `ggml_gallocr_init_tensor`; they are far too high. But I know little about Hexagon and fastRPC...
- Reuse the compute graph. Currently every decoded token runs `build_graph`, `split_graph`, and `sched_alloc_graph` once, so the extra time is paid every time. Reusing the compute graph may help. I see ggerganov has done some work on this, but it is not merged yet (https://github.com/ggml-org/llama.cpp/pull/14482). I'll try it.
- Reduce the number of sub-graphs. For the qwen2.5-1.5b-q4_0 model there are 169 NPU sub-graphs and 170 CPU sub-graphs. Each NPU sub-graph needs to run `update` (`ggml_cgraph` -> npu_graph?); I calculate that the 169 calls take 53.6 ms in total. Reducing the number of sub-graphs may help, but more ops would need to be supported on the NPU, which is not easy.
- GEMV (mul_mat with n=1) optimization. From the Hexagon profiler, accumulating all `update` and `compute` time gives 310.1 ms, which means that even if we eliminate all the extra fastRPC time, decoding is still below 1000/310 ≈ 3.2 tokens/s (CPU time not counted). But due to bandwidth limitations, I'm not sure how much room there is to optimize Q4_0 GEMV on the NPU.
Is there anything above that I've stated incorrectly because I don't understand it well enough?
> Figure out the extra time in `ggml_backend_sched_split_graph`, `clear_tensors`, and `ggml_gallocr_init_tensor`; they are far too high. But I know little about Hexagon and fastRPC...
For this one, I think it's because we're sending too many IPC calls: currently, when a tensor is destroyed, we call an RPC function to destroy its NPU instance one by one. We can definitely do some optimization here, but let me double-check first...
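To illustrate the pattern (just a sketch with hypothetical names, not the backend's actual classes): if every host-side tensor wrapper frees its NPU-side object in its destructor with its own fastRPC call, then `_tensors.clear()` turns into one round-trip per tensor.

```cpp
// Hypothetical sketch: each host-side wrapper frees its NPU-side object in its
// destructor via a separate fastRPC call, so clearing the container costs one
// round-trip per tensor (a few ms each x hundreds of tensors => hundreds of ms).
#include <cstdint>
#include <memory>
#include <vector>

int rpc_free_tensor(uint64_t npu_handle);   // placeholder for the per-tensor fastRPC stub

struct host_tensor {
    uint64_t npu_handle;
    ~host_tensor() {
        rpc_free_tensor(npu_handle);        // one IPC round-trip per tensor
    }
};

struct host_buffer_sketch {
    std::vector<std::shared_ptr<host_tensor>> _tensors;
    void clear_tensors() {
        _tensors.clear();                   // runs N destructors => N fastRPC calls
    }
};
```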
> Reuse the compute graph. Currently every decoded token runs `build_graph`, `split_graph`, and `sched_alloc_graph` once.
Nice one! I think that's worth a try. However, the graph in ggml is quite dynamic, and inside the backend we cannot track graph modifications, so I had to call update on every execution. But we could definitely add a key or something, like we did in the QNN backend, to indicate whether the graph has been modified.
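Something along these lines could work (a rough sketch, not the backend's actual code): compute a cheap fingerprint of the ggml graph and only rebuild/update the NPU graph when it changes.

```cpp
// Rough sketch: fingerprint the ggml graph (op, type, shape of each node) so
// the backend can skip re-building its NPU graph when nothing has changed.
#include <cstddef>
#include "ggml.h"

static size_t hash_combine(size_t seed, size_t v) {
    return seed ^ (v + 0x9e3779b97f4a7c15ULL + (seed << 6) + (seed >> 2));
}

static size_t graph_key(struct ggml_cgraph * cgraph) {
    const int n_nodes = ggml_graph_n_nodes(cgraph);
    size_t key = (size_t) n_nodes;
    for (int i = 0; i < n_nodes; ++i) {
        const struct ggml_tensor * node = ggml_graph_node(cgraph, i);
        key = hash_combine(key, (size_t) node->op);
        key = hash_combine(key, (size_t) node->type);
        for (int d = 0; d < GGML_MAX_DIMS; ++d) {
            key = hash_combine(key, (size_t) node->ne[d]);
        }
    }
    return key;
}

// In the backend's graph_compute, rebuild/update only when the key changes:
//   size_t key = graph_key(cgraph);
//   if (key != cached_key) { /* rebuild or update the NPU graph */ cached_key = key; }
```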
> GEMV (mul_mat with n=1) optimization.
Yeah, that's definitely a longer-term thing. If you check out the project backlog and my recent comment, you'll see I'm working on different strategies to boost mul_mat performance - mostly focusing on GEMM cases right now. For those, we can totally use VTCM to take some pressure off the bandwidth.
Some of these strategies will help GEMV cases too, but there are probably other opportunities we should explore for those specifically.
A new test: when I enable FA (flash attention), llama-cli decoding perf goes from 0.84 tok/s to 1.1 tok/s. Split-graph stage:
ggml_backend_sched_split_graph cost 144.407 ms -> 130.47 ms
clear_tensors cost 269.318 ms -> 198.5 ms
ggml_gallocr_init_tensor[nodes without view_src] total cost 206.843 ms -> 151.5 ms
ggml_gallocr_init_tensor[nodes with view_src] total cost 69.264 ms -> 43.8 ms
graph-compute stage:
[CPU]total: 102.017 ms -> 97.3 ms
[hexagon-npu]total: 367.552 ms -> 289.4 ms
graph_compute cost 504.811 ms -> 387.8 ms // the gap between 504.811 and (102.017+367.552) disappears when FA is enabled
Besides, it's weird that when I add up the update and compute data in the profiler, the sum is 389.3 ms (device time), which is even larger than 289.4 ms (host time, the total ggml_backend_graph_compute_async).
Hi @finneyyan, I want to ask something before further testing: for the benchmark, you are running a release build, right?
> for the benchmark, you are running a release build, right?
Yes, the debug build type is much slower.
> Reusing the compute graph may help. I see ggerganov has done some work on this, but it is not merged yet (https://github.com/ggml-org/llama.cpp/pull/14482). I'll try it.
It was merged today. I simply merged `master` into `dev-refactoring`, and the perf went from 1.1 tokens/s to 2.3 tokens/s.
> It was merged today. I simply merged `master` into `dev-refactoring`, and the perf went from 1.1 tokens/s to 2.3 tokens/s.
Thank you! I'm now working on optimizing clear_tensors in my branch. I think we can have a new version soon; I'll drop a message here then!
Hi @finneyyan
I've created another PR to fix the clear_tensors issue you mentioned before; please have a look: https://github.com/chraac/llama.cpp/pull/52
> I've created another PR to fix the clear_tensors issue you mentioned before; please have a look: https://github.com/chraac/llama.cpp/pull/52
It works: clear_tensors goes 269.318 ms --(enable FA)--> 198.5 ms --(new PR)--> 3.4 ms.
I notice you added a function npu_device_tensors_free, and the key change is _tensors.clear() -> delete tensor. Why is this such a big improvement? Does _tensors.clear() need to call fastRPC while delete tensor doesn't?
For now, with FA enabled + graph reuse enabled + your new PR, the decode perf is:
- `split-graph` runs only on the first decode step and takes about 314 ms (from 700 down to 314, a great improvement).
- Subsequent steps only run `compute-graph`, which takes about 380 ms.
So the total time is 314 + 380*N ms, where N is the number of output tokens. When N is large enough the 314 ms can be ignored, and the max decode perf is 1000/380 ≈ 2.63 tokens/s.
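Just to spell out that formula (nothing new, only the numbers above):

```cpp
// Effective decode throughput with a one-time ~314 ms split/alloc cost and a
// ~380 ms per-token compute cost (numbers from the measurements above).
#include <cstdio>

int main() {
    const double t_first_ms = 314.0;   // split-graph, first decode step only
    const double t_token_ms = 380.0;   // compute-graph, every decode step
    const int    n_tokens[] = {1, 8, 64, 512};
    for (int n : n_tokens) {
        const double total_ms = t_first_ms + t_token_ms * n;
        std::printf("N=%4d -> %.2f tokens/s\n", n, n * 1000.0 / total_ms);
    }
    return 0;   // approaches 1000/380 = 2.63 tokens/s as N grows
}
```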
We need to do something about compute-graph if we want to improve perf further.
> Why is this such a big improvement? Does _tensors.clear() need to call fastRPC while delete tensor doesn't?
What I did here is just reduce the number of RPC calls. From my profiling, each RPC call can introduce several ms of latency, so I've merged them into a single RPC call.
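The pattern is roughly this (a sketch only; the actual signature of npu_device_tensors_free in the PR may differ): collect the device-side handles first, then release them all with a single fastRPC call.

```cpp
// Sketch of the batched free: ONE fastRPC round-trip for the whole buffer
// instead of one per tensor (the npu_device_tensors_free signature is assumed).
#include <cstdint>
#include <vector>

int npu_device_tensors_free(uint64_t device_handle, const uint64_t * tensor_handles, int count);

void clear_tensors_batched(uint64_t device_handle, std::vector<uint64_t> & tensor_handles) {
    if (tensor_handles.empty()) {
        return;                 // matches the --no-warmup case: nothing to free, ~0 ms
    }
    npu_device_tensors_free(device_handle, tensor_handles.data(), (int) tensor_handles.size());
    tensor_handles.clear();     // now purely host-side, essentially free
}
```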
And for the graph_compute cost, I think we can get a breakdown from the profiler output captured from logcat.
@chraac I'd like to ask: besides llama.cpp, what other inference frameworks support Qualcomm NPU deployment? Does Qualcomm itself officially use the QNN SDK and Hexagon SDK to deploy on the NPU? Do you know how their performance compares?
> what other inference frameworks support Qualcomm NPU deployment?
You can check out MLLM; they have their own model format and NPU support.
> Does Qualcomm itself officially use the QNN SDK and Hexagon SDK to deploy on the NPU? Do you know how their performance compares?
I'm not a Qualcomm employee, so I can't say for certain what they use internally. However, in large companies like this, there are typically multiple teams working on LLM inference acceleration using different approaches - some might use QNN while others use hexagon-npu directly.
They likely have access to HMX instructions that aren't publicly available, which could provide some performance benefits. That said, I think the real bottleneck is still memory bandwidth, so even with HMX, the gains might be modest.
Hey @finneyyan , it’s been a while! We’ve made several improvements to the hexagon-npu backend and are seeing significant performance gains. When you have a moment, could you test your case again? You can also try the changes in PR #59.