llama-cli on Hexagon-NPU introduces a lot of extra time
I found something weird about Hexagon-NPU backend performance.
First, I ran test-backend-ops on an Android device with a Snapdragon 8 Gen 3, and the results look normal:
MUL_MAT(type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],v=0): 744 runs - 22486.85 us/run - 134.48 MFLOP/run - 5.98 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],v=1): 744 runs - 13501.82 us/run - 134.48 MFLOP/run - 9.96 GFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0): 852 runs - 9510.09 us/run - 117.44 MFLOP/run - 12.35 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0): 852 runs - 8131.15 us/run - 117.44 MFLOP/run - 14.44 GFLOPS
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0): 852 runs - 4886.67 us/run - 117.44 MFLOP/run - 24.03 GFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0): 852 runs - 19410.53 us/run - 117.44 MFLOP/run - 6.05 GFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0): 852 runs - 408029.35 us/run - 117.44 MFLOP/run - 287.82 MFLOPS
But when I run llama-cli, the tg (token generation) performance is very bad:
llama_perf_sampler_print: sampling time = 0.87 ms / 45 runs ( 0.02 ms per token, 51605.50 tokens per second)
llama_perf_context_print: load time = 5733.53 ms
llama_perf_context_print: prompt eval time = 3360.59 ms / 42 tokens ( 80.01 ms per token, 12.50 tokens per second)
llama_perf_context_print: eval time = 2368.24 ms / 2 runs ( 1184.12 ms per token, 0.84 tokens per second)
llama_perf_context_print: total time = 8105.73 ms / 44 tokens
Even if the ops on the NPU are not very fast, it should not be this bad.
So I measured the time cost of the subgraph (NPU+CPU) compute in llama-cli with ggml_time_us, and found something very weird:
[CPU]total: 102.017 ms // CPU ggml_backend_graph_compute_async time-cost
[hexagon-npu]total: 367.552 ms // NPU ggml_backend_graph_compute_async time-cost
graph_compute cost 504.811 ms // graph_compute total time; there is a small gap between 504.811 and (102.017+367.552), ignore it for now
The total is 504.811 ms, much lower than 1184.12 ms per token.
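For reference, the per-backend numbers above were collected roughly like this (a minimal sketch, not the exact patch; the backend-name check and call site are illustrative, the real call sites live inside ggml-backend's scheduler):

```cpp
// Sketch: wrap the per-backend compute call with ggml_time_us() and
// accumulate the elapsed time per backend.
#include <cstdint>
#include <cstring>
#include "ggml.h"
#include "ggml-backend.h"

static int64_t cpu_total_us = 0;
static int64_t npu_total_us = 0;

static enum ggml_status timed_graph_compute(ggml_backend_t backend, struct ggml_cgraph * graph) {
    const int64_t t_start = ggml_time_us();
    const enum ggml_status status = ggml_backend_graph_compute_async(backend, graph);
    const int64_t t_elapsed_us = ggml_time_us() - t_start;

    if (strcmp(ggml_backend_name(backend), "hexagon-npu") == 0) {
        npu_total_us += t_elapsed_us;   // printed as "[hexagon-npu]total"
    } else {
        cpu_total_us += t_elapsed_us;   // printed as "[CPU]total"
    }
    return status;
}
```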
Then I measured the other stages of llama-cli:
graph_init cost 0.005 ms
graph_build cost 0.648 ms
ggml_backend_sched_alloc_graph cost 695.747 ms
graph_compute cost 504.811 ms
ggml_backend_sched_alloc_graph costs almost 700 ms per token! Together with graph_compute that is 695.747 + 504.811 ≈ 1200 ms, which roughly matches the 1184 ms/token eval time, so the allocation stage is where most of the extra time goes.
Then inside ggml_backend_sched_alloc_graph, the following are some functions that take up a lot of time:
ggml_backend_sched_split_graph cost 144.407 ms
clear_tensors cost 269.318 ms // in ggml_backend_buffer_reset
ggml_gallocr_init_tensor[nodes without view_src] total cost 206.843 ms
ggml_gallocr_init_tensor[nodes with view_src] total cost 69.264 ms
It seems that ggml_backend_sched_split_graph and ggml_gallocr_init_tensor involve RPC memory allocation, but is this time cost normal? It seems far too long... And why does clear_tensors cost 269 ms? It looks simple:
```cpp
void host_buffer::clear_tensors() {
    _tensors.clear();
    LOG_DEBUG("clear host_buffer(%p) tensors\n", (void *) _data);
}
```
BTW, my device: Snapdragon 8 Gen 3 (Android); test model: qwen2.5-1.5b-instruct-q4_0.gguf
Nice catch! Actually, I haven't looked into the framework's implementation yet, but there's definitely a chance for improvement here.
> ggml_backend_sched_alloc_graph costs almost 700 ms per token!
I thought it shouldn't have this much latency here. I'll have a look; I've converted it into a ticket in my project.
> And why does clear_tensors cost 269 ms? It looks simple.
One thing that comes to my mind is the fastRPC overhead; I think we can do some optimization here. But one question: did you get those timings from the prefill stage or from the decoding stage?
> did you get those timings from the prefill stage or from the decoding stage?
decoding stage.
And just for clear_tensors: if _tensors is empty, it takes almost no time; otherwise it takes roughly the full ~270 ms. The evidence: if I run llama-cli with --no-warmup, the prefill's clear_tensors takes 0.001 ms, but it takes 298 ms when I run without --no-warmup.
About optimizing llama-cli performance, I have some ideas:
- Figure out the extra time in `ggml_backend_sched_split_graph`, `clear_tensors`, and `ggml_gallocr_init_tensor`; they are far too high. But I know little about Hexagon and fastRPC...
- Reuse the compute graph. Currently every decoded token runs `build_graph`, `split_graph`, and `sched_alloc_graph` once, so the extra time is paid every time. Reusing the compute graph may help. I see ggerganov has done some work on this, but it is not merged yet (https://github.com/ggml-org/llama.cpp/pull/14482). I'll try it.
- Reduce the number of sub-graphs. For the qwen2.5-1.5b-q4_0 model there are 169 NPU sub-graphs and 170 CPU sub-graphs. Each NPU sub-graph needs to run `update` (`ggml_cgraph` -> npu_graph?); I calculate that the 169 calls take 53.6 ms in total. Reducing the number of sub-graphs may help, but more ops would need to be supported on the NPU, which is not easy.
- GEMV (mul_mat with n=1) optimization. From the Hexagon profiler, accumulating all `update` and `compute` time gives 310.1 ms, which means that even if we eliminate all the extra fastRPC time, decoding is still below 1000/310 ≈ 3.2 tokens/s (CPU time not counted). But due to bandwidth limitations, I'm not sure how much room there is to optimize Q4_0 GEMV on the NPU.
Is there anything above that I've stated incorrectly because I don't understand it well enough?
> Figure out the extra time in `ggml_backend_sched_split_graph`, `clear_tensors`, and `ggml_gallocr_init_tensor`; they are far too high. But I know little about Hexagon and fastRPC...
For this one, I think it's because we're sending too many IPC calls: currently, when a tensor is destroyed, we call an RPC function to destroy its NPU instance one by one. We can definitely do some optimization here, but let me double-check first...
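To illustrate the pattern (just a sketch with hypothetical names, not the backend's actual classes): if every host-side tensor wrapper frees its NPU-side object in its destructor with its own fastRPC call, then `_tensors.clear()` turns into one round-trip per tensor.

```cpp
// Hypothetical sketch: each host-side wrapper frees its NPU-side object in its
// destructor via a separate fastRPC call, so clearing the container costs one
// round-trip per tensor (a few ms each x hundreds of tensors => hundreds of ms).
#include <cstdint>
#include <memory>
#include <vector>

int rpc_free_tensor(uint64_t npu_handle);   // placeholder for the per-tensor fastRPC stub

struct host_tensor {
    uint64_t npu_handle;
    ~host_tensor() {
        rpc_free_tensor(npu_handle);        // one IPC round-trip per tensor
    }
};

struct host_buffer_sketch {
    std::vector<std::shared_ptr<host_tensor>> _tensors;
    void clear_tensors() {
        _tensors.clear();                   // runs N destructors => N fastRPC calls
    }
};
```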
> Reuse the compute graph. Currently every decoded token runs `build_graph`, `split_graph`, and `sched_alloc_graph` once.
Nice one! I think that's worth a try. However, the graph in ggml is quite dynamic, and inside the backend we cannot track graph modifications, so I had to call update on every execution. But we could definitely add a key or something, like we did in the QNN backend, to indicate whether the graph has been modified.
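Something along these lines could work (a rough sketch, not the backend's actual code): compute a cheap fingerprint of the ggml graph and only rebuild/update the NPU graph when it changes.

```cpp
// Rough sketch: fingerprint the ggml graph (op, type, shape of each node) so
// the backend can skip re-building its NPU graph when nothing has changed.
#include <cstddef>
#include "ggml.h"

static size_t hash_combine(size_t seed, size_t v) {
    return seed ^ (v + 0x9e3779b97f4a7c15ULL + (seed << 6) + (seed >> 2));
}

static size_t graph_key(struct ggml_cgraph * cgraph) {
    const int n_nodes = ggml_graph_n_nodes(cgraph);
    size_t key = (size_t) n_nodes;
    for (int i = 0; i < n_nodes; ++i) {
        const struct ggml_tensor * node = ggml_graph_node(cgraph, i);
        key = hash_combine(key, (size_t) node->op);
        key = hash_combine(key, (size_t) node->type);
        for (int d = 0; d < GGML_MAX_DIMS; ++d) {
            key = hash_combine(key, (size_t) node->ne[d]);
        }
    }
    return key;
}

// In the backend's graph_compute, rebuild/update only when the key changes:
//   size_t key = graph_key(cgraph);
//   if (key != cached_key) { /* rebuild or update the NPU graph */ cached_key = key; }
```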
> GEMV (mul_mat with n=1) optimization.
Yeah, that's definitely a longer-term thing. If you check out the project backlog and my recent comment, you'll see I'm working on different strategies to boost mul_mat performance - mostly focusing on GEMM cases right now. For those, we can totally use VTCM to take some pressure off the bandwidth.
Some of these strategies will help GEMV cases too, but there are probably other opportunities we should explore for those specifically.
A new test: when I enable FA (flash attention), llama-cli decoding perf goes from 0.84 tok/s to 1.1 tok/s. Split-graph stage:
ggml_backend_sched_split_graph cost 144.407 ms -> 130.47 ms
clear_tensors cost 269.318 ms -> 198.5 ms
ggml_gallocr_init_tensor[nodes without view_src] total cost 206.843 ms -> 151.5 ms
ggml_gallocr_init_tensor[nodes with view_src] total cost 69.264 ms -> 43.8 ms
graph-compute stage:
[CPU]total: 102.017 ms -> 97.3 ms
[hexagon-npu]total: 367.552 ms -> 289.4 ms
graph_compute cost 504.811 ms -> 387.8 ms // the gap between 504.811 and (102.017+367.552) disappears when FA is enabled
Besides, it's weird that when I add up the update and compute data in the profiler, the sum is 389.3 ms (device time), which is even larger than 289.4 ms (host time, the total ggml_backend_graph_compute_async).
Hi @finneyyan, I want to ask something before further testing: for the benchmark, you are running a release build, right?
> for the benchmark, you are running a release build, right?
Yes, the debug build type is much slower.
> Reusing the compute graph may help. I see ggerganov has done some work on this, but it is not merged yet (https://github.com/ggml-org/llama.cpp/pull/14482). I'll try it.
It was merged today. I simply merged `master` into `dev-refactoring`, and the perf went from 1.1 tokens/s to 2.3 tokens/s.
> It was merged today. I simply merged `master` into `dev-refactoring`, and the perf went from 1.1 tokens/s to 2.3 tokens/s.
Thank you! I'm now working on optimizing clear_tensors in my branch. I think we can have a new version soon; I'll drop a message here then!
Hi @finneyyan
I've created another PR to fix the clear_tensors issue you mentioned before; please have a look: https://github.com/chraac/llama.cpp/pull/52
> I've created another PR to fix the clear_tensors issue you mentioned before; please have a look: https://github.com/chraac/llama.cpp/pull/52
It works: clear_tensors goes 269.318 ms --(enable FA)--> 198.5 ms --(new PR)--> 3.4 ms.
I notice you added a function npu_device_tensors_free, and the key change is _tensors.clear() -> delete tensor. Why is this such a big improvement? Does _tensors.clear() need to call fastRPC while delete tensor doesn't?
For now, with FA enabled + graph reuse enabled + your new PR, the decode perf is:
- `split-graph` runs only on the first decode step and takes about 314 ms (from 700 down to 314, a great improvement).
- Subsequent steps only run `compute-graph`, which takes about 380 ms.
So the total time is 314 + 380*N ms, where N is the number of output tokens. When N is large enough the 314 ms can be ignored, and the max decode perf is 1000/380 ≈ 2.63 tokens/s.
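Just to spell out that formula (nothing new, only the numbers above):

```cpp
// Effective decode throughput with a one-time ~314 ms split/alloc cost and a
// ~380 ms per-token compute cost (numbers from the measurements above).
#include <cstdio>

int main() {
    const double t_first_ms = 314.0;   // split-graph, first decode step only
    const double t_token_ms = 380.0;   // compute-graph, every decode step
    const int    n_tokens[] = {1, 8, 64, 512};
    for (int n : n_tokens) {
        const double total_ms = t_first_ms + t_token_ms * n;
        std::printf("N=%4d -> %.2f tokens/s\n", n, n * 1000.0 / total_ms);
    }
    return 0;   // approaches 1000/380 = 2.63 tokens/s as N grows
}
```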
We need to do something about compute-graph if we want to improve perf further.
> Why is this such a big improvement? Does _tensors.clear() need to call fastRPC while delete tensor doesn't?
What I did here is just reduce the number of RPC calls. From my profiling, each RPC call can introduce several ms of latency, so I've merged them into a single RPC call.
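The pattern is roughly this (a sketch only; the actual signature of npu_device_tensors_free in the PR may differ): collect the device-side handles first, then release them all with a single fastRPC call.

```cpp
// Sketch of the batched free: ONE fastRPC round-trip for the whole buffer
// instead of one per tensor (the npu_device_tensors_free signature is assumed).
#include <cstdint>
#include <vector>

int npu_device_tensors_free(uint64_t device_handle, const uint64_t * tensor_handles, int count);

void clear_tensors_batched(uint64_t device_handle, std::vector<uint64_t> & tensor_handles) {
    if (tensor_handles.empty()) {
        return;                 // matches the --no-warmup case: nothing to free, ~0 ms
    }
    npu_device_tensors_free(device_handle, tensor_handles.data(), (int) tensor_handles.size());
    tensor_handles.clear();     // now purely host-side, essentially free
}
```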
And for the graph_compute cost, I think we can get a breakdown from the profiler output captured from logcat.
@chraac I'd like to ask: besides llama.cpp, what other inference frameworks support Qualcomm NPU deployment? Does Qualcomm itself officially use the QNN SDK and Hexagon SDK to deploy on the NPU? Do you know how their performance compares?
> what other inference frameworks support Qualcomm NPU deployment?
You can check out MLLM; they have their own model format and NPU support.
> Does Qualcomm itself officially use the QNN SDK and Hexagon SDK to deploy on the NPU? Do you know how their performance compares?
I'm not a Qualcomm employee, so I can't say for certain what they use internally. However, in large companies like this, there are typically multiple teams working on LLM inference acceleration using different approaches - some might use QNN while others use hexagon-npu directly.
They likely have access to HMX instructions that aren't publicly available, which could provide some performance benefits. That said, I think the real bottleneck is still memory bandwidth, so even with HMX, the gains might be modest.
Hey @finneyyan , it’s been a while! We’ve made several improvements to the hexagon-npu backend and are seeing significant performance gains. When you have a moment, could you test your case again? You can also try the changes in PR #59.