nullname

Results: 104 comments of nullname

You can add `--perf-log` when building. See this for reference: https://github.com/chraac/llama.cpp/wiki/How-to-Build

> The dequant time here shouldn't be included in your 30.237 ms, right?

No, it isn't included; those events are internal to QNN.

> I'm confused: since there is a convert feature, why does running a Q4_0 mulmat (q4_0 + fp32 -> fp32) report an error?

Where exactly does it fail? Also, by default quantized tensor support is disabled; there's a build-time switch that controls it. See the How-to-Build page I posted earlier.
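For reference, here's the shape of the op being discussed as a plain ggml sketch (CPU-side API only, the sizes are made up; whether the QNN backend accepts it depends on that build switch):

```cpp
// Minimal sketch of a q4_0 + fp32 -> fp32 mulmat using the ggml API.
// Only illustrates the operand types; not the QNN backend code path.
#include "ggml.h"

int main() {
    ggml_init_params params{};
    params.mem_size   = 16u * 1024 * 1024;  // scratch for this toy example
    params.mem_buffer = nullptr;
    params.no_alloc   = false;
    ggml_context * ctx = ggml_init(params);

    // weights: 4096 x 4096 in Q4_0, activations: a single 4096-wide FP32 column
    ggml_tensor * w = ggml_new_tensor_2d(ctx, GGML_TYPE_Q4_0, 4096, 4096);
    ggml_tensor * x = ggml_new_tensor_2d(ctx, GGML_TYPE_F32,  4096, 1);

    // mul_mat with a quantized first operand produces an FP32 result
    ggml_tensor * y = ggml_mul_mat(ctx, w, x);
    (void)y;

    ggml_free(ctx);
    return 0;
}
```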

Nice catch! Actually I haven't looked into the framework's implementation yet, but there's definitely an opportunity here.

> ggml_backend_sched_alloc_graph cost almost 700 ms per token!

Thought it shouldn't have such...
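For reference, one rough way to confirm a number like that is to bracket the call with ggml's own timer; `sched` and `graph` below are placeholders taken from the normal decode path, and `ggml_time_us` assumes the timer has been initialized (ggml_init does that):

```cpp
// Rough timing sketch around the call in question; `sched` and `graph`
// are assumed to come from the usual llama.cpp decode path.
#include "ggml.h"
#include "ggml-backend.h"
#include <cstdio>

void time_alloc(ggml_backend_sched_t sched, ggml_cgraph * graph) {
    const int64_t t0 = ggml_time_us();
    const bool ok = ggml_backend_sched_alloc_graph(sched, graph);
    const int64_t t1 = ggml_time_us();
    std::printf("ggml_backend_sched_alloc_graph: %s, %.3f ms\n",
                ok ? "ok" : "failed", (t1 - t0) / 1000.0);
}
```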

> Figure out the extra time in ggml_backend_sched_split_graph ('clear_tensors') and ggml_gallocr_init_tensor; they are way too high. But I know little about Hexagon and fastRPC...

For this one, thought it's about there're...

> Reuse the compute graph. Right now each token decode has to run build_graph, split_graph, and sched_alloc_graph.

Nice one! Thought that's worth a try, but the graph in ggml is quite dynamic,...
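To make it concrete, roughly the per-token flow on the scheduler side (a simplified sketch of the ggml-backend API, not the actual llama.cpp code path; as far as I can tell `ggml_backend_sched_alloc_graph` performs the split internally). Reusing the graph would mostly save the rebuild and re-split/re-alloc steps when the shapes don't change:

```cpp
// Simplified per-token flow with the ggml-backend scheduler API.
// In llama.cpp this is driven from the decode call; the sketch only names
// the steps being discussed, it is not the actual implementation.
#include "ggml-backend.h"

void decode_one_token(ggml_backend_sched_t sched, ggml_cgraph * graph) {
    // 1. build_graph: `graph` is rebuilt for the new token by the caller
    //    (shown here as an already-built graph).
    // 2. split + alloc: the scheduler splits the graph across backends and
    //    (re)allocates tensors for it.
    ggml_backend_sched_reset(sched);
    ggml_backend_sched_alloc_graph(sched, graph);
    // 3. compute
    ggml_backend_sched_graph_compute(sched, graph);
}
```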

> GEMV (mulmat with n=1) optimization.

Yeah, that's definitely a longer-term thing. If you check out the project backlog and my recent comment, you'll see I'm working on different strategies to...
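For context, the GEMV case is just the mulmat whose second operand has a single column; a minimal sketch of how a backend could detect it on a ggml node (a simplification, not the actual dispatch logic in this fork):

```cpp
#include "ggml.h"

// A mul_mat node degenerates to GEMV (matrix-vector product) when the
// second operand has a single column, i.e. n == 1 during token-by-token
// decoding. Simplified check, not the fork's real dispatch code.
static bool is_gemv(const ggml_tensor * node) {
    return node->op == GGML_OP_MUL_MAT &&
           node->src[1] != nullptr &&
           node->src[1]->ne[1] == 1;   // n (columns of src1) == 1
}
```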

Hi @finneyyan, want to ask something before another round of testing: for the benchmark, you're running with the release build, right?

> > Reusing the compute graph may be helpful. I see ggerganov has done something about this, but it's not merged yet ([ggml-org#14482](https://github.com/ggml-org/llama.cpp/pull/14482)). I'll try it.
>
> It has been merged today....

Hi @finneyyan, created another PR to fix the `clear_tensors` issue you mentioned before, can you have a look: https://github.com/chraac/llama.cpp/pull/52

> Why is it such a big improvement? `_tensors.clear()` needs to call fastRPC but deleting a tensor doesn't, right?

What I did here is just to reduce the RPC calls, from...
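Purely to illustrate why the per-tensor remote calls matter (every name below is hypothetical, the real change is in the PR linked above): tearing tensors down one by one costs one round-trip each, while handing the whole list over at once keeps the teardown cost flat:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical remote-free interface, only to illustrate the round-trip count;
// the actual fastRPC calls used by the backend are different.
struct remote_handle { int id; };

void rpc_free(remote_handle h) {                                // pretend: one round-trip per call
    std::printf("rpc_free(%d)\n", h.id);
}

void rpc_free_batch(const remote_handle * hs, std::size_t n) {  // pretend: one round-trip total
    (void)hs;
    std::printf("rpc_free_batch(%zu handles)\n", n);
}

// Before: releasing N tensors one by one costs N round-trips to the DSP.
void release_each(const std::vector<remote_handle> & handles) {
    for (const auto & h : handles) {
        rpc_free(h);
    }
}

// After: hand the whole list over in a single call, so the cost no longer
// scales with the number of tensors in the graph.
void release_batched(const std::vector<remote_handle> & handles) {
    rpc_free_batch(handles.data(), handles.size());
}

int main() {
    std::vector<remote_handle> handles = { {1}, {2}, {3} };
    release_each(handles);     // 3 simulated round-trips
    release_batched(handles);  // 1 simulated round-trip
    return 0;
}
```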