nullname

Results: 104 comments of nullname

You can add `--perf-log` when building. See this for reference: https://github.com/chraac/llama.cpp/wiki/How-to-Build

> The dequant time here shouldn't be included in your 30.237 ms, right?

No, it isn't included; those events are internal to QNN.

> I'm confused: since there is a convert feature, why does running a Q4_0 mulmat (q4_0 + fp32 -> fp32) report an error?

Where exactly does it fail? Also, by default quantized tensor support is disabled; there's a build-time switch that controls it. See the How-to-Build page I posted earlier.
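For reference, here's the shape of the op being discussed as a plain ggml sketch (CPU-side API only, the sizes are made up; whether the QNN backend accepts it depends on that build switch):

```cpp
// Minimal sketch of a q4_0 + fp32 -> fp32 mulmat using the ggml API.
// Only illustrates the operand types; not the QNN backend code path.
#include "ggml.h"

int main() {
    ggml_init_params params{};
    params.mem_size   = 16u * 1024 * 1024;  // scratch for this toy example
    params.mem_buffer = nullptr;
    params.no_alloc   = false;
    ggml_context * ctx = ggml_init(params);

    // weights: 4096 x 4096 in Q4_0, activations: a single 4096-wide FP32 column
    ggml_tensor * w = ggml_new_tensor_2d(ctx, GGML_TYPE_Q4_0, 4096, 4096);
    ggml_tensor * x = ggml_new_tensor_2d(ctx, GGML_TYPE_F32,  4096, 1);

    // mul_mat with a quantized first operand produces an FP32 result
    ggml_tensor * y = ggml_mul_mat(ctx, w, x);
    (void)y;

    ggml_free(ctx);
    return 0;
}
```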

Nice catch! Actually I haven't looked into the framework's implementation yet, but there's definitely an opportunity here.

> ggml_backend_sched_alloc_graph cost almost 700 ms per token!

Thought it shouldn't have such...
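For reference, one rough way to confirm a number like that is to bracket the call with ggml's own timer; `sched` and `graph` below are placeholders taken from the normal decode path, and `ggml_time_us` assumes the timer has been initialized (ggml_init does that):

```cpp
// Rough timing sketch around the call in question; `sched` and `graph`
// are assumed to come from the usual llama.cpp decode path.
#include "ggml.h"
#include "ggml-backend.h"
#include <cstdio>

void time_alloc(ggml_backend_sched_t sched, ggml_cgraph * graph) {
    const int64_t t0 = ggml_time_us();
    const bool ok = ggml_backend_sched_alloc_graph(sched, graph);
    const int64_t t1 = ggml_time_us();
    std::printf("ggml_backend_sched_alloc_graph: %s, %.3f ms\n",
                ok ? "ok" : "failed", (t1 - t0) / 1000.0);
}
```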

> Figure out the extra time in ggml_backend_sched_split_graph ('clear_tensors') and ggml_gallocr_init_tensor; they are way too high. But I know little about Hexagon and fastRPC...

For this one, thought it's about there're...

> Reuse the compute graph. Right now each token decode has to run build_graph, split_graph, and sched_alloc_graph.

Nice one! Thought that's worth a try, but the graph in ggml is quite dynamic,...
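To make it concrete, roughly the per-token flow on the scheduler side (a simplified sketch of the ggml-backend API, not the actual llama.cpp code path; as far as I can tell `ggml_backend_sched_alloc_graph` performs the split internally). Reusing the graph would mostly save the rebuild and re-split/re-alloc steps when the shapes don't change:

```cpp
// Simplified per-token flow with the ggml-backend scheduler API.
// In llama.cpp this is driven from the decode call; the sketch only names
// the steps being discussed, it is not the actual implementation.
#include "ggml-backend.h"

void decode_one_token(ggml_backend_sched_t sched, ggml_cgraph * graph) {
    // 1. build_graph: `graph` is rebuilt for the new token by the caller
    //    (shown here as an already-built graph).
    // 2. split + alloc: the scheduler splits the graph across backends and
    //    (re)allocates tensors for it.
    ggml_backend_sched_reset(sched);
    ggml_backend_sched_alloc_graph(sched, graph);
    // 3. compute
    ggml_backend_sched_graph_compute(sched, graph);
}
```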

> GEMV (mulmat with n=1) optimization.

Yeah, that's definitely a longer-term thing. If you check out the project backlog and my recent comment, you'll see I'm working on different strategies to...
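For context, the GEMV case is just the mulmat whose second operand has a single column; a minimal sketch of how a backend could detect it on a ggml node (a simplification, not the actual dispatch logic in this fork):

```cpp
#include "ggml.h"

// A mul_mat node degenerates to GEMV (matrix-vector product) when the
// second operand has a single column, i.e. n == 1 during token-by-token
// decoding. Simplified check, not the fork's real dispatch code.
static bool is_gemv(const ggml_tensor * node) {
    return node->op == GGML_OP_MUL_MAT &&
           node->src[1] != nullptr &&
           node->src[1]->ne[1] == 1;   // n (columns of src1) == 1
}
```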

Hi @finneyyan, want to ask something before another round of testing: for the benchmark, you're running with the release build, right?

> > Reusing the compute graph may be helpful. I see ggerganov has done something about this, but it's not merged yet ([ggml-org#14482](https://github.com/ggml-org/llama.cpp/pull/14482)). I'll try it.
>
> It has been merged today....

Hi @finneyyan, created another PR to fix the `clear_tensors` issue you mentioned before, can you have a look: https://github.com/chraac/llama.cpp/pull/52

> Why is it such a big improvement? `_tensors.clear()` needs to call fastRPC but deleting a tensor doesn't, right?

What I did here is just to reduce the RPC calls, from...
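Purely to illustrate why the per-tensor remote calls matter (every name below is hypothetical, the real change is in the PR linked above): tearing tensors down one by one costs one round-trip each, while handing the whole list over at once keeps the teardown cost flat:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical remote-free interface, only to illustrate the round-trip count;
// the actual fastRPC calls used by the backend are different.
struct remote_handle { int id; };

void rpc_free(remote_handle h) {                                // pretend: one round-trip per call
    std::printf("rpc_free(%d)\n", h.id);
}

void rpc_free_batch(const remote_handle * hs, std::size_t n) {  // pretend: one round-trip total
    (void)hs;
    std::printf("rpc_free_batch(%zu handles)\n", n);
}

// Before: releasing N tensors one by one costs N round-trips to the DSP.
void release_each(const std::vector<remote_handle> & handles) {
    for (const auto & h : handles) {
        rpc_free(h);
    }
}

// After: hand the whole list over in a single call, so the cost no longer
// scales with the number of tensors in the graph.
void release_batched(const std::vector<remote_handle> & handles) {
    rpc_free_batch(handles.data(), handles.size());
}

int main() {
    std::vector<remote_handle> handles = { {1}, {2}, {3} };
    release_each(handles);     // 3 simulated round-trips
    release_batched(handles);  // 1 simulated round-trip
    return 0;
}
```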