> How do you handle the QNN graph build-execute-free during inference? As we are also integrating the QNN in our framework, the graph building is time consuming and the memory...
> > > How do you handle the QNN graph build-execute-free during inference? As we are also integrating the QNN in our framework, the graph building is time consuming and...
> I tried to find out whether using the GPU/NPU/... can help with power consumption, both for prompt processing/fill (compute-bound) and token generation/decode (mainly memory-bandwidth-bound). See [my medium.com article](https://medium.com/@andreask_75652/power-consumption-of-our-ai-use-f2b1f9bce97b), where I analyzed...
Hi @The-going, regarding the sunxi PWM and GMAC drivers, I created a PR on your fork. Please have a look: https://github.com/The-going/armbian-build/pull/25
@igorpecovnik, info updated
Hi @hiwudery, sorry, Hexagon v68 is a bit old now; I'm focusing on v73 and newer architectures. If the public API differences between v68 and the newer toolchains are small, you...
Hi, for the performance issue you can follow this thread. QNN's convert op does indeed perform rather poorly: https://github.com/chraac/llama.cpp/issues/34#issuecomment-2708050770
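> I noticed that the bind_tensors function copies data from ggml_tensor.data to qnn_rpc_buffer, but because should_use_mem_handle is always false, that copy never actually happens. So does qnn-npu have the SDK copy the data internally every time it is used? Or does it share memory with the CPU?

Nice catch! The RPC buffer is disabled here because using an RPC buffer only inside each tensor inevitably adds one extra memcpy, whereas handing the ggml tensor's data directly to QNN may let it pick a better strategy. I had also considered letting the backend buffer manage the rpc_buffer, but that would put multiple tensors in a single buffer, which does not seem to be possible in QNN. That approach is implemented in `hexagon-npu`, though, so in theory it is more efficient there.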
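To make the trade-off concrete, here is a minimal sketch of the two options: staging each tensor into its own buffer (one extra memcpy per bind) versus allocating several tensors inside one shared buffer up front. None of this is QNN, ggml, or `hexagon-npu` code; `shared_buffer`, `tensor_view`, `bind_with_staging` and `bind_in_place` are hypothetical names, and a plain `std::vector` stands in for the RPC-visible memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for one shared (e.g. RPC-visible) allocation.
// Not a QNN or ggml type.
struct shared_buffer {
    std::vector<uint8_t> storage;
    size_t               used = 0;

    explicit shared_buffer(size_t size) : storage(size) {}

    // Hand out a slice of the buffer, or nullptr when it does not fit.
    // Offsets are aligned relative to the start of the buffer.
    uint8_t * alloc(size_t size, size_t alignment = 128) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);  // power of two
        const size_t offset = (used + alignment - 1) & ~(alignment - 1);
        if (offset + size > storage.size()) {
            return nullptr;
        }
        used = offset + size;
        return storage.data() + offset;
    }
};

// Hypothetical tensor view: just a pointer and a byte size.
struct tensor_view {
    uint8_t * data;
    size_t    nbytes;
};

// Option A: the tensor lives in ordinary host memory and is staged into its
// own per-tensor buffer before execution, so every bind costs one memcpy.
inline size_t bind_with_staging(const tensor_view & src, std::vector<uint8_t> & staging) {
    staging.resize(src.nbytes);
    std::memcpy(staging.data(), src.data, src.nbytes);
    return src.nbytes;  // bytes copied
}

// Option B: the tensor was allocated inside the shared buffer up front, so
// binding only passes the pointer along and copies nothing.
inline size_t bind_in_place(const tensor_view &) {
    return 0;  // bytes copied
}
```

With option B, several tensors end up as offsets into the same allocation, which is the "multiple tensors in one buffer" layout described above.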
You can look at the event log that QNN prints internally. This basically rules out other factors; it is purely the performance of the QNN graph itself. On 8 Gen 2:

```log
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]print_profile_events start ----------------
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[0]: Number of HVX threads used, count: 4
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[1]: RPC (execute) time, duration: 29.409 ms
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[2]: QNN accelerator (execute) time, duration: 25.280 ms
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[3]:...
```
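For comparing runs, a small helper can pull the `duration: ... ms` values out of log lines like these. This is only a sketch: `parse_duration_ms` is a hypothetical helper, not part of QNN or llama.cpp, and it assumes the `duration: <number> ms` suffix seen in the log above. For the two events shown, the difference is 29.409 - 25.280 = 4.129 ms spent outside the accelerator, presumably RPC overhead.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <optional>
#include <string>

// Extract the millisecond value from a line containing "duration: <number> ms".
// Returns std::nullopt for lines without a parsable duration
// (for example the "Number of HVX threads used, count: 4" event).
inline std::optional<double> parse_duration_ms(const std::string & line) {
    static const std::string key = "duration: ";
    const size_t pos = line.find(key);
    if (pos == std::string::npos) {
        return std::nullopt;
    }
    try {
        // std::stod stops at the first character that is not part of the number.
        return std::stod(line.substr(pos + key.size()));
    } catch (const std::exception &) {
        return std::nullopt;
    }
}
```

Feeding it the `event[1]` and `event[2]` lines and subtracting the results gives the per-execute gap between RPC time and accelerator time.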
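> I took another look at QNN's .alloc_buffer and found that it does not actually allocate any NPU memory, which may be why my attempt above failed. From what I found, the NPU uses VTCM rather than sharing DDR with the CPU, so is all NPU memory management done inside the SDK?

You can take a look at the memory section of its programming reference: https://docs.qualcomm.com/bundle/publicresource/topics/80-N2040-61/memory.html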