> The prefilling stage is performed by QNN & CPU, and the inference stage is performed by the CPU. Interestingly, during the prefill stage we typically deal with larger tensors, which...
Hi @Gianthard-cyh, we've recently added a new `hexagon-npu` backend that is completely independent of QNN; parts of its source code run inside QuRT and can manipulate the HVX registers directly, similar to...
> [@chraac](https://github.com/chraac) [@Gianthard-cyh](https://github.com/Gianthard-cyh) I tested the QNN backend on Snapdragon 8 Gen 4 and found that bind_tensor accounts for 84% of the time (46500ms per decode), while qnn_graph->execute uses 14%...
Hi @Dantetang @Gianthard-cyh @cm4ker, sorry to bother you! We've applied some optimizations to the hexagon-npu backend so it now utilizes HVX instructions. While it's still slower than the CPU backend, the performance...
> I successfully built llama with QNN support, but when I try to compile Hexagon support, it seems the SDK I downloaded (the `HEXAGON_SDK_ROOT` env var) is wrong. Can you point me...
Created a discussion for the Windows build instructions of the `hexagon-npu` backend here: [#44](https://github.com/chraac/llama.cpp/discussions/44)
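In the meantime, a quick note on the env var itself: `HEXAGON_SDK_ROOT` just needs to point at the root of the Hexagon SDK install from Qualcomm Package Manager. The path and version below are only an illustration, not taken from the build instructions:

```sh
# illustrative only -- substitute your actual install location and SDK version
export HEXAGON_SDK_ROOT="$HOME/Qualcomm/Hexagon_SDK/6.0.0.2"
ls "$HEXAGON_SDK_ROOT/tools"   # sanity check: the tools/ directory should exist
```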
> Hi, I'm curious about the current matmul implementation. I've noticed that implementations like ExecuTorch and PowerServe convert matmul into a QNN convolution to achieve the desired performance. However, if...
Oh, do you mind building it with the official `ndk-toolchain` on your host machine? That's the official way Google suggests, and its clang is relatively new and supports the newer C/C++...
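Roughly along these lines — not a verbatim recipe, just a sketch of configuring with the NDK's CMake toolchain file; the NDK path and the QNN CMake switch below are assumptions, so check this fork's build docs for the exact option names:

```sh
# sketch: cross-compile for Android with the official NDK toolchain file
# (ANDROID_NDK path and the -DGGML_QNN switch are assumptions -- verify
# the exact CMake options against this fork's documentation)
export ANDROID_NDK="$HOME/android-ndk-r26d"
cmake -B build-android \
    -DCMAKE_TOOLCHAIN_FILE="$ANDROID_NDK/build/cmake/android.toolchain.cmake" \
    -DANDROID_ABI=arm64-v8a \
    -DANDROID_PLATFORM=android-31 \
    -DGGML_QNN=ON
cmake --build build-android -j
```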
Actually, you have to push the QNN-related dynamic libraries to your phone before running; a complete list of those libs can be found here: https://github.com/chraac/llama-cpp-qnn-builder/blob/dd7ba303a8e3213c8cafe330c0938b25c6bd788f/docker/build_in_container.sh#L83
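For example, something like this (library names, the hexagon-vXX directory, and paths are illustrative — the full list is in the script linked above):

```sh
# push the QNN runtime libs and the binary to the device
# (library names and the hexagon-vXX directory depend on your SoC and QNN SDK version)
adb push "$QNN_SDK_ROOT/lib/aarch64-android/libQnnSystem.so" /data/local/tmp/
adb push "$QNN_SDK_ROOT/lib/aarch64-android/libQnnHtp.so" /data/local/tmp/
adb push "$QNN_SDK_ROOT/lib/hexagon-v75/unsigned/libQnnHtpV75Skel.so" /data/local/tmp/
adb push build-android/bin/llama-cli /data/local/tmp/
```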
Maybe you can try overriding the `LD_LIBRARY_PATH` env var.
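Roughly like this (binary and model names are just placeholders):

```sh
# make the dynamic loader pick up the pushed .so files from the same directory
adb shell "cd /data/local/tmp && LD_LIBRARY_PATH=/data/local/tmp ./llama-cli -m model.gguf -p 'hello'"
```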