finneyyan

Results: 23 comments by finneyyan

> Created another PR to fix the clear_tensors issue you mentioned before; can you have a look: https://github.com/chraac/llama.cpp/pull/52

It works. The clear_tensor time: 269.318 ms --(enable FA)--> 198.5 ms --(new PR)--> 3.4 ms...

For now, with FA enabled + graph reuse enabled + your new PR, the `decode` perf is: 1. Run `split-graph` only on the first decode; it takes about 314...

@chraac I'd like to ask: besides llama.cpp, what other inference frameworks support Qualcomm NPU deployment? Does Qualcomm itself officially use the `QNN-SDK` and `Hexagon-SDK` to deploy on the NPU? Do you know...