Regarding your change here:

```diff
+ precision_config.precision = QNN_PRECISION_FLOAT16;
```

Did you test it on an F32 model? I'm curious whether we can always force the precision to F16 here.
> Regarding your change here:
>
> `+ precision_config.precision = QNN_PRECISION_FLOAT16;`
>
> did you test it on F32 model? ahh, curious about whether we can force the precision here to...
Feel free to try [my script](https://github.com/chraac/llama-cpp-qnn-builder/blob/main/docker/docker_compose_run_test.sh) for a quick prototype verification. It can run the QNN backend through Qualcomm's NPU emulator.

```bash
./llama-cpp-qnn-builder/docker/docker_compose_compile.sh -r -d --print-build-time --build-linux-x64 --run-tests
```
One thing I forgot to mention yesterday: it looks like the `convert` op is horribly slow on the NPU:

> Unfortunately we discovered that the conversion operations as implemented on the NPU were extremely...
> I've set up a very simple profiler. It shows that graph finalization has an expensive overhead: it takes ~6.42x the time of execution.

Yeah, nice! In the current codebase,...
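For anyone who wants to reproduce a finalization-vs-execution comparison like the one quoted above, here is a minimal sketch of a per-phase wall-clock profiler. `PhaseProfiler` and the phase names are hypothetical helpers made up for illustration. They are not part of the QNN SDK or of either fork, and the callables you pass in would wrap whatever finalize/execute calls your backend actually makes.

```cpp
#include <chrono>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical helper (not from the QNN SDK): accumulates wall-clock time per named phase.
class PhaseProfiler {
public:
    // Runs `fn` once and adds its wall-clock duration to the phase `name`.
    void measure(const std::string &name, const std::function<void()> &fn) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto end = std::chrono::steady_clock::now();
        totals_ns_[name] +=
            std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    }

    // Total nanoseconds recorded for `name` (0 if the phase was never measured).
    int64_t total_ns(const std::string &name) const {
        const auto it = totals_ns_.find(name);
        return it == totals_ns_.end() ? 0 : it->second;
    }

    // Time spent in phase `a` divided by time spent in phase `b`.
    // Returns 0 if `b` has no recorded time, to avoid dividing by zero.
    double ratio(const std::string &a, const std::string &b) const {
        const int64_t denom = total_ns(b);
        return denom == 0 ? 0.0
                          : static_cast<double>(total_ns(a)) / static_cast<double>(denom);
    }

private:
    std::unordered_map<std::string, int64_t> totals_ns_;
};
```

Usage would look roughly like `profiler.measure("finalize", [&] { /* graph finalize call */ });` followed by `profiler.measure("execute", [&] { /* graph execute call */ });`, after which `profiler.ratio("finalize", "execute")` gives the overhead factor. This only measures host-side wall-clock time, so it will not match on-device cycle counts.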
> Surprisingly, the convert operator takes 0 cycles. This might be related to the F16 setting I'm using. I'll try F32 later to check. Also noticed the transpose op takes...
From the Hexagon block diagram here (found in this article: [Qualcomm’s Hexagon DSP, and now, NPU -- Chips and Cheese](https://chipsandcheese.com/p/qualcomms-hexagon-dsp-and-now-npu)), it appears that there's a TCM inside with only 8MB...
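To get a feel for how tight 8MB is, here is a rough back-of-the-envelope sketch. It assumes the 8MB figure from the diagram is the usable TCM capacity (treated as 8 MiB), and the matrix shapes are hypothetical examples rather than sizes taken from any specific model. It also ignores any other buffers that would share the TCM.

```cpp
#include <cstddef>

// Assumed TCM capacity: the 8MB figure from the block diagram, treated as 8 MiB.
constexpr std::size_t kTcmBytes = 8u * 1024u * 1024u;

// Bytes needed for a dense rows x cols matrix with the given element size.
constexpr std::size_t matrix_bytes(std::size_t rows, std::size_t cols, std::size_t elem_size) {
    return rows * cols * elem_size;
}

// True if the whole matrix fits in TCM at once (ignores anything else stored there).
constexpr bool fits_in_tcm(std::size_t rows, std::size_t cols, std::size_t elem_size) {
    return matrix_bytes(rows, cols, elem_size) <= kTcmBytes;
}

// Hypothetical 4096 x 4096 weight matrix in F16 (2 bytes per element): 32 MiB.
static_assert(matrix_bytes(4096, 4096, 2) == 32u * 1024u * 1024u, "F16 matrix is 32 MiB");
static_assert(!fits_in_tcm(4096, 4096, 2), "32 MiB does not fit in 8 MiB");

// A 2048 x 2048 F16 tile is exactly 8 MiB, so it only just fits; the F32 version does not.
static_assert(fits_in_tcm(2048, 2048, 2), "8 MiB tile fits");
static_assert(!fits_in_tcm(2048, 2048, 4), "16 MiB tile does not fit");
```

So under these assumptions a single large weight matrix would have to be tiled to be processed out of TCM, which may be part of why data movement shows up in the profiles.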
Update: I integrated the QNN NPU event tracing function into my fork and ran a test on my 8 Gen 2 device. The results are listed below; we can have a further look event...
Hi @Gianthard-cyh, how are things going on your side? I added this issue to the [project backlog](https://github.com/users/chraac/projects/2/views/3) and will take a more detailed look when I have some time available.
> I've discovered a repo that implements all the ops needed in LLMs: https://github.com/UbiquitousLearning/mllm/tree/main. About memory profiling, I didn't find the corresponding API.

Nice one! Did you test the performance on...