nullname comments

Results: 104 comments by nullname

[bug] the NPU backend achives around 1/3 performance of CPU

Regarding your change here:

```diff
+ precision_config.precision = QNN_PRECISION_FLOAT16;
```

Did you test it on an F32 model? I'm curious whether we can always force the precision to F16 here.

[bug] the NPU backend achives around 1/3 performance of CPU

> Regarding your change here:
>
> `+ precision_config.precision = QNN_PRECISION_FLOAT16;`
>
> did you test it on F32 model? ahh, curious about whether we can force the precision here to...

[bug] the NPU backend achives around 1/3 performance of CPU

Feel free to try [my script](https://github.com/chraac/llama-cpp-qnn-builder/blob/main/docker/docker_compose_run_test.sh) for quick prototype verification. It can run the QNN backend through Qualcomm's NPU emulator.

```bash
./llama-cpp-qnn-builder/docker/docker_compose_compile.sh -r -d --print-build-time --build-linux-x64 --run-tests
```

[bug] the NPU backend achives around 1/3 performance of CPU

One thing I forgot to mention yesterday: it looks like the `convert` op was horribly slow on the NPU:

> Unfortunately we discovered that the conversion operations as implemented on the NPU were extremely...

[bug] the NPU backend achives around 1/3 performance of CPU

> I've set up a very simple profiler. It shows that graph finalization has an expensive overhead. It takes ~6.42x as long as execution.

Yeah, nice! In the current codebase,...
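A measurement like the one quoted above can be sketched with a scoped wall-clock timer that accumulates time per named phase and reports the ratio between two phases. This is only an illustration, not the profiler used in the thread: `PhaseProfiler`, its `Scope` helper, and the phase names are hypothetical, and `finalize_graph`/`execute_graph` in the usage comment are placeholders rather than real QNN or llama.cpp calls.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Hypothetical minimal profiler: accumulates wall-clock time per named phase.
class PhaseProfiler {
public:
    using clock = std::chrono::steady_clock;

    // RAII scope: adds the elapsed time to the named phase when it is destroyed.
    class Scope {
    public:
        Scope(PhaseProfiler &profiler, std::string name)
            : profiler_(profiler), name_(std::move(name)), start_(clock::now()) {}
        ~Scope() { profiler_.add(name_, clock::now() - start_); }
        Scope(const Scope &) = delete;
        Scope &operator=(const Scope &) = delete;

    private:
        PhaseProfiler &profiler_;
        std::string name_;
        clock::time_point start_;
    };

    // Adds a duration to the running total of a phase.
    void add(const std::string &name, clock::duration d) { totals_[name] += d; }

    // Total time spent in a phase, in microseconds (0 if never timed).
    double microseconds(const std::string &name) const {
        auto it = totals_.find(name);
        if (it == totals_.end()) return 0.0;
        return std::chrono::duration<double, std::micro>(it->second).count();
    }

    // Ratio of time in phase `a` to time in phase `b`; 0 if `b` has no time.
    double ratio(const std::string &a, const std::string &b) const {
        const double tb = microseconds(b);
        return tb > 0.0 ? microseconds(a) / tb : 0.0;
    }

private:
    std::map<std::string, clock::duration> totals_;
};

// Usage sketch (finalize_graph/execute_graph are placeholders, not real APIs):
//   PhaseProfiler prof;
//   { PhaseProfiler::Scope s(prof, "finalize"); finalize_graph(); }
//   { PhaseProfiler::Scope s(prof, "execute");  execute_graph();  }
//   double r = prof.ratio("finalize", "execute");  // e.g. ~6.42 in the quote
```

Accumulating per-phase totals rather than printing each sample keeps the timer cheap enough to leave enabled across many graph runs, so one-time costs such as finalization can be compared against repeated execution.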

[bug] the NPU backend achives around 1/3 performance of CPU

> Surprisingly, the convert operator takes 0 cycles. This might be related to the F16 setting I'm using. I'll try F32 later to check. Also noticed the transpose op takes...

[bug] the NPU backend achives around 1/3 performance of CPU

From the Hexagon block diagram here (found in this article: [Qualcomm’s Hexagon DSP, and now, NPU -- Chips and Cheese](https://chipsandcheese.com/p/qualcomms-hexagon-dsp-and-now-npu)):

![Image](https://github.com/user-attachments/assets/04f6e902-5c41-4f97-9a64-419f5cac95ee)

it appears that there's a TCM inside with only 8MB...