llama.cpp [qnn][bug] FP16 matmul 分配到 qnn

Name and Version

测试设备：骁龙8gen3 测试模型：qwen2.5-1.5b-instruct-fp16.gguf

Operating systems

Android

GGML backends

QNN

Hardware

骁龙8gen3

Models

No response

Problem description & steps to reproduce

同标题（分配到qnn_gpu上时可正常推理）

First Bad Commit

No response

Relevant log output

[qnn-npu][MUL_MATf32_1536x8960f32_1536x42f32#MUL(SILU,MUL_MAT)#MUL_MAT(NONE,MUL)#ADD(MUL_MAT,ADD)f32_1536x42f32][execute]error: QNN_GRAPH_ERROR_INVALID_HANDLE
graph_compute: ggml_backend_sched_graph_compute_async failed with error -1
process_ubatch: failed to compute graph, compute status: -1
decode: removing KV cache entries for seq_id = 0, pos = [0, +inf)
llama_decode: failed to decode, ret = -3
main : failed to eval
idx 1, name:qnn-gpu
idx 2, name:qnn-npu
FORTIFY: pthread_mutex_destroy called on a destroyed mutex (0xb4000070ae23a450)
Aborted (core dumped)

Jun 24 '25 02:06 finneyyan

原因可能在于规格不匹配。查询手册发现NPU支持的mulmat算子类型需要输入、权重、输出全部为fp16，而我只有权重是fp16，输入和输出都是fp32. 这两天我会跟进这个问题，有进展会在下面贴出

Jun 30 '25 06:06 finneyyan

原因可能在于规格不匹配。查询手册发现NPU支持的mulmat算子类型需要输入、权重、输出全部为fp16，而我只有权重是fp16，输入和输出都是fp32. 这两天我会跟进这个问题，有进展会在下面贴出

不确定是不是这个原因了，因为我发现当前qnn convert函数中涉及到类型的判断和转换，对于我的算子场景 fp16 + fp32 -> fp32，是有符合的规则会自动将src0的fp16转换成fp32. 如果convert函数没问题，似乎应该是能运行的？

我在mulmat算子前后各插入一个数据类型转换算子，使得mulmat规格调整为 fp16 + fp16 -> fp16，此时可以正常运行
观察到 bind_tensors 函数是将数据从 ggml_tensor.data 拷贝到 qnn_rpc_buffer，但由于 should_use_mem_handle 始终为false，实际并未完成这一步拷贝。那么qnn-npu每次使用数据都要在sdk内部进行自动拷贝吗？还是说它和CPU共用内存？
我将mulmat的 fp16 weight 放到CPU内存和QNN内存作对比（分别走CPU的.alloc_buffer和QNN的.alloc_buffer），发现二者最终的性能完全一致，这似乎不符合直觉（和上一个问题可能类似）

我在mulmat算子前后各插入一个数据类型转换算子，使得mulmat规格调整为 fp16 + fp16 -> fp16，此时可以正常运行

但性能非常差，大约只有CPU的十分之一。看到此前社区中有提到 convert 和 bind_tensors 耗时很长，但我的算子场景中并不是，绝大多数都是execute本身的耗时。下图是其中两次qnn_graph计算的各阶段统计：（全fp16 此时不走convert）

Jul 02 '25 11:07 finneyyan

但性能非常差，大约只有CPU的十分之一。看到此前社区中有提到 convert 和 bind_tensors 耗时很长，但我的算子场景中并不是，绝大多数都是execute本身的耗时

如果这个测试没有问题，我很费解为什么execute耗时这么长，高通官方不是标称它们的NPU性能非常强吗？这中间是差在哪里，还能怎么优化呢？

Jul 02 '25 11:07 finneyyan

你好，性能问题可以follow下这个thread哈，qnn的convert确实性能比较差 https://github.com/chraac/llama.cpp/issues/34#issuecomment-2708050770

Jul 02 '25 13:07 chraac

观察到 bind_tensors 函数是将数据从 ggml_tensor.data 拷贝到 qnn_rpc_buffer，但由于 should_use_mem_handle 始终为false，实际并未完成这一步拷贝。那么qnn-npu每次使用数据都要在sdk内部进行自动拷贝吗？还是说它和CPU共用内存？

Nice catch! 这里禁用这个rpc buffer的原因是，只在每个tensor里面使用rpc buffer，会无可避免的多一次memcpy，而如果直接把ggml tensor的data直接给qnn，有可能他会有更优的解决方案。

之前还设想过，如果把rpc_buffer给backend buffer管理，但是这个方案会导致一个buffer里面有多个tensor，这种方式好像在qnn里面没办法实现，不过这种方式在 hexagon-npu 里面实现了，所以理论上那里更高效

Jul 02 '25 17:07 chraac

你好，性能问题可以follow下这个thread哈，qnn的convert确实性能比较差

疑问就在这里，我实测发现自己的场景下 convert 和 bind_tensors 耗时并不是很长，绝大多数都是execute本身的耗时。如果我的测试没问题，这么差的耗时和官方标称性能主要是差在哪里呢（这是我最关心的问题：NPU为什么会这么慢，是本身就慢、llama.cpp框架引入的额外耗时、还是使用配置上的问题）？我倒是看到了一些可选配置，但因为不熟悉所以也无法判断是否可以通过调整配置来优化性能

if (device == QNN_BACKEND_NPU) {
        // TODO: fix graph config here for NPU
        std::vector<const QnnGraph_Config_t *> graph_configs;

        auto hvx_config = make_graph_config(&kDefaultHvxConfig);
        graph_configs.push_back(&hvx_config);

        auto dlbc_config = make_graph_config(&kDefaultDlbcConfig);
        graph_configs.push_back(&dlbc_config);

        auto opt_config = make_graph_config(&kDefaultOptConfig);
        graph_configs.push_back(&opt_config);

        auto vctm_sub_config = make_vtcm_config(vtcm_size_in_mb);
        auto vtcm_config     = make_graph_config(&vctm_sub_config);
        graph_configs.push_back(&vtcm_config);

        if (precision == qnn_graph::kHtpFp16) {
            auto precision_config = make_graph_config(&kHtpPrecisionConfigF16);
            graph_configs.push_back(&precision_config);
            QNN_LOG_DEBUG("[%s][%s]set precision to F16\n", get_backend_name(device), graph_name.c_str());
        }

        graph_configs.push_back(nullptr);
        error = qnn_interface->qnn_graph_create(qnn_context, graph_name.c_str(), graph_configs.data(), &graph_handle);
}

Jul 03 '25 03:07 finneyyan

这里禁用这个rpc buffer的原因是，只在每个tensor里面使用rpc buffer，会无可避免的多一次memcpy，而如果直接把ggml tensor的data直接给qnn，有可能他会有更优的解决方案。

您是说把数据地址送到qnn sdk中，让其自行处理吗？这种方式会比手动拷贝到npu内存更好吗？

我将mulmat的 fp16 weight 放到CPU内存和QNN内存作对比（分别走CPU的.alloc_buffer和QNN的.alloc_buffer），发现二者最终的性能完全一致，这似乎不符合直觉

我又看了下qnn的.alloc_buffer，发现里面实际并没有分配npu内存，这可能是我上述尝试失败的原因。我查询到npu使用的内存是VCTM而非和CPU共用DDR，所以npu的内存管理都是在SDK内部进行的吗？

Jul 03 '25 04:07 finneyyan

可以看下qnn内部打印的event的log，这里基本上排除了其他的因素，单纯就是他qnn graph的性能，8gen2下：

[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]print_profile_events start ----------------
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[0]: Number of HVX threads used, count: 4
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[1]: RPC (execute) time, duration: 29.409 ms
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[2]: QNN accelerator (execute) time, duration: 25.280 ms
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[3]: Num times yield occured, count: 0
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[4]: Time for initial VTCM acquire, duration: 0.839 ms
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[5]: Time for HVX + HMX power on and acquire, duration: 1.121 ms
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[6]: Accelerator (execute) time (cycles), sub_count: 3, start -------------
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]sub_event[0]: Input OpId_2 (cycles), cycles: 340526
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]sub_event[1]: Vcur-2:OpId_45 (cycles), cycles: 35690645
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]sub_event[2]: Output OpId_3 (cycles), cycles: 10642
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[6]: Accelerator (execute) time (cycles), cycles: 36041813, end --------------
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[7]: Accelerator (execute) time, duration: 25.070 ms
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[8]: Accelerator (execute excluding wait) time, duration: 24.541 ms
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[9]: Unknown event type, 
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]event[10]: QNN (execute) time, duration: 30.237 ms
[profiler][MUL_MATf32_2048x512q4_K_2048x2f32f16_1024f16]print_profile_events end -----------------

这里q4k会cpu dequant，然后变成F32给到qnn，可以看到光跑个 2048x512, 2048x2 都耗时 30.237 ms，这里可以简单得出结论qnn这个框架的额外开销确实不小因为之前测试的 hexagon-npu，只用hvx的intrinsics，跑 16384×16384, 16384×1 也就 ~143ms

ref: the NPU backend achives around 1/3 performance of CPU Results Summary

Jul 03 '25 04:07 chraac

我又看了下qnn的.alloc_buffer，发现里面实际并没有分配npu内存，这可能是我上述尝试失败的原因。我查询到npu使用的内存是VCTM而非和CPU共用DDR，所以npu的内存管理都是在SDK内部进行的吗？

可以看下他programming reference的memory部分 https://docs.qualcomm.com/bundle/publicresource/topics/80-N2040-61/memory.html

Jul 03 '25 04:07 chraac

可以看下qnn内部打印的event的log

请问这个是怎么开启的，我没有看到这些log信息

Jul 03 '25 07:07 finneyyan

可以编译的时候加--perf-log

参考这个： https://github.com/chraac/llama.cpp/wiki/How-to-Build

Jul 03 '25 07:07 chraac

我的耗时数据比您的更夸张，(1536x1536)(15361)的f16 mulmat算子，没有dequant环节 QNN accelerator (execute) time, duration: 2.608 ms QNN (execute) time, duration: 5.268 ms 中间的gap全是RPC通信耗时和QNN框架耗时，这合理吗

[profiler][MUL_MATf16_1536x1536f16_1536f16]print_profile_events start ----------------
[profiler][MUL_MATf16_1536x1536f16_1536f16]event[0]: Number of HVX threads used, count: 4
[profiler][MUL_MATf16_1536x1536f16_1536f16]event[1]: RPC (execute) time, duration: 4.893 ms
[profiler][MUL_MATf16_1536x1536f16_1536f16]event[2]: QNN accelerator (execute) time, duration: 2.608 ms
[profiler][MUL_MATf16_1536x1536f16_1536f16]event[3]: Num times yield occured, count: 0
[profiler][MUL_MATf16_1536x1536f16_1536f16]event[4]: Time for initial VTCM acquire, duration: 0.954 ms
[profiler][MUL_MATf16_1536x1536f16_1536f16]event[5]: Time for HVX + HMX power on and acquire, duration: 1.062 ms
[profiler][MUL_MATf16_1536x1536f16_1536f16]event[6]: Accelerator (execute) time (cycles), sub_count: 3, start -------------
[profiler][MUL_MATf16_1536x1536f16_1536f16]sub_event[0]: Input OpId_2 (cycles), cycles: 190854
[profiler][MUL_MATf16_1536x1536f16_1536f16]sub_event[1]: node_4:OpId_64 (cycles), cycles: 2240366
[profiler][MUL_MATf16_1536x1536f16_1536f16]sub_event[2]: Output OpId_3 (cycles), cycles: 4395
[profiler][MUL_MATf16_1536x1536f16_1536f16]event[6]: Accelerator (execute) time (cycles), cycles: 2435615, end --------------
[profiler][MUL_MATf16_1536x1536f16_1536f16]event[7]: Accelerator (execute) time, duration: 0.874 ms
[profiler][MUL_MATf16_1536x1536f16_1536f16]event[8]: Accelerator (execute excluding wait) time, duration: 0.413 ms
[profiler][MUL_MATf16_1536x1536f16_1536f16]event[9]: Unknown event type, 
[profiler][MUL_MATf16_1536x1536f16_1536f16]event[10]: QNN (execute) time, duration: 5.268 ms
[profiler][MUL_MATf16_1536x1536f16_1536f16]print_profile_events end -----------------

这里q4k会cpu dequant，然后变成F32给到qnn，可以看到光跑个 2048x512, 2048x2 都耗时 30.237 ms，这里可以简单得出结论qnn这个框架的额外开销确实不小

您这里的dequant耗时应该没包含到30.237ms里吧？是在graph_execute前的convert函数中完成的吗？我很奇怪明明有convert这个功能，为什么我运行Q4_0 mulmat(q4_0 + fp32 -> fp32)会报错

Jul 04 '25 02:07 finneyyan

您这里的dequant耗时应该没包含到30.237ms里吧？

不包括，这些event是QNN内部的

我很奇怪明明有convert这个功能，为什么我运行Q4_0 mulmat(q4_0 + fp32 -> fp32)会报错

出错的地方在哪？对了默认情况下quantized tensor的支持是屏蔽掉的，编译的时候有开关控制，可以看看我之前贴的 how to build

Jul 04 '25 02:07 chraac

[qnn][bug] FP16 matmul 分配到 qnn_npu 上运行时推理崩溃

Name and Version

Operating systems

GGML backends

Hardware

Models

Problem description & steps to reproduce

First Bad Commit

Relevant log output