llama.cpp
ggml-qnn: add Qualcomm QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) backend
Purpose
Android maintained its position as the leading mobile operating system worldwide in the fourth quarter of 2023 with a market share of 70.1 percent, and Qualcomm is currently the leading mobile SoC semiconductor company.
The QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) SDK is verified to work with the following versions of the ML frameworks:
- TensorFlow: tf-1.15.0, or tf-2.10.1
- TFLite: tflite-2.3.0
- PyTorch: torch-1.13.1
- ONNX: onnx-1.11.0
Since ggml is a very compact, well-designed, highly optimized, and high-performance C/C++ machine learning framework/library, this PR aims to add a Qualcomm QNN backend to it.
Status
The data path works as expected with whisper.cpp and llama.cpp using the QNN backend, verified on both low-end and high-end Android phones based on Qualcomm mobile SoCs.
A 4x performance gain was measured for GGML_OP_MUL_MAT using the QNN CPU backend with 1 thread on a high-end Android phone with a Qualcomm mobile SoC (Xiaomi 14). The performance of GGML_OP_MUL_MAT should improve much further with the QNN HTP (aka DSP) backend once the details of Qualcomm's Hexagon NPU (aka HTP or DSP) cluster (QNN RPC, multithreading in the HTP backend, and so on) are better understood. The GGML community could do this if this PR is accepted.
Thanks to the well-designed and well-implemented test-backend-ops.cpp, it works as expected on a Xiaomi 14 (Qualcomm SM8650-AB Snapdragon 8 Gen 3, 4 nm).
Todo
The QNN backend has several todo items that will hopefully be completed by the GGML community after this PR is accepted:
- Only FP32 / FP16 are supported, and the input and output tensors must be of the same data type (see the sketch after this list).
- Other GGML ops are not yet implemented with the QNN API; this work is very similar to GGML_OP_ADD / GGML_OP_MUL / GGML_OP_MUL_MAT in ggml-qnn.cpp.
- Multithreading does not work with the QNN GPU and HTP (aka DSP) backends.
- QNN's RPC feature (useful for the QNN HTP/DSP backend) is not used yet.
- Multiple QNN backends (CPU/GPU/DSP) cannot be used simultaneously.
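As an illustration of the first limitation, here is a minimal sketch (an assumption, not the PR's exact code, although the log output later in this thread shows a ggml_qnn_can_handle_op with the same purpose) of the kind of check that keeps unsupported ops and data types away from the QNN API:

```cpp
// Sketch only: restrict the QNN backend to ADD / MUL / MUL_MAT on FP32/FP16
// tensors whose operands all share the same data type; everything else is left
// to the default (CPU) backend.
#include "ggml.h"

static bool ggml_qnn_can_handle_op_sketch(const struct ggml_tensor * op) {
    if (op->op != GGML_OP_ADD && op->op != GGML_OP_MUL && op->op != GGML_OP_MUL_MAT) {
        return false;
    }
    const struct ggml_tensor * src0 = op->src[0];
    const struct ggml_tensor * src1 = op->src[1];
    if (src0 == nullptr || src1 == nullptr) {
        return false;
    }
    // only FP32 / FP16, and all operands must share the same data type
    const bool type_ok = op->type == GGML_TYPE_F32 || op->type == GGML_TYPE_F16;
    return type_ok && src0->type == op->type && src1->type == op->type;
}
```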
How to verify the QNN backend or participate in its development
Thanks to the well-designed and well-implemented test-backend-ops.cpp, the recommended method is to use the dedicated scripts in the tests/ggml-qnn directory to verify/validate the QNN backend on a Qualcomm SoC based Android phone.
Alternatively, refer to the steps in README-qnn.md (which borrows heavily from Intel's README-sycl.md, with sincere thanks).
llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 540 iterations
Expand details for performance related PR only
- Concurrent users: 8, duration: 10m
- HTTP request : avg=8677.33ms p(95)=20035.75ms fails=, finish reason: stop=492 truncated=48
- Prompt processing (pp): avg=95.63tk/s p(95)=443.17tk/s
- Token generation (tg): avg=47.46tk/s p(95)=47.64tk/s
- ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=qualcomm_qnn_backend_for_ggml commit=a98a4e999000105b81b472c7b36ff80131d68ef1
Chart: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 540 iterations (llamacpp:prompt_tokens_seconds over time).
Chart: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 540 iterations (llamacpp:predicted_tokens_seconds over time).
Chart: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 540 iterations (llamacpp:kv_cache_usage_ratio over time).
Chart: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 540 iterations (llamacpp:requests_processing over time).
Nice. With competent LLMs getting smaller and more efficient as well as Snapdragon laptops coming soon, it's important to make full use of the AI acceleration these SoCs provide through the Hexagon NPU Cluster.
This will make llama.cpp a robust backend for the future and will lead to power efficient LLMs on the go. Personally, I really can't wait!
llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 198 iterations
Expand details for performance related PR only
Thanks for your comment. This PR is a very initial implementation and could be a good starting point for a Qualcomm QNN backend for GGML. It would be better if domain experts from Qualcomm got involved in this effort after it is accepted by the community. I personally think this PR is also an example of the GGML way: try crazy ideas, build wild demos, and push the edge of what's possible.
Another thing: a small, standalone Android example (or reuse of the existing Android example in llama.cpp) is needed to make it easier for community developers to participate in developing and verifying the QNN backend.
Yes, it would be useful to have an example or instructions on how to run this. In the meantime, simply setting up test-backend-ops to run with ggml-qnn would be a good start for people who want to implement the missing operators.
Thanks for your guidance. I'll study how to use test-backend-ops.cpp to validate the QNN backend.
You would need to modify ggml_backend_registry_init to register the backend, then it should be automatically used by test-backend-ops.
https://github.com/ggerganov/llama.cpp/blob/54770413c484660d021dd51b5dbacab7880b8827/ggml-backend.c#L411
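For reference, here is a hedged sketch of what that registration could look like inside ggml_backend_registry_init (the function linked above). The ggml_backend_reg_qnn_init and ggml_backend_qnn_buffer_type names are assumptions for illustration, not necessarily the PR's actual symbols, and the ggml_backend_register signature is the one exposed by ggml-backend.h at roughly this point in time:

```cpp
// Sketch only: register one QNN device with the backend registry so that
// test-backend-ops picks it up automatically (the QNN-specific functions below
// are assumed names; device 1 = GPU and 2 = NPU/HTP could be registered the same way).
#ifdef GGML_USE_QNN
    extern GGML_CALL ggml_backend_t ggml_backend_reg_qnn_init(const char * params, void * user_data);
    extern GGML_CALL ggml_backend_buffer_type_t ggml_backend_qnn_buffer_type(size_t dev_num);
    ggml_backend_register("QNN-CPU", ggml_backend_reg_qnn_init, ggml_backend_qnn_buffer_type(0), (void *) (intptr_t) 0);
#endif
```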
Thanks for your help, it's really useful. I'm working on adapting test-backend-ops.cpp to the QNN backend on Android.
@ggerganov, @slaren, sorry to interrupt. The adaptation of test-backend-ops.cpp to the QNN backend is done and it works as expected on a Xiaomi 14 (Qualcomm SM8650-AB Snapdragon 8 Gen 3).
Could you take a moment to look at it? Thanks.
BTW, the design and implementation of test-backend-ops.cpp are really excellent; I never noticed this file/feature before.
BTW, should README-qnn.md be removed?
A QNN binary release that is not behind the partner portal is available at: https://tetra-public-assets.s3.us-west-2.amazonaws.com/qai-hub-apps/qnn/linux/2.20.1.240223.zip
thanks.
The prebuilt QNN SDK (libraries) used here was fetched from, and can be found at, Qualcomm's official website:
https://qpm.qualcomm.com/#/main/tools/details/qualcomm_ai_engine_direct
@zhouwg it was pulled offline by Qualcomm; I am asking them what happened.
@woachk @zhouwg it was downgraded because of missing assets.
Currently working link: https://qaihub-public-assets.s3.us-west-2.amazonaws.com/qai-hub-apps/qnn/linux/2.20.0.240223.zip
SHA256: 8d11d429ac1ce2612f89d495c228ee61763b7d1d70b3a4a3a01a064060e4a8be
I've also archived the file here, just in case it disappears again: archive.org
This branch needs rebasing over the latest master branch
@zhouwg attempted to resolve the conflict, but you may want to consider rebasing anyway.
Thanks for the reminder. I'll do it asap.
This branch needs rebasing over the latest master branch
Thanks for the reminder.
@zhouwg attempted to resolve the conflict, but you may want to consider rebasing anyway.
@mofosyne, thanks for the reminder and your guidance.
Rebase done (I'm working on other projects, and today I spent a little time learning how to use git rebase properly; I have to admit I'm not familiar with it).
BTW, this PR intends to add a Qualcomm QNN backend to the ggml inference framework, and some of the labels on this PR are incorrect. Could you help fix them? Thanks.
Also, I'd like to know whether you are one of the maintainers (with write privileges) of this project, besides the original author and slaren (the author of the backend subsystem)?
Thanks.
@zhouwg you mean writing directly to master? Well no. But been helping out at least with triaging, which should still be helpful? If you noticed, I've been putting labels everywhere at least (The label bot is certainly making things easier however).
Thanks for your comments.
This PR intends to add a new backend using Qualcomm's QNN (AI Engine Direct) SDK,
so the following labels would be appropriate:
enhancement, ggml, QNN, review complexity: medium/high (the reviewer should be a domain expert in Qualcomm's GPU/DSP/QNN).
Thanks.
@mofosyne, thanks for your feedback and your time.
Rebased onto the latest llama.cpp source with help from @mofosyne, GitHub Actions, and some minor manual changes. Any code review or comments are greatly welcomed and appreciated.
I can see that the current implementation supports the ADD, MUL and MUL_MAT ops. Is there a viable path to add support for the rest of the operations, such as RMS_NORM, SILU, etc.?
@ggerganov I hope so. QNN supports a ton of operations; I think the OP wants the GGML community to implement the other ops. For instance: https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/MasterOpDef.html#sigmoid. I don't think there's an RMSNorm, since it's fairly new, but it does have LayerNorm: https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/MasterOpDef.html#layernorm
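To make this concrete, here is an illustrative (and unverified) mapping of some remaining GGML ops onto QNN operation names: "ElementWiseAdd" and "ElementWiseMultiply" appear in the QNN log output later in this thread, while the other names follow the MasterOpDef documentation linked above and should be treated as assumptions:

```cpp
// Illustrative only: a possible GGML-op -> QNN-op-name mapping for future work.
// The string names are not verified against the QNN SDK headers.
#include "ggml.h"

static const char * ggml_op_to_qnn_op_name(enum ggml_op op, enum ggml_unary_op unary_op) {
    switch (op) {
        case GGML_OP_ADD:     return "ElementWiseAdd";
        case GGML_OP_MUL:     return "ElementWiseMultiply";
        case GGML_OP_MUL_MAT: return "MatMul";
        case GGML_OP_NORM:    return "LayerNorm";   // no RMSNorm in the QNN op set yet
        case GGML_OP_UNARY:
            // SiLU(x) = x * sigmoid(x): QNN's "Sigmoid" plus an extra element-wise multiply
            return unary_op == GGML_UNARY_OP_SILU ? "Sigmoid" : nullptr;
        default:              return nullptr;       // unsupported: fall back to the CPU backend
    }
}
```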
Thanks for your comments. There are two possible paths to resolve this:
- Refine the ggml backend subsystem to enable mixed inference across CPU, GPU, and NPU more easily. There is already a "Backend Scheduler" feature in the backend subsystem, but it is complex and not a straightforward path; refining the subsystem would also make some backend APIs more sensible.
For example, ggml_backend_supports_op is currently only called in https://github.com/ggerganov/llama.cpp/blob/master/tests/test-backend-ops.cpp#L406,
and, for example, ggml_backend_offload_op is not reasonable as it stands.
All in all, a special backend does not need to implement every GGML op; many of them can fall back to the default GGML (CPU) backend:
- The overall framework of the existing ggml backend subsystem is really excellent, but parts of it seem too strict for a special backend;
- GPU/NPU computing might be slower than CPU computing in some scenarios, once data copies / data preparation between CPU and GPU or CPU and NPU, memory size, and KV cache size are taken into account.
I'd like to submit a standalone, concise PR for this (less than one hundred LoC on top of the existing ggml backend subsystem, with no side effects), but I'm not sure whether such a standalone PR would be accepted by the maintainer of the ggml backend subsystem.
- Implement the other GGML ops with the QNN API one by one (just like Intel did in the SYCL backend, or the original author did in the Metal backend). I spent some hours investigating how QNN is used in onnxruntime, but I personally don't think it offers much reference value for the ggml QNN backend right now (this might be my misunderstanding), because Qualcomm uses a rather specialized, heavily encapsulated approach for its NN acceleration (Qualcomm's AI Hub is an example).
Hello, does the matmul implementation support all the quantizations (Q8_0, Q4_0) on QNN? Did we check the accuracy of the matmul?
Thanks for your comments.
- The current implementation only supports FP32 / FP16 (I'm not familiar with AI quantization techniques and their technical details), so the other quantized GGML data types are not used currently; this is a real limitation. It could be addressed in the upstream GGML community (there are many AI experts here) if this PR is accepted. Please refer to: https://github.com/zhouwg/llama.cpp/blob/qualcomm_qnn_backend_for_ggml/ggml-qnn.cpp#L885 (a rough dequantization sketch follows this list).
- This is a really good question. Some test cases have been verified on a Xiaomi 14 (a Qualcomm Snapdragon 8 Gen 3 based Android phone; it should also work on a Qualcomm desktop SoC based WoA device, although that is a further step for the ggml QNN backend). The current implementation works as expected with llama.cpp and whisper.cpp on Android using the QNN backend (CPU, GPU, NPU), but I found that ggml_qnn_mul_mat is not actually called during whisper and LLM inference, because of the strict sanity checks (due to reason 1) in ggml_qnn_compute_forward that keep the QNN API from reporting errors. The automated test of ggml_qnn_mul_mat also works fine on Android: the UT of ggml_qnn_mul_mat passes and the computation result is correct. Please refer to: https://github.com/zhouwg/kantv/pull/215.
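As referenced in the first item above, here is a minimal, standalone sketch (not the PR's code) of the simplest way a quantized type such as Q8_0 could be supported: dequantize to FP32 on the CPU before handing the data to a backend that only accepts FP32/FP16. The struct mirrors ggml's block_q8_0 layout (blocks of 32 values: one FP16 scale plus 32 int8 quants) purely for illustration; real code would reuse ggml's own dequantization helpers.

```cpp
// Sketch only: dequantize Q8_0 data (value[i] = d * qs[i]) into FP32.
#include "ggml.h"    // for ggml_fp16_t and ggml_fp16_to_fp32()
#include <cstdint>
#include <cstddef>
#include <vector>

struct block_q8_0_sketch {   // illustrative copy of the Q8_0 block layout
    ggml_fp16_t d;           // per-block scale
    int8_t      qs[32];      // 32 quantized values
};

static std::vector<float> dequantize_q8_0_sketch(const block_q8_0_sketch * blocks, size_t n_blocks) {
    std::vector<float> out(n_blocks * 32);
    for (size_t b = 0; b < n_blocks; ++b) {
        const float d = ggml_fp16_to_fp32(blocks[b].d);
        for (int i = 0; i < 32; ++i) {
            out[b * 32 + i] = d * (float) blocks[b].qs[i];
        }
    }
    return out;
}
```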
BTW, I'll submit a standalone, simple, concise PR to address a long-standing problem (enabling mixed inference between CPU & GPU / CPU & NPU easily) that ggerganov mentioned recently. That PR is less than one hundred LoC on top of the existing ggml backend subsystem, has no side effects, tries to follow the existing OO principles in ggml.c and ggml-backend.c, and works as expected with whisper.cpp and llama.cpp on my side. The GGML QNN backend and other potential backends would benefit from it.
I'm sorry to say that this standalone PR was not accepted by the maintainer of the ggml backend subsystem. Could anyone help review it? It provides a straightforward way to do mixed inference for certain ggml backends (those whose ggml_backend_xxx_buffer_is_host returns true).
I'd like to update my comments about the PR that refines the ggml backend subsystem for easy mixed inference between CPU & GPU / CPU & NPU for certain ggml backends (those whose ggml_backend_xxx_buffer_is_host returns true), which is related to this PR:
- A clearer view of the code changes can be found at https://github.com/ggerganov/llama.cpp/pull/7679/files (the original PR was closed and I cannot update it; I'm not sure whether that was caused by my local git merge from the latest upstream. I want to follow the rules and principles of the ggml community strictly and have no intention of spamming PRs). Unfortunately, this new PR was also closed by the maintainer of the ggml backend subsystem.
- Any existing backend (such as QNN, Metal, ...) or new backend can follow the style used in the QNN backend if its ggml_backend_xxx_buffer_is_host returns true (or if it is a backend that only needs system memory). In other words, the concern raised by the maintainer of the ggml backend subsystem is not quite correct.
- Any new ggml backend would benefit from the approach in the standalone PR, because it is very simple and straightforward (of course, the performance of mixed inference could be improved with a more sophisticated algorithm in the future, for example pre-computed/prefetched metadata of the ggml cgraph).
- The "Backend Sched" feature provided by the maintainer of the ggml backend subsystem can be used for other scenarios (for example, complicated scenarios in llama.cpp, or a backend that needs to use device (GPU/NPU) memory directly). In fact, the "Backend Sched" feature is already used heavily in llama.cpp.
- The new mixed-inference approach has no conflict with the original "Backend Sched" feature. This can be verified with the Android APK built manually from source (https://github.com/zhouwg/kantv) plus "adb logcat | grep KANTV", or simply by reading the code (less than one hundred LoC).
- I can see that the function ggml_graph_compute in ggml.c is also exported/referenced in ggml-backend.c. I'd like to know whether this means the maintainer of the ggml backend subsystem can do whatever they want (export a function in ggml.c and reference it in ggml-backend.c at will, while prohibiting other community programmers from doing the same thing for the public/broader interest). If the answer is yes, I have nothing more to say, although that does not make sense to me.
- The new approach in the standalone PR has no side effects on the existing code and existing backends. It works very well for whisper/LLM/MiniCPM-V inference using the QNN CPU/GPU/NPU backends on an Android phone, all test cases pass in my local dev environment, and it can be verified with the Android APK built manually from source (https://github.com/zhouwg/kantv) plus "adb logcat | grep KANTV". BTW, I have to say there seem to be some unknown bugs in test-backend-ops.cpp, found after validating it with the QNN backend many times.
Hello, does the matmul implementation support all the quantizations (Q8_0, Q4_0) on QNN? Did we check the accuracy of the matmul?
Hello, a dedicated Android command-line program is provided in this PR, and it can answer your question better:
weiguo:$ ./run-ggml-qnn.sh GGML_OP_ADD
/data/local/tmp//libQnnCpu.so
QNN libs already exist on Android phone
ggml-qnn-test: 1 file pushed. 16.2 MB/s (4567168 bytes in 0.269s)
[main, 344]: enter qnn_ggml_op
[main, 345]: ggml op:2(ADD)
[main, 359]: Allocating Memory of size 33554432 bytes, 32 MB
[ggml_backend_qnn_init, 3955]: device 0
[ggml_backend_qnn_init, 3956]: qnn_lib_path /data/local/tmp/
[qnn_init, 2172]: enter qni_init
[load_system, 2033]: system_lib_path:/data/local/tmp/libQnnSystem.so
[load_system, 2082]: find a valid qnn system interface
[load_system, 2092]: initialize qnn system successfully
[qnn_init, 2180]: load QNN system lib successfully
[load_backend, 1911]: lib_path:/data/local/tmp/libQnnCpu.so
[load_backend, 1935]: num_providers=1
[load_backend, 1960]: find a valid qnn interface
[load_backend, 2005]: saver_initialize is null
[qnn_init, 2213]: initialize qnn log successfully
[qnn_init, 2224]: initialize qnn backend successfully
[qnn_init, 2230]: device property is not supported
[qnn_init, 2241]: create device successfully
[qnn_init, 2245]: profiling turned on; level = 2
[qnn_init, 2256]: detailed profiling requested. Creating Qnn Profile object
[qnn_init, 2262]: initialize qnn profile successfully
[qnn_init, 2272]: load rpcmem lib successfully
[qnn_init, 2299]: initialize qnn context successfully
[qnn_init, 2302]: leave qni_init
[ggml_backend_qnn_init, 4011]: qnn device name QNN-CPU
[init_qnn_graph, 2406]: succeed to create graph QNN-CPU, 0xd4a50a4a47bcdc2f
[main, 395]: creating new tensors
[main, 396]: ggml_blck_size(f32) 1
[main, 397]: ggml_type_size(f32) 4
[main, 436]: creating backend buffer
[main, 448]: creating compute graph
[ggml_qnn_can_handle_op, 2458]: op name:ADD, tensor type:f32
[ggml_qnn_can_handle_op, 2460]: src0 type:f32
[ggml_qnn_can_handle_op, 2463]: src1 type:f32
[ggml_qnn_add, 2574]: call ggml_qnn_add
[ggml_qnn_add, 2578]: tensor_0: type = 0 ( f32) ne = 4 x 4 x 1, nb = ( 4, 16, 64)
[ggml_qnn_add, 2582]: tensor_1: type = 0 ( f32) ne = 4 x 4 x 1, nb = ( 4, 16, 64)
[ggml_qnn_add, 2586]: tensor_2: type = 0 ( f32) ne = 4 x 4 x 1, nb = ( 4, 16, 64)
[ggml_qnn_add, 2587]: 4, 4, 1, 1
[ggml_qnn_add, 2588]: tensor0 name tensor_0
[ggml_qnn_add, 2589]: tensor1 name tensor_1
[ggml_qnn_add, 2590]: tensor2 name tensor_2
[ggml_qnn_add, 2617]: graph name ggml_op_qnn_add_1tensor_0_tensor_1
[ggml_qnn_logcallback, 2165]: 15.3ms [ DEBUG ] getNode OpPackage-Name : qti.aisw Node-Type : ElementWiseAdd
[ggml_qnn_logcallback, 2165]: 15.4ms [VERBOSE] validate Node-Type : ElementWiseAdd Node-Name : ggml_op_add
[ggml_qnn_logcallback, 2165]: 15.6ms [ INFO ] CpuGraph::finalize
[ggml_qnn_logcallback, 2165]: 15.7ms [ DEBUG ] Setting data pointer for tensor ID: 1
[ggml_qnn_logcallback, 2165]: 15.7ms [ DEBUG ] Setting data pointer for tensor ID: 2
[ggml_qnn_logcallback, 2165]: 15.8ms [ DEBUG ] Setting data pointer for tensor ID: 3
[ggml_qnn_logcallback, 2165]: 15.8ms [ INFO ] CpuGraph::execute
[get_tensor_rank, 210]: tensor->rank 4
[get_tensor_rank, 211]: get_tensor_rank 2
[get_tensor_data_size, 223]: get_tensor_data_size 64
[get_tensor_data_size, 224]: ggml_nbytes(tensor) 64
[main, 464]: dump:
[tensor_dump, 191]: dump ggml tensor src0(tensor_0)
[tensor_dump, 195]: src0: type = 0 ( f32) ne = 4 x 4 x 1, nb = ( 4, 16, 64)
[tensor_sum_elements, 151]: 0.80 -0.50 -0.32 -0.93
[tensor_sum_elements, 155]:
[tensor_sum_elements, 151]: 0.87 0.88 -0.09 0.11
[tensor_sum_elements, 155]:
[tensor_sum_elements, 151]: -0.89 0.14 0.13 0.37
[tensor_sum_elements, 155]:
[tensor_sum_elements, 151]: -0.82 -0.83 -0.81 0.18
[tensor_sum_elements, 155]:
[tensor_sum_elements, 185]:
[tensor_dump, 198]:
[tensor_dump, 191]: dump ggml tensor src1(tensor_1)
[tensor_dump, 195]: src1: type = 0 ( f32) ne = 4 x 4 x 1, nb = ( 4, 16, 64)
[tensor_sum_elements, 151]: -0.23 -0.98 -0.43 0.93
[tensor_sum_elements, 155]:
[tensor_sum_elements, 151]: -0.27 -0.33 -0.73 0.73
[tensor_sum_elements, 155]:
[tensor_sum_elements, 151]: -0.40 -0.12 -0.64 -0.81
[tensor_sum_elements, 155]:
[tensor_sum_elements, 151]: -0.16 -0.42 0.32 -0.75
[tensor_sum_elements, 155]:
[tensor_sum_elements, 185]:
[tensor_dump, 198]:
[tensor_dump, 191]: dump ggml tensor dst(tensor_2)
[tensor_dump, 195]: dst: type = 0 ( f32) ne = 4 x 4 x 1, nb = ( 4, 16, 64)
[tensor_sum_elements, 151]: 0.57 -1.49 -0.75 -0.00
[tensor_sum_elements, 155]:
[tensor_sum_elements, 151]: 0.60 0.54 -0.82 0.84
[tensor_sum_elements, 155]:
[tensor_sum_elements, 151]: -1.30 0.02 -0.51 -0.44
[tensor_sum_elements, 155]:
[tensor_sum_elements, 151]: -0.99 -1.25 -0.48 -0.56
[tensor_sum_elements, 155]:
[tensor_sum_elements, 185]:
[tensor_dump, 198]:
[ggml_backend_qnn_free, 3753]: enter ggml_backend_qnn_free
[ggml_backend_qnn_free, 3755]: idx 0, name:qnn-cpu
[ggml_backend_qnn_free, 3764]: graph type:ADD
[qnn_finalize, 2318]: succeed to close rpcmem lib
[ggml_backend_qnn_free, 3786]: leave ggml_backend_qnn_free
weiguo:$ ./run-ggml-qnn.sh GGML_OP_MULMAT 1
/data/local/tmp//libQnnCpu.so
QNN libs already exist on Android phone
ggml-qnn-test: 1 file pushed. 18.0 MB/s (4564520 bytes in 0.242s)
not supported currently
Usage:
./run-ggml-qnn.sh GGML_OP_ADD 0/1/2
./run-ggml-qnn.sh GGML_OP_MUL 0/1/2
./run-ggml-qnn.sh GGML_OP_MUL_MAT 0/1/2
weiguo:$ ./run-ggml-qnn.sh GGML_OP_MUL 1
/data/local/tmp//libQnnCpu.so
QNN libs already exist on Android phone
ggml-qnn-test: 1 file pushed. 21.7 MB/s (4564520 bytes in 0.201s)
[main, 352]: enter qnn_ggml_op
[main, 353]: ggml op:6(MUL)
[main, 360]: Allocating Memory of size 33554432 bytes, 32 MB
[ggml_backend_qnn_init, 3523]: device 1
[ggml_backend_qnn_init, 3524]: qnn_lib_path /data/local/tmp/
[qnn_init, 1783]: enter qni_init
[load_system, 1645]: system_lib_path:/data/local/tmp/libQnnSystem.so
[load_system, 1694]: find a valid qnn system interface
[load_system, 1704]: initialize qnn system successfully
[qnn_init, 1791]: load QNN system lib successfully
[load_backend, 1523]: lib_path:/data/local/tmp/libQnnGpu.so
[load_backend, 1547]: num_providers=1
[load_backend, 1572]: find a valid qnn interface
[load_backend, 1617]: saver_initialize is null
[qnn_init, 1824]: initialize qnn log successfully
[ggml_qnn_logcallback, 1776]: 0.0ms [ INFO ] QNN API Version: 2.14.0
[ggml_qnn_logcallback, 1776]: 0.1ms [ INFO ] QNN GPU API Version: 3.3.0
[ggml_qnn_logcallback, 1776]: 0.4ms [ INFO ] Found /vendor/lib64/libOpenCL.so
[ggml_qnn_logcallback, 1776]: 9.2ms [ INFO ] Device version: 3.0 Device tier: 750
[ggml_qnn_logcallback, 1776]: 12.1ms [ INFO ] OpenCL Driver version: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.42.23.09
[ggml_qnn_logcallback, 1776]: 12.3ms [ INFO ] QnnOpPackage: v2.0.0
[ggml_qnn_logcallback, 1776]: 0.0ms [ INFO ] Creating operation package: qti.aisw
[ggml_qnn_logcallback, 1776]: 0.1ms [ INFO ] Found /vendor/lib64/libOpenCL.so
[ggml_qnn_logcallback, 1776]: 12.6ms [ INFO ] QnnOpPackage: qti.aisw
[qnn_init, 1835]: initialize qnn backend successfully
[qnn_init, 1841]: device property is not supported
[qnn_init, 1852]: create device successfully
[qnn_init, 1856]: profiling turned on; level = 2
[qnn_init, 1867]: detailed profiling requested. Creating Qnn Profile object
[qnn_init, 1873]: initialize qnn profile successfully
[qnn_init, 1883]: load rpcmem lib successfully
[qnn_init, 1910]: initialize qnn context successfully
[qnn_init, 1913]: leave qni_init
[ggml_backend_qnn_init, 3578]: qnn device name QNN-GPU
[ggml_qnn_logcallback, 1776]: 13.9ms [ INFO ] Graph precision mode is user provided
[ggml_qnn_logcallback, 1776]: 13.9ms [ INFO ] Memory Optimizations enabled
[ggml_qnn_logcallback, 1776]: 13.9ms [ INFO ] Node Optimizations enabled
[ggml_qnn_logcallback, 1776]: 13.9ms [ INFO ] Queue Recording enabled
[init_qnn_graph, 2017]: succeed to create graph QNN-GPU, 0xd4a56ad95bbcdc2f
[main, 383]: creating new tensors
[main, 384]: ggml_blck_size(f32) 1
[main, 385]: ggml_type_size(f32) 4
[main, 426]: creating compute graph
[ggml_qnn_can_handle_op, 2072]: op name:MUL, tensor type:f32
[ggml_qnn_can_handle_op, 2074]: src0 type:f32
[ggml_qnn_can_handle_op, 2077]: src1 type:f32
[ggml_qnn_hanlde_op, 2584]: call ggml_qnn_hanlde_op
[ggml_qnn_hanlde_op, 2588]: tensor_0: type = 0 ( f32) ne = 4 x 4 x 1, nb = ( 4, 16, 64)
[ggml_qnn_hanlde_op, 2592]: tensor_1: type = 0 ( f32) ne = 4 x 4 x 1, nb = ( 4, 16, 64)
[ggml_qnn_hanlde_op, 2596]: tensor_2: type = 0 ( f32) ne = 4 x 4 x 1, nb = ( 4, 16, 64)
[ggml_qnn_hanlde_op, 2597]: 4, 4, 1, 1
[ggml_qnn_hanlde_op, 2598]: tensor0 name tensor_0
[ggml_qnn_hanlde_op, 2599]: tensor1 name tensor_1
[ggml_qnn_hanlde_op, 2600]: tensor2 name tensor_2
[ggml_qnn_hanlde_op, 2617]: qnn graph name ggml_qnn_graph_MUL4tensor_0_tensor_1
[ggml_qnn_hanlde_op, 2618]: qnn op_config name ggml_qnn_op_config_MUL4tensor_0_tensor_1
[ggml_qnn_logcallback, 1776]: 22.3ms [ INFO ] Graph precision mode is user provided
[ggml_qnn_logcallback, 1776]: 22.3ms [ INFO ] Memory Optimizations enabled
[ggml_qnn_logcallback, 1776]: 22.4ms [ INFO ] Node Optimizations enabled
[ggml_qnn_logcallback, 1776]: 22.4ms [ INFO ] Queue Recording enabled
[ggml_qnn_logcallback, 1776]: 28.1ms [ INFO ] QnnGraph_finalize: start
[ggml_qnn_logcallback, 1776]: 15.9ms [ INFO ] Create operation: ElementWiseMultiply
[ggml_qnn_logcallback, 1776]: 128.6ms [ INFO ] finalize: total host time: 100.5 [ms]
[ggml_qnn_logcallback, 1776]: 128.8ms [ INFO ] QnnGraph_finalize: finish
[ggml_qnn_logcallback, 1776]: 128.9ms [ INFO ] QnnGraph_execute: start
[ggml_qnn_logcallback, 1776]: 136.9ms [ INFO ] execute: total host time: 8.0 [ms]
[ggml_qnn_logcallback, 1776]: 137.0ms [ INFO ] QnnGraph_execute: finish
[ggml_qnn_hanlde_op, 2718]: duration of ggml_qnn_MUL : 114 milliseconds
[ggml_qnn_hanlde_op, 2719]: call ggml_qnn_hanlde_op done
[get_tensor_data_size, 213]: get_tensor_data_size 64
[get_tensor_data_size, 214]: ggml_nbytes(tensor) 64
[main, 442]: dump tensors:
[tensor_dump, 183]: dump ggml tensor src0(tensor_0)
[tensor_dump, 188]: src0: type = 0 ( f32) ne = 4 x 4 x 1, nb = ( 4, 16, 64)
[tensor_sum_elements, 167]: -0.06 -0.54 0.84 -0.75
[tensor_sum_elements, 171]:
[tensor_sum_elements, 167]: -0.88 -0.20 -0.18 0.93
[tensor_sum_elements, 171]:
[tensor_sum_elements, 167]: -0.99 0.52 0.98 0.90
[tensor_sum_elements, 171]:
[tensor_sum_elements, 167]: 0.20 -0.70 -0.78 0.50
[tensor_sum_elements, 171]:
[tensor_sum_elements, 177]:
[tensor_dump, 191]:
[tensor_dump, 183]: dump ggml tensor src1(tensor_1)
[tensor_dump, 188]: src1: type = 0 ( f32) ne = 4 x 4 x 1, nb = ( 4, 16, 64)
[tensor_sum_elements, 167]: -0.60 0.10 -0.11 -0.62
[tensor_sum_elements, 171]:
[tensor_sum_elements, 167]: -0.99 -0.86 -0.54 -0.25
[tensor_sum_elements, 171]:
[tensor_sum_elements, 167]: -0.79 0.00 -0.86 -0.79
[tensor_sum_elements, 171]:
[tensor_sum_elements, 167]: -0.00 0.37 0.00 -0.33
[tensor_sum_elements, 171]:
[tensor_sum_elements, 177]:
[tensor_dump, 191]:
[tensor_dump, 183]: dump ggml tensor dst(tensor_2)
[tensor_dump, 188]: dst: type = 0 ( f32) ne = 4 x 4 x 1, nb = ( 4, 16, 64)
[tensor_sum_elements, 167]: 0.04 -0.06 -0.09 0.46
[tensor_sum_elements, 171]:
[tensor_sum_elements, 167]: 0.87 0.18 0.10 -0.24
[tensor_sum_elements, 171]:
[tensor_sum_elements, 167]: 0.78 0.00 -0.85 -0.71
[tensor_sum_elements, 171]:
[tensor_sum_elements, 167]: -0.00 -0.26 -0.00 -0.17
[tensor_sum_elements, 171]:
[tensor_sum_elements, 177]:
[tensor_dump, 191]:
[ggml_backend_qnn_free, 3326]: enter ggml_backend_qnn_free
[ggml_backend_qnn_free, 3328]: idx 1, name:qnn-gpu
[ggml_backend_qnn_free, 3337]: graph type:MUL
[qnn_finalize, 1929]: succeed to close rpcmem lib
[ggml_backend_qnn_free, 3351]: leave ggml_backend_qnn_free
[main, 467]: duration of ut GGML_OP_MUL using QNN backend QNN-GPU: 146 milliseconds
This dedicated Android command-line program for the QNN backend UT works very well, the results can be reproduced easily without any dependencies, and it will be used to add quantized data support to the QNN backend in the future.
@zhouwg attempted to resolve the conflict, but you may want to consider rebasing anyway.
@mofosyne, thanks for the reminder and your help. Now I know how to rebase this PR (it doesn't seem very difficult, but it needs more practice) and to compress the 20+ commits into 2-3 commits to make review easier. This PR should be clean now. Thanks again.
Rebased again:
(1) Fixed a long-standing, silly bug in the original PR (a bug rather than an issue: a bug must be fixed, while an issue can be improved gradually); this PR should be bug-free now (there are some known performance issues in the QNN NPU backend, but that is a long-term task).
(2) Refined ggml-qnn.cpp and the scripts/code of the Android command-line UT program to follow the coding style of the ggml community more strictly.
(3) Added GGML_TYPE_Q8_0 support:
./ggml-qnn-ut-build-run.sh GGML_OP_ADD 2
[get_tensor_data_size, 283]: get_tensor_data_size 136
[get_tensor_data_size, 284]: ggml_nbytes(tensor) 136
[main, 513]: dump tensors:
[tensor_dump, 170]: dump ggml tensor src0(tensor_0): type = 8 ( q8_0) ne = 32 x 4 x 1, nb = ( 34, 34, 136)
[tensor_dump, 257]:
0.63 0.59 -0.39 0.02 -0.49 0.60 -0.49 0.12 0.15 0.40 -0.84 0.29 -0.27 0.57 0.51 0.86 0.40 0.43 0.43 0.49 -0.32 0.96 0.85 0.18 -0.93 0.61 -0.63 0.09 0.38 0.51 0.75 -0.59
-0.17 -0.89 -0.20 -0.36 -0.92 0.28 0.39 -0.37 0.37 0.46 0.72 -0.99 0.11 -0.64 0.30 -0.92 -0.72 0.19 -0.23 -0.10 -0.99 0.74 -0.30 -0.61 0.07 0.78 0.08 -0.50 0.57 -0.68 -0.09 0.24
0.39 -0.89 -0.01 -0.60 -0.65 0.21 -0.97 0.29 0.91 0.95 0.67 0.33 -0.81 -0.18 0.80 -0.27 0.76 -0.27 0.10 -0.78 0.53 -0.18 -0.87 0.47 -0.81 0.26 0.61 0.49 0.88 -0.44 0.66 0.89
-0.97 -0.26 -0.96 0.46 -0.74 -0.16 -0.83 -0.29 0.04 -0.60 -0.13 -0.09 -0.15 -0.49 0.25 -0.07 0.98 -0.94 -0.48 0.33 0.89 0.70 -0.80 -0.90 0.90 0.83 -0.50 -0.75 0.00 -0.46 -0.15 0.36
[tensor_dump, 170]: dump ggml tensor src1(tensor_1): type = 0 ( f32) ne = 32 x 4 x 1, nb = ( 4, 128, 512)
[tensor_dump, 214]:
0.76 0.20 0.60 0.27 0.87 -0.65 -0.78 0.05 -0.60 -0.19 0.02 -0.33 0.02 -0.17 0.80 -0.88 -0.83 0.08 -0.90 0.94 -0.94 0.25 -0.90 -0.87 0.13 0.48 0.22 -0.32 -0.87 0.45 0.80 0.64
-0.89 0.57 -0.35 0.16 0.65 0.55 -0.11 0.30 -0.03 -0.14 0.50 -0.39 0.60 -0.59 -0.48 0.57 -0.88 0.62 0.57 0.05 0.86 -0.12 -0.00 0.70 -0.60 0.87 0.51 0.30 -0.96 0.23 0.99 0.55
-0.64 0.18 -0.79 -0.34 -0.78 -0.71 0.50 -0.25 -0.22 0.04 0.73 -0.12 0.22 -0.95 -0.71 0.61 0.48 -0.26 0.10 0.99 -0.22 0.98 -0.07 0.18 -0.58 -0.47 0.12 0.15 0.78 0.86 0.83 0.11
-0.60 0.80 -0.35 0.13 0.73 0.38 0.93 -0.26 -0.14 -0.18 0.50 -0.06 0.79 0.71 -0.90 -0.28 -0.36 0.43 -0.31 -0.73 0.27 -0.27 -0.68 0.88 0.85 -0.86 0.97 0.70 0.61 0.29 0.74 0.23
[tensor_dump, 170]: dump ggml tensor dst(tensor_2): type = 8 ( q8_0) ne = 32 x 4 x 1, nb = ( 34, 34, 136)
[tensor_dump, 257]:
1.39 0.79 0.22 0.29 0.38 -0.05 -1.27 0.17 -0.45 0.21 -0.82 -0.05 -0.26 0.40 1.30 -0.02 -0.43 0.51 -0.46 1.44 -1.27 1.22 -0.05 -0.68 -0.80 1.08 -0.40 -0.23 -0.50 0.96 1.55 0.05
-1.05 -0.31 -0.55 -0.21 -0.27 0.83 0.29 -0.06 0.34 0.32 1.22 -1.38 0.70 -1.23 -0.17 -0.36 -1.60 0.81 0.34 -0.05 -0.13 0.61 -0.30 0.09 -0.53 1.65 0.58 -0.19 -0.39 -0.45 0.90 0.79
-0.25 -0.71 -0.80 -0.93 -1.43 -0.49 -0.47 0.05 0.70 0.98 1.39 0.21 -0.60 -1.13 0.09 0.34 1.25 -0.52 0.20 0.21 0.30 0.80 -0.94 0.64 -1.39 -0.21 0.72 0.64 1.67 0.42 1.48 1.00
-1.57 0.54 -1.31 0.58 -0.01 0.22 0.10 -0.54 -0.11 -0.79 0.37 -0.15 0.63 0.22 -0.66 -0.34 0.62 -0.51 -0.79 -0.40 1.16 0.43 -1.48 -0.01 1.75 -0.03 0.47 -0.06 0.61 -0.18 0.59 0.59
[ggml_backend_qnn_free, 3396]: enter ggml_backend_qnn_free
[ggml_backend_qnn_free, 3421]: leave ggml_backend_qnn_free
(4) The Android command-line UT program works well on any mainstream Qualcomm mobile SoC based Android phone (the NPU backend has only been verified on the Qualcomm Snapdragon 8 Gen 3, a Qualcomm flagship mobile SoC released in October 2023) and has no extra dependencies: it does not depend on my standalone PR for easy mixed inference between Qualcomm CPU & GPU / CPU & NPU based on refining the existing ggml backend subsystem, nor on the ongoing refinement of the backend subsystem by its maintainer (which I personally think may be essentially/technically the same as my standalone PR, but brings more complexity and modifications to the existing code).
Now I personally think this PR (the ggml-qnn backend) is a real, functional backend:
- The source code is clean and, in my opinion, bug-free, and follows the coding style of the ggml community.
- The entire data path of the ggml-qnn backend (following the framework of the ggml backend subsystem) works fine with the Android command-line UT program.
- The three ggml-qnn ops (ADD, MUL, MUL_MAT) also work fine with the Android command-line UT program.
- The dedicated UT program can be used for troubleshooting or further development.
- A general approach for mixed inference between Qualcomm CPU & GPU / CPU & NPU is provided in a standalone PR, built on the existing ggml backend subsystem in less than 100 LoC and with no side effects on the existing backends/code. It works very well in my personal ggml learning and study project.
- There are no unnecessary dependencies (except Qualcomm's QNN SDK for building the UT program).
- More complicated UT cases (whisper.cpp inference using the QNN backend from the Android command line; whisper.cpp and llama.cpp inference using the QNN backend in an Android APK) and a real scenario (real-time AI subtitles powered by the great whisper.cpp) using the ggml-qnn backend can be verified in my personal ggml learning and study project: https://github.com/zhouwg/kantv/pull/229
Any community developer or AI expert who is interested in this topic can verify it very easily. Issue reports or bug reports for the ggml QNN backend are also greatly welcomed and appreciated.