# [WIP] backend: Integrating QNN (Qualcomm AI Engine Direct) as a dedicated backend for Qualcomm NPUs

> **Warning:** This is an early draft of my fork and will continue to be updated to meet the requirements in the contributing guidelines.

## Summary
This fork is based on zhouwg's initial PR and performs further refactoring and improvements to introduce support for the Qualcomm QNN backend to GGML.
This backend is organized into three distinct integration layers:
```mermaid
graph TB
    subgraph "GGML Adaptation Layer"
        A1[Graph Caching, Mapping, and Execution]
        A2[Tensor Binding and Execution Flow]
    end
    subgraph "QNN Object Layer"
        B1[QNN System and Instance Management]
        B2[Dynamic Resource Handling]
    end
    subgraph "Utility Layer"
        C1[Dynamic Library Loading & Search Path Management]
        C2[General Utilities]
    end
    %% Relations to illustrate stack dependency
    A1 -->|Uses| B1
    A2 -->|Uses| B1
    B1 -->|Relies on| C1
```
### GGML Adaptation Layer

- **Graph Caching, Mapping, and Execution:**
  - Provides a robust mechanism to map a GGML computation graph into a corresponding QNN graph, allowing efficient offloading of operations to the QNN accelerator.
  - Implements graph caching strategies (in `backend-ops.cpp`) to minimize redundant graph creation and boost execution performance.
  - Seamlessly translates GGML operations into corresponding QNN op objects using specialized op constructors and configuration functions (configured in `op-config-caps.cpp` and `op-config-impl.cpp`).
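To illustrate the caching idea, here is a minimal sketch. The `mock_op`, `mock_qnn_graph`, and `graph_cache` names are hypothetical stand-ins for illustration only, not the actual types in `backend-ops.cpp`; the real backend derives its cache key from the `ggml_cgraph` contents.

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for the real graph types (illustration only).
struct mock_op { std::string name; };        // one node of a GGML graph
struct mock_qnn_graph { std::string key; };  // a finalized QNN graph

// Build a cache key from the op sequence so that topologically
// identical graphs map to the same cached QNN graph.
static std::string make_graph_key(const std::vector<mock_op> &ops) {
    std::string key;
    for (const auto &op : ops) {
        key += op.name;
        key += ';';
    }
    return key;
}

class graph_cache {
  public:
    // Returns a cached graph when the key matches; otherwise builds,
    // caches, and returns a new one.
    std::shared_ptr<mock_qnn_graph> get_or_build(const std::vector<mock_op> &ops) {
        const std::string key = make_graph_key(ops);
        auto it = _cache.find(key);
        if (it != _cache.end()) {
            return it->second;  // cache hit: no rebuild
        }
        ++build_count;          // track how often we really built a graph
        auto graph = std::make_shared<mock_qnn_graph>(mock_qnn_graph{key});
        _cache.emplace(key, graph);
        return graph;
    }

    int build_count = 0;

  private:
    std::unordered_map<std::string, std::shared_ptr<mock_qnn_graph>> _cache;
};
```

Submitting the same op sequence twice should build the underlying QNN graph only once, which is the redundancy the caching strategy removes.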
- **Tensor Binding and Execution Flow:**
  - Adapts GGML tensor objects to the QNN backend (see `tensor.hpp` and `graph.hpp`), managing both host and RPC memory via buffer interfaces like `qnn_buffer_interface`.
  - Ensures proper data flow between GGML graphs and QNN execution contexts through carefully handled tensor binding/unbinding procedures.
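The bind/unbind pairing around each execution can be sketched with an RAII guard. `mock_tensor` and `tensor_bind_guard` are hypothetical illustrations, not the actual classes in `tensor.hpp`, but they show the invariant the backend maintains: tensors stay bound only for the duration of a graph execution.

```cpp
#include <utility>
#include <vector>

// Hypothetical stand-in for a backend tensor (illustration only).
struct mock_tensor {
    bool bound = false;
};

// RAII guard: binds a set of tensors on entry and guarantees
// unbinding on scope exit, even if execution throws.
class tensor_bind_guard {
  public:
    explicit tensor_bind_guard(std::vector<mock_tensor *> tensors)
        : _tensors(std::move(tensors)) {
        for (auto *t : _tensors) t->bound = true;
    }
    ~tensor_bind_guard() {
        for (auto *t : _tensors) t->bound = false;  // never leak a binding
    }
    tensor_bind_guard(const tensor_bind_guard &) = delete;
    tensor_bind_guard &operator=(const tensor_bind_guard &) = delete;

  private:
    std::vector<mock_tensor *> _tensors;
};

// Executes a (mock) graph with tensors bound only for the call's duration.
inline bool execute_with_bound_tensors(std::vector<mock_tensor *> tensors) {
    tensor_bind_guard guard(std::move(tensors));
    // ... hand the bound tensors to the QNN graph execute call here ...
    return true;
}
```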
### QNN Object Layer
- **QNN System and Instance Management:**
  - Encapsulates the QNN system via the `qnn_system_interface` class, originally derived from executorch, to create and free the QNN system context.
  - Manages QNN instance creation and initialization via the `qnn_instance` class.
  - Implements backend loading routines (e.g., `load_backend()` and `load_system()`) that retrieve provider lists and choose valid QNN interfaces based on API version checks.
  - Uses caching mechanisms for loaded backends and tracks library handles to guarantee proper cleanup during finalization.
- **Dynamic Resource Handling:**
  - Integrates fallback mechanisms in `load_lib_with_fallback()` to reliably load both the system and RPC libraries.
  - Manages RPC memory allocation and deallocation via function-pointer resolution from the loaded RPC library.
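The fallback strategy amounts to trying candidate names in order until one loads. A minimal sketch, with `try_load` standing in for `dlopen()`/`LoadLibrary()`; the function name mirrors the source's `load_lib_with_fallback()` but this body is an assumption, not the actual implementation in `qnn-lib.cpp`.

```cpp
#include <functional>
#include <string>
#include <vector>

// Stand-in for dlopen()/LoadLibrary(), injected so the strategy is testable.
using load_fn = std::function<void *(const std::string &)>;

// Try each candidate in order (e.g. a full vendor path first, then the
// bare soname so the system search path applies); return the first handle
// that loads, or nullptr so the caller can disable the feature cleanly.
inline void *load_lib_with_fallback(const std::vector<std::string> &candidates,
                                    const load_fn &try_load) {
    for (const auto &name : candidates) {
        if (void *handle = try_load(name)) {
            return handle;  // first successful load wins
        }
    }
    return nullptr;
}
```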
### Utility Layer
- **Dynamic Library Loading & Search Path Management:**
  - Implements functions in `qnn-lib.cpp` to manage dynamic library loading with fallbacks.
  - Uses helper routines such as `insert_path()` and `set_qnn_lib_search_path()` to configure environment variables (like `LD_LIBRARY_PATH` on Linux and `ADSP_LIBRARY_PATH` on Android) based on a custom library search path.
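The path-insertion step reduces to prepending a directory to a delimiter-separated search variable. A hedged sketch follows; `prepend_search_path` is a hypothetical helper, not the actual `insert_path()` from `qnn-lib.cpp`, which would also need the platform-specific delimiter and a `setenv()` call.

```cpp
#include <string>

// Prepend `dir` to a colon-separated search variable such as
// LD_LIBRARY_PATH, so the custom directory takes precedence.
// (Hypothetical helper; the real code also updates the environment.)
inline std::string prepend_search_path(const std::string &current,
                                       const std::string &dir) {
    if (current.empty()) {
        return dir;  // no existing value: the variable becomes just `dir`
    }
    return dir + ":" + current;
}
```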
- **General Utilities:**
  - Provides detailed error and debug logging through QNN logging macros.
## Key Features and Improvements

- **Graph Mapping Mechanism:**
  - Efficient mapping of GGML graphs into QNN graphs is a standout feature, enabling the offloading and execution of computation graphs on hardware accelerators (see `graph.hpp` and `backend-ops.cpp`).
  - Graph caching strategies help reuse QNN graphs to reduce redundancy and enhance performance.
  - The translation of GGML operations into corresponding QNN ops supports various data types and parameter configurations.
- **Backend Context and Device Management:**
  - Comprehensive QNN instance initialization supports API negotiation, enhanced error handling, and detailed device property logging.
  - Detailed logs (chipset description, HTP architecture, VTCM memory size) facilitate debugging and performance tuning.
## Testing

- Basic functionality of the QNN backend has been verified on Android, Linux, and Windows platforms using `test-backend-ops`; this is integrated into the pipeline for each commit of the `dev-refactoring` branch.

  | Platform | `test-backend-ops` full console output |
  | --- | --- |
  | Android | test-backend-ops_all_android_ff033e1.log |
  | Linux | test-backend-ops_all_linux_ff033e1.log |

- Proper graph creation and execution paths are confirmed through detailed log messages.
- Memory registration and cleanup within tensor binding functions have been thoroughly checked.
## Current state

- The `test-backend-ops` suite passes on all platforms, including support for both qnn-npu and qnn-gpu devices.
- Testing with llama3.2-1b/3b-f16/32 models yields expected results.
- Quantized matrix multiplication is under development; for quantized modules, the CPU backend may be used as a fallback.
## Future development
- Further feature support and device-specific optimizations are planned (see also the project backlog).
- Future iterations will add support for quantization data types, with efforts underway to map GGML's block quantization structure into QNN.