# [WIP] backend: Integrating QNN (Qualcomm AI Engine Direct) as a dedicated backend for Qualcomm NPUs

> **Warning:** This is an early draft of my fork and will continue to be updated to meet the requirements in the contributing guidelines.

## Summary
This fork is based on zhouwg's initial PR and performs further refactoring and improvements to introduce support for the Qualcomm QNN backend to GGML.
This backend is organized into three distinct integration layers:
```mermaid
graph TB
    subgraph "GGML Adaptation Layer"
        A1[Graph Caching, Mapping, and Execution]
        A2[Tensor Binding and Execution Flow]
    end
    subgraph "QNN Object Layer"
        B1[QNN System and Instance Management]
        B2[Dynamic Resource Handling]
    end
    subgraph "Utility Layer"
        C1[Dynamic Library Loading & Search Path Management]
        C2[General Utilities]
    end
    %% Relations to illustrate stack dependency
    A1 -->|Uses| B1
    A2 -->|Uses| B1
    B1 -->|Relies on| C1
```
### GGML Adaptation Layer

- **Graph Caching, Mapping, and Execution:**
  - Provides a robust mechanism to map a GGML computation graph into a corresponding QNN graph, allowing efficient offloading of operations to the QNN accelerator.
  - Implements graph caching strategies (in `backend-ops.cpp`) to minimize redundant graph creation and boost execution performance.
  - Seamlessly translates GGML operations into corresponding QNN op objects using specialized op constructors and configuration functions (configured in `op-config-caps.cpp` and `op-config-impl.cpp`).
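To illustrate the caching idea, here is a minimal sketch. The `mock_op`, `mock_qnn_graph`, and `graph_cache` names are hypothetical stand-ins for illustration only, not the actual types in `backend-ops.cpp`; the real backend derives its cache key from the `ggml_cgraph` contents.

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for the real graph types (illustration only).
struct mock_op { std::string name; };        // one node of a GGML graph
struct mock_qnn_graph { std::string key; };  // a finalized QNN graph

// Build a cache key from the op sequence so that topologically
// identical graphs map to the same cached QNN graph.
static std::string make_graph_key(const std::vector<mock_op> &ops) {
    std::string key;
    for (const auto &op : ops) {
        key += op.name;
        key += ';';
    }
    return key;
}

class graph_cache {
  public:
    // Returns a cached graph when the key matches; otherwise builds,
    // caches, and returns a new one.
    std::shared_ptr<mock_qnn_graph> get_or_build(const std::vector<mock_op> &ops) {
        const std::string key = make_graph_key(ops);
        auto it = _cache.find(key);
        if (it != _cache.end()) {
            return it->second;  // cache hit: no rebuild
        }
        ++build_count;          // track how often we really built a graph
        auto graph = std::make_shared<mock_qnn_graph>(mock_qnn_graph{key});
        _cache.emplace(key, graph);
        return graph;
    }

    int build_count = 0;

  private:
    std::unordered_map<std::string, std::shared_ptr<mock_qnn_graph>> _cache;
};
```

Submitting the same op sequence twice should build the underlying QNN graph only once, which is the redundancy the caching strategy removes.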
- **Tensor Binding and Execution Flow:**
  - Adapts GGML tensor objects to the QNN backend (see `tensor.hpp` and `graph.hpp`), managing both host and RPC memory via buffer interfaces like `qnn_buffer_interface`.
  - Ensures proper data flow between GGML graphs and QNN execution contexts through carefully handled tensor binding/unbinding procedures.
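The bind/unbind pairing around each execution can be sketched with an RAII guard. `mock_tensor` and `tensor_bind_guard` are hypothetical illustrations, not the actual classes in `tensor.hpp`, but they show the invariant the backend maintains: tensors stay bound only for the duration of a graph execution.

```cpp
#include <utility>
#include <vector>

// Hypothetical stand-in for a backend tensor (illustration only).
struct mock_tensor {
    bool bound = false;
};

// RAII guard: binds a set of tensors on entry and guarantees
// unbinding on scope exit, even if execution throws.
class tensor_bind_guard {
  public:
    explicit tensor_bind_guard(std::vector<mock_tensor *> tensors)
        : _tensors(std::move(tensors)) {
        for (auto *t : _tensors) t->bound = true;
    }
    ~tensor_bind_guard() {
        for (auto *t : _tensors) t->bound = false;  // never leak a binding
    }
    tensor_bind_guard(const tensor_bind_guard &) = delete;
    tensor_bind_guard &operator=(const tensor_bind_guard &) = delete;

  private:
    std::vector<mock_tensor *> _tensors;
};

// Executes a (mock) graph with tensors bound only for the call's duration.
inline bool execute_with_bound_tensors(std::vector<mock_tensor *> tensors) {
    tensor_bind_guard guard(std::move(tensors));
    // ... hand the bound tensors to the QNN graph execute call here ...
    return true;
}
```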
### QNN Object Layer
- **QNN System and Instance Management:**
  - Encapsulates the QNN system via the `qnn_system_interface` class, originally derived from executorch, to create and free the QNN system context.
  - Manages QNN instance creation and initialization via the `qnn_instance` class.
  - Implements backend loading routines (e.g., `load_backend()` and `load_system()`) that retrieve provider lists and choose valid QNN interfaces based on API version checks.
  - Uses caching mechanisms for loaded backends and tracks library handles to guarantee proper cleanup during finalization.
- **Dynamic Resource Handling:**
  - Integrates fallback mechanisms in `load_lib_with_fallback()` to reliably load both the system and RPC libraries.
  - Manages RPC memory allocation and deallocation via function-pointer resolution from the loaded RPC library.
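The fallback strategy amounts to trying candidate names in order until one loads. A minimal sketch, with `try_load` standing in for `dlopen()`/`LoadLibrary()`; the function name mirrors the source's `load_lib_with_fallback()` but this body is an assumption, not the actual implementation in `qnn-lib.cpp`.

```cpp
#include <functional>
#include <string>
#include <vector>

// Stand-in for dlopen()/LoadLibrary(), injected so the strategy is testable.
using load_fn = std::function<void *(const std::string &)>;

// Try each candidate in order (e.g. a full vendor path first, then the
// bare soname so the system search path applies); return the first handle
// that loads, or nullptr so the caller can disable the feature cleanly.
inline void *load_lib_with_fallback(const std::vector<std::string> &candidates,
                                    const load_fn &try_load) {
    for (const auto &name : candidates) {
        if (void *handle = try_load(name)) {
            return handle;  // first successful load wins
        }
    }
    return nullptr;
}
```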
### Utility Layer
- **Dynamic Library Loading & Search Path Management:**
  - Implements functions in `qnn-lib.cpp` to manage dynamic library loading with fallbacks.
  - Uses helper routines such as `insert_path()` and `set_qnn_lib_search_path()` to configure environment variables (like `LD_LIBRARY_PATH` on Linux and `ADSP_LIBRARY_PATH` on Android) based on a custom library search path.
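The path-insertion step reduces to prepending a directory to a delimiter-separated search variable. A hedged sketch follows; `prepend_search_path` is a hypothetical helper, not the actual `insert_path()` from `qnn-lib.cpp`, which would also need the platform-specific delimiter and a `setenv()` call.

```cpp
#include <string>

// Prepend `dir` to a colon-separated search variable such as
// LD_LIBRARY_PATH, so the custom directory takes precedence.
// (Hypothetical helper; the real code also updates the environment.)
inline std::string prepend_search_path(const std::string &current,
                                       const std::string &dir) {
    if (current.empty()) {
        return dir;  // no existing value: the variable becomes just `dir`
    }
    return dir + ":" + current;
}
```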
- **General Utilities:**
  - Provides detailed error and debug logging through QNN logging macros.
## Key Features and Improvements

- **Graph Mapping Mechanism:**
  - Efficient mapping of GGML graphs into QNN graphs is a standout feature, enabling the offloading and execution of computation graphs on hardware accelerators (see `graph.hpp` and `backend-ops.cpp`).
  - Graph caching strategies help reuse QNN graphs to reduce redundancy and enhance performance.
  - The translation of GGML operations into corresponding QNN ops supports various data types and parameter configurations.
- **Backend Context and Device Management:**
  - Comprehensive QNN instance initialization supports API negotiation, enhanced error handling, and detailed device property logging.
  - Detailed logs (chipset description, HTP architecture, VTCM memory size) facilitate debugging and performance tuning.
## Testing

- Basic functionality of the QNN backend has been verified on Android, Linux, and Windows platforms using `test-backend-ops`; this is integrated into the pipeline for each commit of the `dev-refactoring` branch.

  | Platform | `test-backend-ops` full console output |
  | --- | --- |
  | Android | test-backend-ops_all_android_ff033e1.log |
  | Linux | test-backend-ops_all_linux_ff033e1.log |

- Proper graph creation and execution paths are confirmed through detailed log messages.
- Memory registration and cleanup within tensor binding functions have been thoroughly checked.
## Current state

- The `test-backend-ops` suite passes on all platforms, including support for both qnn-npu and qnn-gpu devices.
- Testing with llama3.2-1b/3b-f16/32 models yields expected results.
- Quantized matrix multiplication is under development; for quantized modules, the CPU backend may be used as a fallback.
## Future development
- Further feature support and device-specific optimizations are planned (see also the project backlog).
- Future iterations will add support for quantization data types, with efforts underway to map GGML's block quantization structure into QNN.