[QNN-EP] Add MatMulNBits translation for GPU
Description
Add support for translating the MatMulNBits contrib op to the QNN FullyConnected operation with INT4 block-quantized weights.
Implementation details:
- Translate MatMulNBits to FullyConnected in the OpBuilder
- Support QNN_QUANTIZATION_ENCODING_BLOCK for INT4 weights
- Pass INT4 weights and quantization parameters as block-quantization encoding params to QNN
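For context, the math behind block-quantized INT4 weights can be sketched as follows. This is an illustrative reference only, not the actual QNN or ORT implementation: the function name, argument layout, and the assumption of row-major nibble packing (low nibble first) are all hypothetical, and real MatMulNBits/QNN block encoding details may differ.

```python
import numpy as np

def dequantize_int4_blockwise(packed, scales, zero_points, block_size, rows, cols):
    """Reference dequantization of block-quantized INT4 weights.

    packed:      uint8 array, two 4-bit values per byte (low nibble first, assumed)
    scales:      float32 array of shape (rows, n_blocks), one scale per block
    zero_points: int32 array of shape (rows, n_blocks), one zero point per block
    """
    # Unpack two 4-bit values per byte.
    low = packed & 0x0F
    high = (packed >> 4) & 0x0F
    q = np.empty(packed.size * 2, dtype=np.int32)
    q[0::2] = low
    q[1::2] = high
    q = q[: rows * cols].reshape(rows, cols)

    # Apply per-block affine dequantization: w = (q - zero_point) * scale.
    out = np.empty((rows, cols), dtype=np.float32)
    n_blocks = (cols + block_size - 1) // block_size
    for b in range(n_blocks):
        s, e = b * block_size, min((b + 1) * block_size, cols)
        out[:, s:e] = (q[:, s:e] - zero_points[:, b:b + 1]) * scales[:, b:b + 1]
    return out
```

With this encoding, the translated FullyConnected op can consume the packed INT4 tensor directly while QNN applies the per-block scale/zero-point on the fly.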
Testing:
- Added new unit tests for the MatMulNBits -> QNN-GPU translation
- Validated all OnnxRuntime tests
- Validated the following LLMs through the Olive and ORT-GenAI execution flow:
  - LLaMA 3.2 1B
  - Qwen2.5
  - DeepSeek-R1-Qwen 1.5B
  - Phi-3.5-mini-instruct
Motivation and Context
LLMs that go through the INT4 quantization pass in Olive produce models containing MatMulNBits contrib ops. To run these ops via QNN-EP, MatMulNBits is translated to the QNN FullyConnected op with INT4 weights.
@chilo-ms Could you please trigger CI?
/azp run Windows ARM64 QNN CI Pipeline
Azure Pipelines successfully started running 1 pipeline(s).
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 3 pipeline(s).
@edgchen1 Please hold off on merging this pull request. There might be an issue with this change.
We are seeing NaN outputs for Qwen and DeepSeek 1.5B using MatMulNBits that were triaged to this change.
The NaN issue has been identified and fixed in the QNN GPU backend. Confirmed that there are no issues with this PR.
I tested with the next release of QNN GPU, 2.40. The issue still seems to be present. Let's discuss offline and clarify things before this change is merged.
@edgchen1 Thanks for holding off. This issue is verified as fixed with QNN SDK 2.41. Please proceed with the merge.
@edgchen1 Thanks for the review and suggestions. We addressed the comments and rebased the PR. Could you please review and approve, and trigger CI as well?
/azp run Linux QNN CI Pipeline,Windows ARM64 QNN CI Pipeline
Command 'Linux' is not supported by Azure Pipelines.
Supported commands
- help:
- Get descriptions, examples and documentation about supported commands
- Example: help "command_name"
- list:
- List all pipelines for this repository using a comment.
- Example: "list"
- run:
- Run all pipelines or specific pipelines for this repository using a comment. Use this command by itself to trigger all related pipelines, or specify specific pipelines to run.
- Example: "run" or "run pipeline_name, pipeline_name, pipeline_name"
- where:
- Report back the Azure DevOps orgs that are related to this repository and org
- Example: "where"
See additional documentation.
Azure Pipelines successfully started running 2 pipeline(s).
/azp run Linux QNN CI Pipeline,Windows ARM64 QNN CI Pipeline
Azure Pipelines successfully started running 2 pipeline(s).
/azp run Linux QNN CI Pipeline,Windows ARM64 QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).