[QNN-EP] Add MatMulNBits translation for GPU
Description
Add support for translating the MatMulNBits contrib op to the QNN FullyConnected operation with INT4 block-quantized weights.
Implementation details:
- Translate MatMulNBits to FullyConnected in the OpBuilder
- Support QNN_QUANTIZATION_ENCODING_BLOCK for INT4 weights
- Pass INT4 weights and quantization parameters as block-quantization encoding params to QNN
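For context, the math behind block-quantized INT4 weights can be sketched as follows. This is an illustrative reference only, not the actual QNN or ORT implementation: the function name, argument layout, and the assumption of row-major nibble packing (low nibble first) are all hypothetical, and real MatMulNBits/QNN block encoding details may differ.

```python
import numpy as np

def dequantize_int4_blockwise(packed, scales, zero_points, block_size, rows, cols):
    """Reference dequantization of block-quantized INT4 weights.

    packed:      uint8 array, two 4-bit values per byte (low nibble first, assumed)
    scales:      float32 array of shape (rows, n_blocks), one scale per block
    zero_points: int32 array of shape (rows, n_blocks), one zero point per block
    """
    # Unpack two 4-bit values per byte.
    low = packed & 0x0F
    high = (packed >> 4) & 0x0F
    q = np.empty(packed.size * 2, dtype=np.int32)
    q[0::2] = low
    q[1::2] = high
    q = q[: rows * cols].reshape(rows, cols)

    # Apply per-block affine dequantization: w = (q - zero_point) * scale.
    out = np.empty((rows, cols), dtype=np.float32)
    n_blocks = (cols + block_size - 1) // block_size
    for b in range(n_blocks):
        s, e = b * block_size, min((b + 1) * block_size, cols)
        out[:, s:e] = (q[:, s:e] - zero_points[:, b:b + 1]) * scales[:, b:b + 1]
    return out
```

With this encoding, the translated FullyConnected op can consume the packed INT4 tensor directly while QNN applies the per-block scale/zero-point on the fly.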
Testing:
- Added new unit tests for the MatMulNBits -> QNN-GPU translation
- Validated all OnnxRuntime tests
- Validated the following LLMs through the Olive and ORT-GenAI execution flow:
  - LLaMA 3.2 1B
  - Qwen2.5
  - DeepSeek-R1-Qwen 1.5B
  - Phi-3.5-mini-instruct
Motivation and Context
LLMs that go through the INT4 quantization pass in Olive produce models containing MatMulNBits contrib ops. To run these ops via QNN-EP, MatMulNBits is translated to the QNN FullyConnected op with INT4 weights.
@chilo-ms Could you please trigger CI?
/azp run Windows ARM64 QNN CI Pipeline
Azure Pipelines successfully started running 1 pipeline(s).
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 3 pipeline(s).
@edgchen1 Please hold off on merging this pull request. There might be an issue with this change.
We are seeing NaN outputs for Qwen and DeepSeek 1.5B using MatMulNBits that were triaged to this change.
The NaN issue has been identified and fixed in the QNN GPU backend. Confirmed that there are no issues with this PR.
I tested with the next release of QNN GPU, 2.40. The issue still seems to be present. Let's discuss offline and clarify things before this change is merged.
@edgchen1 Thanks for holding off. This issue is verified as fixed with QNN SDK 2.41. Please proceed with the merge.
@edgchen1 Thanks for the review and suggestions. We addressed the comments and rebased the PR. Could you please review and approve, and trigger CI as well?
/azp run Linux QNN CI Pipeline,Windows ARM64 QNN CI Pipeline
Command 'Linux' is not supported by Azure Pipelines.
Supported commands
- help:
- Get descriptions, examples and documentation about supported commands
- Example: help "command_name"
- list:
- List all pipelines for this repository using a comment.
- Example: "list"
- run:
- Run all pipelines or specific pipelines for this repository using a comment. Use this command by itself to trigger all related pipelines, or specify specific pipelines to run.
- Example: "run" or "run pipeline_name, pipeline_name, pipeline_name"
- where:
- Report back the Azure DevOps orgs that are related to this repository and org
- Example: "where"
See additional documentation.
Azure Pipelines successfully started running 2 pipeline(s).
/azp run Linux QNN CI Pipeline,Windows ARM64 QNN CI Pipeline
Azure Pipelines successfully started running 2 pipeline(s).
/azp run Linux QNN CI Pipeline,Windows ARM64 QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).