
[Mobile] NNAPI does not work even though the Model Usability Checker gives a positive result

Open niedev opened this issue 1 year ago • 7 comments

Describe the issue

I converted madlad 3b (without kv-cache, split into encoder and decoder) to ONNX using the PyTorch conversion tool (torch.onnx.export, with the axes fixed at 128) and performed static int8 quantization of the decoder (using HF Optimum, leaving the Add, Softmax, Mul and Unsqueeze operators in fp32). Both the encoder (dynamically quantized) and the statically quantized decoder work perfectly with onnxruntime on CPU. I also wanted to use NNAPI for the quantized decoder, so I ran the Model Usability Checker tool on the decoder and got full compatibility with NNAPI:

[screenshot: Model Usability Checker output reporting full NNAPI compatibility]

I also verified that the operators mentioned in the caveats satisfy the stated conditions (they do, given that the quantization is static and the input axes are fixed).

However, when running the same decoder on Android with the NNAPI option:

// Enable the NNAPI EP and disable its NNAPI CPU fallback device
OrtSession.SessionOptions decoderOptions = new OrtSession.SessionOptions();
EnumSet<NNAPIFlags> flags = EnumSet.of(NNAPIFlags.CPU_DISABLED);
decoderOptions.addNnapi(flags);
// Verbose session logging to see which nodes NNAPI accepts
decoderOptions.setSessionLogLevel(OrtLoggingLevel.ORT_LOGGING_LEVEL_VERBOSE);
decoderOptions.setSessionLogVerbosityLevel(0);
decoderSession = onnxEnv.createSession(decoderPath, decoderOptions);

I get much worse performance than on CPU (around 700 ms per decoder run on CPU vs 4000 ms with NNAPI). Setting the session log to verbose gives the following result:

[W:onnxruntime:ort-java, nnapi_execution_provider.cc:225 GetCapability] NnapiExecutionProvider::GetCapability,
number of partitions supported by NNAPI: 322
number of nodes in the graph: 5067
number of nodes supported by NNAPI: 3126

In practice, the Model Usability Checker says the model has a single partition supported by NNAPI with all nodes supported, while the session logger on Android reports 322 partitions and only 3126 supported nodes out of 5067.

I tried using only the basic graph optimizations (as the Model Usability Checker does), updating the opset version from 11 to 20 with onnxruntime.tools.update_onnx_opset, and converting the decoder to ORT format, but nothing changed; the result is always the same.
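For reference, here is a minimal sketch of that opset bump using the stock onnx version converter (onnxruntime.tools.update_onnx_opset wraps similar logic; the paths below are hypothetical):

import onnx
from onnx import version_converter

# Bump the decoder from opset 11 to 20; paths are hypothetical
model = onnx.load("onnx/Madlad/Script/Madlad_decoder_complete.onnx")
converted = version_converter.convert_version(model, 20)
# A 3B-parameter model exceeds the 2 GB protobuf limit, so save the
# weights as external data
onnx.save_model(converted, "onnx/Madlad/Script/Madlad_decoder_opset20.onnx",
                save_as_external_data=True)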

To reproduce

Code of the static quantization:

import functools
import os

import onnxruntime
from onnxruntime.quantization import QuantType
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoCalibrationConfig, QuantizationConfig
from transformers import T5Tokenizer

# ORT_DEFAULT_OPS_STATIC_QUANTIZATION_QOPS, default_quantization_parameters and
# preprocess_fn_Madlad are helpers defined elsewhere in my script

def static_quantization_optimum():
    model_dir = "onnx/Madlad/Script"
    model_name = 'jbochi/madlad400-3b-mt'
    quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="Madlad_decoder_complete.onnx")
    # quantize the default static-quantization ops, except Add, Softmax, Mul and Unsqueeze (left in fp32)
    operators_to_quantize_in = ORT_DEFAULT_OPS_STATIC_QUANTIZATION_QOPS
    operators_to_quantize_in.remove('Add')
    operators_to_quantize_in.remove('Softmax')
    operators_to_quantize_in.remove('Mul')
    operators_to_quantize_in.remove('Unsqueeze')
    format, mode, operators_to_quantize = default_quantization_parameters(True, operators_to_quantize=operators_to_quantize_in)

    qconfig = QuantizationConfig(
        is_static=True,
        format=format,
        mode=mode,
        activations_dtype=QuantType.QUInt8,
        activations_symmetric=True,
        weights_dtype=QuantType.QInt8,
        weights_symmetric=True,
        per_channel=False,
        reduce_range=False,
        nodes_to_quantize=[],
        nodes_to_exclude=[],
        operators_to_quantize=operators_to_quantize,
    )
    # loading of the tokenizer
    tokenizerEn = T5Tokenizer.from_pretrained(model_name)
    # loading encoder session (to get the input "encoder_hidden_state" for the decoder)
    providers = ['CPUExecutionProvider']
    encoder_session = onnxruntime.InferenceSession("onnx/Madlad/Optimum/encoder_model.onnx",providers=providers)

    # Create the calibration dataset
    calibration_samples = 120   # 38 for preprocess_fn_test
    calibration_dataset = quantizer.get_calibration_dataset(
        "opus100",
        dataset_config_name="en-it",
        preprocess_function=functools.partial(preprocess_fn_Madlad, tokenizer=tokenizerEn,encoder_session=encoder_session),
        num_samples=calibration_samples,
        dataset_split="train",
        # preprocess_batch=False
    )

    calibration_config = AutoCalibrationConfig.entropy(calibration_dataset)

    # free RAM by dropping resources that are no longer needed
    del encoder_session

    # Perform the calibration step: computes the activations quantization ranges (RAM optimized)
    shards = 4  
    for i in range(shards):
        shard = calibration_dataset.shard(shards, i)
        quantizer.partial_fit(
            dataset=shard,
            calibration_config=calibration_config,
            operators_to_quantize=qconfig.operators_to_quantize,
            batch_size=1,   #calibration_samples//shards
            use_external_data_format=True,
        )
    ranges = quantizer.compute_ranges()

    # remove temp augmented model
    os.remove("augmented_model.onnx")

    model_quantized_path = quantizer.quantize(
        save_dir="onnx/Madlad/Script/StaticQuantization/",
        calibration_tensors_range=ranges,
        quantization_config=qconfig,
        use_external_data_format=True
    )

If you need the ONNX model of the quantized decoder let me know; I can upload it to my GitHub and put the link in the comments.

Urgency

Not so urgent

Platform

Android

OS Version

14 (api 34)

ONNX Runtime Installation

Released Package

Compiler Version (if 'Built from Source')

No response

Package Name (if 'Released Package')

onnxruntime-android

ONNX Runtime Version or Commit ID

1.17

ONNX Runtime API

Java/Kotlin

Architecture

ARM64

Execution Provider

NNAPI

Execution Provider Library Version

No response

niedev avatar Feb 21 '24 14:02 niedev

If you set the default logger severity to VERBOSE, what does it say about the unsupported nodes? This needs to be done when creating the environment, i.e. the first call to getEnvironment should look something like OrtEnvironment.getEnvironment(OrtLoggingLevel.ORT_LOGGING_LEVEL_VERBOSE).

skottmckay avatar Feb 22 '24 05:02 skottmckay

Ok, maybe the problem is that the matrices should be 2D instead of 3D? (Currently the batch size is fixed at 1, so the inner matrices are 3D, of shape [1, 128, 1024].) Here is the part of the log about the supported nodes (I have included only a part because the complete log was too long for a GitHub comment):

Log excerpt:
[V:onnxruntime:ort-java, nnapi_execution_provider.cc:102 GetCapability] Effective NNAPI feature level: 1000008
11:36:31.321  V   [V:onnxruntime:ort-java, base_op_builder.cc:146 HasSupportedInputOutputsImpl] [Reshape] Input type: [7] is not supported for now
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [Reshape] index: [358] name: [/Reshape] as part of the NodeUnit type: [Reshape] index: [358] name: [/Reshape]
11:36:31.321  V   [V:onnxruntime:ort-java, base_op_builder.cc:170 IsNodeUnitTypeSupported] QDQ NodeUnit [Gather] is not supported for now
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [2] name: [embed_tokens.weight_DequantizeLinear] as part of the NodeUnit type: [Gather] index: [364] name: [/embed_tokens/Gather]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [6] name: [onnx::MatMul_4989_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [982] name: [/block.0/layer.0/SelfAttention/k/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [5] name: [onnx::MatMul_4973_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [981] name: [/block.0/layer.0/SelfAttention/q/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, base_op_builder.cc:170 IsNodeUnitTypeSupported] QDQ NodeUnit [Gather] is not supported for now
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [1] name: [block.0.layer.0.SelfAttention.relative_attention_bias.weight_DequantizeLinear] as part of the NodeUnit type: [Gather] index: [984] name: [/block.0/layer.0/SelfAttention/relative_attention_bias/Gather]
11:36:31.321  V   [V:onnxruntime:ort-java, base_op_builder.cc:146 HasSupportedInputOutputsImpl] [Unsqueeze] Input type: [7] is not supported for now
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [Unsqueeze] index: [0] name: [/Unsqueeze_1] as part of the NodeUnit type: [Unsqueeze] index: [0] name: [/Unsqueeze_1]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [7] name: [onnx::MatMul_4990_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [983] name: [/block.0/layer.0/SelfAttention/v/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [8] name: [onnx::MatMul_4998_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [1033] name: [/block.0/layer.0/SelfAttention/o/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [10] name: [onnx::MatMul_5000_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [1047] name: [/block.0/layer.1/EncDecAttention/q/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [1] Operator type: [QuantizeLinear] index: [4] name: [encoder_hidden_states_QuantizeLinear] as part of the NodeUnit type: [QuantizeLinear] index: [4] name: [encoder_hidden_states_QuantizeLinear]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [11] name: [onnx::MatMul_5016_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [368] name: [/block.0/layer.1/EncDecAttention/k/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, base_op_builder.cc:146 HasSupportedInputOutputsImpl] [Unsqueeze] Input type: [7] is not supported for now
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [Unsqueeze] index: [3] name: [/Unsqueeze_3] as part of the NodeUnit type: [Unsqueeze] index: [3] name: [/Unsqueeze_3]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [12] name: [onnx::MatMul_5017_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [369] name: [/block.0/layer.1/EncDecAttention/v/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [13] name: [onnx::MatMul_5022_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [1073] name: [/block.0/layer.1/EncDecAttention/o/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [14] name: [onnx::MatMul_5024_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [1087] name: [/block.0/layer.2/DenseReluDense/wi_0/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [15] name: [onnx::MatMul_5025_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [1088] name: [/block.0/layer.2/DenseReluDense/wi_1/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [16] name: [onnx::MatMul_5026_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [1104] name: [/block.0/layer.2/DenseReluDense/wo/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [18] name: [onnx::MatMul_5044_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [1119] name: [/block.1/layer.0/SelfAttention/k/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [17] name: [onnx::MatMul_5028_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [1118] name: [/block.1/layer.0/SelfAttention/q/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [19] name: [onnx::MatMul_5045_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [1120] name: [/block.1/layer.0/SelfAttention/v/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [20] name: [onnx::MatMul_5050_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [1162] name: [/block.1/layer.0/SelfAttention/o/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [21] name: [onnx::MatMul_5052_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [1176] name: [/block.1/layer.1/EncDecAttention/q/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [22] name: [onnx::MatMul_5068_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [370] name: [/block.1/layer.1/EncDecAttention/k/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [23] name: [onnx::MatMul_5069_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [371] name: [/block.1/layer.1/EncDecAttention/v/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [24] name: [onnx::MatMul_5074_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [1202] name: [/block.1/layer.1/EncDecAttention/o/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [25] name: [onnx::MatMul_5076_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [1216] name: [/block.1/layer.2/DenseReluDense/wi_0/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [26] name: [onnx::MatMul_5077_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [1217] name: [/block.1/layer.2/DenseReluDense/wi_1/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [27] name: [onnx::MatMul_5078_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [1233] name: [/block.1/layer.2/DenseReluDense/wo/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [29] name: [onnx::MatMul_5096_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [1248] name: [/block.2/layer.0/SelfAttention/k/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [28] name: [onnx::MatMul_5080_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [1247] name: [/block.2/layer.0/SelfAttention/q/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [30] name: [onnx::MatMul_5097_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [1249] name: [/block.2/layer.0/SelfAttention/v/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [31] name: [onnx::MatMul_5102_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [1291] name: [/block.2/layer.0/SelfAttention/o/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [32] name: [onnx::MatMul_5104_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [1305] name: [/block.2/layer.1/EncDecAttention/q/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [33] name: [onnx::MatMul_5120_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [372] name: [/block.2/layer.1/EncDecAttention/k/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [34] name: [onnx::MatMul_5121_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [373] name: [/block.2/layer.1/EncDecAttention/v/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [35] name: [onnx::MatMul_5126_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [1331] name: [/block.2/layer.1/EncDecAttention/o/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [36] name: [onnx::MatMul_5128_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [1345] name: [/block.2/layer.2/DenseReluDense/wi_0/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
11:36:31.321  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [37] name: [onnx::MatMul_5129_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [1346] name: [/block.2/layer.2/DenseReluDense/wi_1/MatMul]
11:36:31.321  V   [V:onnxruntime:ort-java, op_builder_helpers.cc:236 IsSupportedBatchMatMul] Unsupported op type: QDQ MatMul
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:304 IsOpSupportedImpl] Supported batch matmul: [0]
11:36:31.321  V   [V:onnxruntime:ort-java, gemm_op_builder.cc:322 IsOpSupportedImpl] A must be 2D
[...]
[V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [MatMul] index: [5103] name: [/block.31/layer.2/DenseReluDense/wo/MatMul] as part of the NodeUnit type: [MatMul] index: [5103] name: [/block.31/layer.2/DenseReluDense/wo/MatMul]
11:36:31.369  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [QuantizeLinear] index: [5104] name: [/block.31/layer.2/DenseReluDense/wo/MatMul_output_0_QuantizeLinear] as part of the NodeUnit type: [MatMul] index: [5103] name: [/block.31/layer.2/DenseReluDense/wo/MatMul]
11:36:31.369  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [1] Operator type: [DequantizeLinear] index: [5105] name: [/block.31/layer.2/DenseReluDense/wo/MatMul_output_0_DequantizeLinear] as part of the NodeUnit type: [DequantizeLinear] index: [5105] name: [/block.31/layer.2/DenseReluDense/wo/MatMul_output_0_DequantizeLinear]
11:36:31.369  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [1] Operator type: [Add] index: [5106] name: [/block.31/layer.2/Add] as part of the NodeUnit type: [Add] index: [5106] name: [/block.31/layer.2/Add]
11:36:31.369  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [1] Operator type: [Pow] index: [5108] name: [/final_layer_norm/Pow] as part of the NodeUnit type: [Pow] index: [5108] name: [/final_layer_norm/Pow]
11:36:31.369  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [1] Operator type: [ReduceMean] index: [5109] name: [/final_layer_norm/ReduceMean] as part of the NodeUnit type: [ReduceMean] index: [5109] name: [/final_layer_norm/ReduceMean]
11:36:31.369  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [1] Operator type: [Add] index: [5110] name: [/final_layer_norm/Add] as part of the NodeUnit type: [Add] index: [5110] name: [/final_layer_norm/Add]
11:36:31.369  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [1] Operator type: [Sqrt] index: [5111] name: [/final_layer_norm/Sqrt] as part of the NodeUnit type: [Sqrt] index: [5111] name: [/final_layer_norm/Sqrt]
11:36:31.369  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [1] Operator type: [Div] index: [5112] name: [/final_layer_norm/Div] as part of the NodeUnit type: [Div] index: [5112] name: [/final_layer_norm/Div]
11:36:31.369  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [1] Operator type: [Mul] index: [5114] name: [/final_layer_norm/Mul_1] as part of the NodeUnit type: [Mul] index: [5114] name: [/final_layer_norm/Mul_1]
11:36:31.369  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [1] Operator type: [QuantizeLinear] index: [5115] name: [last_hidden_state_QuantizeLinear] as part of the NodeUnit type: [QuantizeLinear] index: [5115] name: [last_hidden_state_QuantizeLinear]
11:36:31.369  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [DequantizeLinear] index: [5116] name: [last_hidden_state_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [5117] name: [/MatMul]
11:36:31.369  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [MatMul] index: [5117] name: [/MatMul] as part of the NodeUnit type: [MatMul] index: [5117] name: [/MatMul]
11:36:31.369  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [0] Operator type: [QuantizeLinear] index: [5118] name: [logits_QuantizeLinear] as part of the NodeUnit type: [MatMul] index: [5117] name: [/MatMul]
11:36:31.369  V   [V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [1] Operator type: [DequantizeLinear] index: [5119] name: [logits_DequantizeLinear] as part of the NodeUnit type: [DequantizeLinear] index: [5119] name: [logits_DequantizeLinear]
NnapiExecutionProvider::GetCapability, number of partitions supported by NNAPI: 322 number of nodes in the graph: 5067 number of nodes supported by NNAPI: 3126

niedev avatar Feb 22 '24 11:02 niedev

A similar thing happens even with the fp32 decoder simply converted to fp16:


NnapiExecutionProvider::GetCapability, number of partitions supported by NNAPI: 1 number of nodes in the graph: 2119 number of nodes supported by NNAPI: 1

niedev avatar Feb 22 '24 14:02 niedev

Changing the data type shouldn't change the rank of any values, so the 3D input will still be 3D.

The usability checker is intended as a rough guide. It's in Python, and replicating all the low-level checks, as well as keeping them up to date with the C++ implementation that determines whether a node can be supported, would have a lot of cost for minimal benefit. FWIW, we have some upcoming changes to the NNAPI EP logging so that all the messages come from the same logger (so there will be no need to set the log level to verbose in both the environment and the session). That should at least make the reason a node is not supported easier to find.

The NNAPI handling for a 3D input (as per this) appears to differ from ONNX (which uses numpy handling, as per this), so we can't use it directly.

We have a manual batched MatMul implementation, but it currently only supports fp32. Supporting a quantized model requires additional logic around handling the quantization parameters.

Right now the NNAPI EP doesn't have fp16 support implemented either. That's because we typically need a good fallback path to the CPU, and an fp16 model generally doesn't have one. There is limited support for fp16 on CPU, so an fp16 model will typically cast data to fp32 for a lot of operations. This conversion back and forth between fp32 and fp16 often makes performance worse vs. using an fp32 model.

NNAPI support/performance on Android devices is a semi-random adventure. Unless there's one specific device you're deploying to, you really have to measure performance on each individual device to know whether NNAPI or CPU is going to be faster.

skottmckay avatar Feb 23 '24 00:02 skottmckay

I know that the fp16 data type doesn't change the dimensions of the matrices, but I wasn't sure whether the problem was with 3D matrices and I wanted to check whether it had to do with quantization. However, for testing purposes, I manually inserted a Squeeze (followed by an Unsqueeze) to convert the input of one of the MatMuls to 2D (128 x 1024; the other input was already 2D):

[screenshot: the graph with the inserted Squeeze and Unsqueeze around the MatMul]

And even though that MatMul is now 2D, the app crashes only when I use NNAPI (with the error "unsupported operator MatMul"), while with CPU it works correctly. I checked the log and couldn't find the [MatMul] operator that I changed, but I found this error:

[V:onnxruntime:ort-java, op_builder_helpers.cc:261 IsSupportedBatchMatMul] A and B must have at least three dimensions and have the same leading dimensions except for the last two. A shape: [ 128 1024 ], B shape: [ 1024 2048 ]

And the dimensions correspond to those of the matrices multiplied by the MatMul that I modified. So a 3D MatMul does not work with NNAPI, but a 2D MatMul crashes with NNAPI; why? Maybe I set the quantization wrong? So far I have tried:

activations: QUInt8, asymmetric, weights: QUInt8, asymmetric, format: QDQ, mode: QLinearOps

And other combinations (from the zero_points in the screenshot you can see that I also tried QInt8 and symmetric).

What are the static quantization options supported by NNAPI with onnxruntime?

niedev avatar Feb 23 '24 18:02 niedev

Ok, I managed to make the 2D MatMul compatible using QLinearMatMul, thanks to these quantization settings (sketched below) and leaving the graph optimization at ALL_OPT:

activations: QUInt8, symmetric, weights: QUInt8, symmetric, format: QOperator, mode: QLinearOps
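For reference, a minimal sketch of those settings expressed with optimum's QuantizationConfig, assuming the same script as above (QuantFormat and QuantizationMode come from onnxruntime.quantization; the operators_to_quantize scope is an assumption):

from onnxruntime.quantization import QuantFormat, QuantizationMode, QuantType
from optimum.onnxruntime.configuration import QuantizationConfig

# QOperator format + QLinearOps mode with u8 symmetric activations and
# weights: the combination that produced an NNAPI-compatible QLinearMatMul
qconfig = QuantizationConfig(
    is_static=True,
    format=QuantFormat.QOperator,
    mode=QuantizationMode.QLinearOps,
    activations_dtype=QuantType.QUInt8,
    activations_symmetric=True,
    weights_dtype=QuantType.QUInt8,
    weights_symmetric=True,
    per_channel=False,
    reduce_range=False,
    operators_to_quantize=['MatMul'],  # assumed scope, adjust as needed
)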

But, according to the onnxruntime log and source code, the Reshape, Unsqueeze, Gather, Mul, Cast, and Transpose operators only support fp32 matrices/values as input. Is this true, or am I missing something? If so, I cannot make the model compatible with NNAPI (considering that many of these operators are applied to input_ids or the attention_mask, which are of type int64).

niedev avatar Feb 24 '24 14:02 niedev

The 2D MatMul should work with NNAPI without needing to use QLinearMatMul. Was there log output saying why it was considered unsupported immediately prior to the 'unsupported operator' error?

u8u8 should be supported with QDQ MatMul

https://github.com/microsoft/onnxruntime/blob/430a086f22684ad0020819dc3e7712f36fe9f016/onnxruntime/core/providers/nnapi/nnapi_builtin/builders/helper.cc#L166-L176

Can you share the model that had that issue? Or extract part of the model that shows the problem by using extract_model from https://github.com/onnx/onnx/blob/main/docs/PythonAPIOverview.md#utility-functions?
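For example, a minimal extract_model sketch (the tensor names are placeholders; substitute whatever surrounds the failing MatMul):

import onnx.utils

# Cut the subgraph between the chosen input and output tensors.
# Both tensor names below are hypothetical examples.
onnx.utils.extract_model(
    "Madlad_decoder_complete.onnx",          # source model
    "madlad_decoder_matmul_slice.onnx",      # extracted slice
    input_names=["encoder_hidden_states"],
    output_names=["/block.0/layer.1/EncDecAttention/k/MatMul_output_0"],
)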

IsSupportedBatchMatMul output can be ignored for 2D input. We probably shouldn't output anything for that condition, given we always call IsSupportedBatchMatMul first.

https://github.com/microsoft/onnxruntime/blob/430a086f22684ad0020819dc3e7712f36fe9f016/onnxruntime/core/providers/nnapi/nnapi_builtin/builders/impl/gemm_op_builder.cc#L298-L301

Reshape/Unsqueeze/Gather/Mul/Transpose should support 8-bit, e.g. the code here checks the quantization parameters:

https://github.com/microsoft/onnxruntime/blob/430a086f22684ad0020819dc3e7712f36fe9f016/onnxruntime/core/providers/nnapi/nnapi_builtin/builders/impl/reshape_op_builder.cc#L140-L156

Cast only supports casting to float or int32. Not sure why it doesn't allow casting to other types, but I assume casting to 8-bit isn't meaningful as you'd have no quantization parameters.

skottmckay avatar Feb 26 '24 00:02 skottmckay

Ok, thank you. I uploaded the complete models in the releases of this repository (which also contains the demo app I made to test the inference); the model with the MatMul that crashes is "Madlad Decoder QDQ MatMul".

The error log is as follows:

[V:onnxruntime:ort-java, model_builder.cc:441 AddOperations] Adding node [636] 
[V:onnxruntime:ort-java, gemm_op_builder.cc:384 IsOpSupportedImpl] B of MatMul must be known 
Shutting down VM 
AndroidRuntime  com.bluetooth.communicatorexample  E  FATAL EXCEPTION: main 
                                  Process: com.bluetooth.communicatorexample, PID: 27728 
                                  java.lang.RuntimeException: ai.onnxruntime.OrtException: Error code - ORT_FAIL - message: base_op_builder.cc:55 AddToModelBuilder Unsupported operator MatMul 

Node [636] is the MatMul I modified to make it 2D (name: /block.0/layer.1/EncDecAttention/k/MatMul). In the rest of the log this node appears only in lines declaring that QuantizeLinear or other nodes linked to [636] are supported, for example:

[V:onnxruntime:ort-java, nnapi_execution_provider.cc:151 operator()] Node supported: [1] Operator type: [DequantizeLinear] index: [568] name: [encoder_hidden_states_squeezed_DequantizeLinear] as part of the NodeUnit type: [MatMul] index: [636] name: [/block.0/layer.1/EncDecAttention/k/MatMul] 

It seems the problem is that one of the inputs is not known, but this problem does not occur on CPU or with the model quantized with QLinearMatMul (inference works without problems in both cases).

However, I will try to quantize the operators mentioned (Reshape/Unsqueeze/Gather/Mul/Transpose) to make them compatible.

I have another doubt though: I should be able to make the Cast produce only int32 or fp32 values as output, but what are the supported input types? (I didn't find them in the source code.)

niedev avatar Feb 26 '24 19:02 niedev

There's an issue with how groups of QDQ nodes (2xDQ -> MatMul -> Q) are being processed. We're looking at nodes in topological order, but this isn't good enough: the DQ node for the initializer is looked at very early, we determine it is supported, but the nodes in that partition get broken up by an unsupported node before we reach the actual MatMul node. That breaks the ability to convert the 2xDQ -> MatMul -> Q group into a quantized NNAPI operation.

We probably need to special-case an initializer -> DQ so that we delay processing it until we reach the fp32 node (MatMul in this case) it provides input to, guaranteeing it ends up in the same partition.

skottmckay avatar Feb 28 '24 09:02 skottmckay

Ok, maybe in my case the unsupported nodes are Squeeze and Unsqueeze. I don't know why, but if I don't quantize these nodes, or if I quantize them with the QDQ format, I get this error:

base_op_builder.cc:170 IsNodeUnitTypeSupported] QDQ NodeUnit [Squeeze] is not supported for now

If instead I quantize them with the QOperator format I receive this error:

base_op_builder.cc:146 HasSupportedInputOutputsImpl] [Squeeze] Input type: [2] is not supported for now 

And in both cases I also get the classic: Node supported: [0] Operator type: [Squeeze]

I did the QOperator quantization only on the MatMuls and then the QDQ quantization on the other operators, and now the only unsupported operators are: Cast (even when the output is fp32), Squeeze, and Unsqueeze.

Do you know how to make Squeeze and Unsqueeze compatible? Or are there other methods to make the MatMuls 2D? (Right now I use Squeeze and Unsqueeze to do this, but if they are not NNAPI compatible I can't use them.)

niedev avatar Feb 28 '24 11:02 niedev

Do you know how to make Squeeze and Unsqueeze compatible?

The NNAPI EP doesn't currently support QDQ Squeeze and QDQ Unsqueeze.

Or if there are other methods to make MatMuls 2D?

QDQ Reshape should be supported.
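For instance, a rough graph-surgery sketch replacing a Squeeze with an equivalent fixed-shape Reshape (the paths, node name and shape below are hypothetical; adjust to your graph):

import onnx
from onnx import TensorProto, helper

model = onnx.load("decoder.onnx")  # hypothetical path
graph = model.graph

# Reshape needs its target shape as an int64 initializer
shape_init = helper.make_tensor("squeeze_shape", TensorProto.INT64, [2], [128, 1024])
graph.initializer.append(shape_init)

for i, node in enumerate(graph.node):
    if node.op_type == "Squeeze" and node.name == "/target/Squeeze":  # hypothetical name
        # Same data input and same output, but expressed as a Reshape
        reshape = helper.make_node(
            "Reshape",
            inputs=[node.input[0], "squeeze_shape"],
            outputs=list(node.output),
            name=node.name + "_as_reshape",
        )
        graph.node.remove(node)
        graph.node.insert(i, reshape)
        break

onnx.save(model, "decoder_reshape.onnx")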

edgchen1 avatar Feb 28 '24 20:02 edgchen1

But they are incompatible even when I don't quantize Squeeze and Unsqueeze

niedev avatar Feb 28 '24 20:02 niedev

Anyway thanks, I'll try reshape 👍

niedev avatar Feb 28 '24 23:02 niedev

After inserting all the necessary Reshapes, all the nodes are supported except one Gather. But the problem is that NNAPI only uses the CPU, both on my phone and on another one (in fact the performance has not improved), both with Snapdragon SoCs (8+ Gen 1 and 778G):

DeviceManager::DeviceManager
Failed to parse result of GetServerConfigurableFlag, errno=34
findAvailableDevices
[V:onnxruntime:ort-java, nnapi_execution_provider.cc:74 NnapiExecutionProvider] Found devices [] in NNAPI
[I:onnxruntime:, inference_session.cc:1583 Initialize] Initializing session.
[I:onnxruntime:, inference_session.cc:1620 Initialize] Adding default CPU execution provider.

(I also tested with a phone running an Exynos SoC, and in that case the GPU or NPU seems to be used.)

Is it an onnxruntime issue, or are Qualcomm processors not supported by NNAPI? (With the 8+ Gen 1 phone I ran Geekbench ML (which uses TensorFlow Lite), and the NNAPI scores are lower than CPU-only, so I think they are actually not supported 😭)

niedev avatar Mar 04 '24 00:03 niedev

AFAIK Qualcomm chips can work with NNAPI, but it's up to the chip vendor to implement the low-level NNAPI interface on a chip-by-chip basis.

I wouldn't have expected to see an empty device list though, unless there's some flag you're setting that's affecting what's returned.

https://github.com/microsoft/onnxruntime/blob/978c40d85310a1d9b8a6069be853bc3dbec44e18/onnxruntime/core/providers/nnapi/nnapi_builtin/nnapi_execution_provider.cc#L66-L76

https://github.com/microsoft/onnxruntime/blob/978c40d85310a1d9b8a6069be853bc3dbec44e18/onnxruntime/core/providers/nnapi/nnapi_builtin/nnapi_api_helper.cc#L68-L69

There may be output from NNAPI itself in the log that explains it if you try: adb shell setprop debug.nn.vlog 1

skottmckay avatar Mar 12 '24 01:03 skottmckay

Yes, I was using the CPU_DISABLED flag; without it I only get one available device (I imagine the CPU). At first I thought that most processors supported NNAPI (I couldn't verify because there are no lists of compatible SoCs), but now I have some doubts, so I decided to use onnxruntime on CPU only and to use the kv-cache; that way I was able to get good performance. Thanks so much anyway for the help!

niedev avatar Mar 12 '24 23:03 niedev