
NOT_IMPLEMENTED : Could not find an implementation for ConvInteger(10) node with name 'Conv_0_quant'

Open chinmayjog13 opened this issue 2 years ago • 23 comments

Describe the issue

Following the documentation, I dynamically quantized a ResNet-based model. The model is quantized and saved without error. However, when I try to create an inference session using the quantized model, the code crashes with the following error.

>>> ort_session = ort.InferenceSession(int8_path, providers=['CPUExecutionProvider'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/chinmay/anaconda3/envs/v_pytorch2/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 360, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/home/chinmay/anaconda3/envs/v_pytorch2/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 408, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.NotImplemented: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for ConvInteger(10) node with name 'Conv_0_quant'

This is a duplicate of #12558, which was closed a few months ago, so I assumed support had been added to onnxruntime, but I am still getting the same error.

To reproduce

import onnx
from onnxruntime.quantization import quantize_dynamic, QuantType

model_fp32 = 'path/to/the/model.onnx'
model_quant = 'path/to/the/model.quant.onnx'
# Dynamic quantization completes without error
quantized_model = quantize_dynamic(model_fp32, model_quant)

import onnxruntime as ort
# Creating the session raises the NOT_IMPLEMENTED error above
ort_session = ort.InferenceSession(model_quant, providers=['CPUExecutionProvider'])

Attachments: fp32_model, quantized_model

Urgency

Not very urgent, but not low priority either.

Platform

Linux

OS Version

Ubuntu 20.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.14.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

chinmayjog13 avatar May 10 '23 14:05 chinmayjog13

I am getting the same issue. I might try reverting to a previous build just to see if it was working before.

jpg-gamepad avatar May 22 '23 18:05 jpg-gamepad

I just tried the previous releases, and they didn't work. Upon inspecting the code, it looks as if the patch may never have made it into a release. Here and here

There should be a line that says class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 10, uint8_t_int8_t, ConvInteger);

Maybe @jchen351 knows what happened.

jpg-gamepad avatar May 22 '23 19:05 jpg-gamepad

I'm having the same issue. Can I help get the patch applied?

CrisLS avatar Jun 13 '23 12:06 CrisLS

Same issue here. Do we have a timeline for the patch?

johnyquest7 avatar Jun 17 '23 01:06 johnyquest7

On the similar issue #3130, there is a comment that may temporarily solve the issue. But when we change weight_type=QuantType.QInt8 to QuantType.QUInt8, the quantized ONNX model seems to perform slower, as also mentioned in that issue. That also happened to me.

Besides, I think these issues happen because ConvInteger is not supported for the signed int8 data type, here.
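For reference, the QUInt8 workaround mentioned above is a one-line change; a minimal sketch (paths are placeholders):

from onnxruntime.quantization import quantize_dynamic, QuantType

# Unsigned weights avoid the missing ConvInteger int8 kernel, but may run slower, as noted above
quantize_dynamic('path/to/model.onnx', 'path/to/model.quant.onnx', weight_type=QuantType.QUInt8)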

trnhattan avatar Jun 26 '23 08:06 trnhattan

In the ONNX Runtime docs, in the Method Selection subsection, they say:

In general, it is recommended to use dynamic quantization for RNNs and transformer-based models, and static quantization for CNN models.

So I tried to follow the end-to-end example given in the docs, and it worked; the weights are int8. Here are the steps I followed (a rough sketch is shown after the list):

  1. Convert FaceNet-InceptionResNet to ONNX model.
  2. Create CalibrationDataReader by using some facial images.
  3. Execute quantize_static()
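A rough sketch of those steps, assuming the calibration images were already preprocessed and saved as .npy arrays and the model input is named "input" (both assumptions; adjust to your model):

import glob

import numpy as np
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static

class FaceCalibrationReader(CalibrationDataReader):
    """Feeds a few preprocessed facial images to the static-quantization calibrator."""

    def __init__(self, image_dir, input_name="input"):
        self.files = iter(sorted(glob.glob(f"{image_dir}/*.npy")))
        self.input_name = input_name

    def get_next(self):
        path = next(self.files, None)
        if path is None:
            return None  # tells the calibrator the data is exhausted
        # Each .npy file is assumed to hold one preprocessed image of shape (C, H, W)
        return {self.input_name: np.load(path)[None, ...].astype(np.float32)}

quantize_static(
    "facenet_inception_resnet.onnx",       # step 1: the exported FP32 model (name is hypothetical)
    "facenet_inception_resnet_int8.onnx",  # output path for the statically quantized model
    FaceCalibrationReader("calibration_images"),  # step 2: calibration data reader
    weight_type=QuantType.QInt8,           # step 3: weights are stored as int8
)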

trnhattan avatar Jun 27 '23 01:06 trnhattan

I have the same problem. Unfortunately, I am finding that onnxruntime does not support the ConvInteger operator, which means dynamic quantization does not work in onnxruntime if the initial model has CNNs inside. Very sad!

Apisteftos avatar Aug 06 '23 16:08 Apisteftos

I found a workaround to solve this, by setting:

quantize_dynamic(input_model, output_model, weight_type=QuantType.QInt8, nodes_to_exclude=['/conv1/Conv'])

Here, nodes_to_exclude should be the list of Conv layer names in your model. You may find them in the error message raised when loading the model with InferenceSession. For example, for the error message mentioned in the title of this issue, it would be: 'Conv_0_quant'.

greyovo avatar Aug 09 '23 15:08 greyovo

Another workaround is to exclude all operators causing the issue by listing only the operator types you actually want quantized. For example:

quantize_dynamic(input_model, output_model, weight_type=QuantType.QInt8, operators_to_quantize=['MatMul', 'Attention', 'LSTM', 'Gather', 'Transpose', 'EmbedLayerNormalization'])

"Conv" was removed from operators_to_quantize.

tommate avatar Aug 24 '23 20:08 tommate

I'm also facing the same error and have tried solving this problem.

Ibrah-N avatar Sep 05 '23 02:09 Ibrah-N

Same issue for a SegFormer model: NotImplemented: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for ConvInteger(10) node with name '/segformer/encoder/patch_embeddings.0/proj/Conv_quant'

ogencoglu avatar Nov 13 '23 17:11 ogencoglu

I just hit this issue too. I will skip quantizing Conv operators for now.

theonewolf avatar Nov 13 '23 18:11 theonewolf

Same error; it only works with QUInt8 but not QInt8.

hu8813 avatar Dec 14 '23 22:12 hu8813

I found a workaround to solve this, by setting:

quantize_dynamic(input_model, output_model, weight_type=QuantType.QInt8, nodes_to_exclude=['/conv1/Conv'])

Here, nodes_to_exclude should be the list of Conv layer names in your model. You may find them in the error message raised when loading the model with InferenceSession. For example, for the error message mentioned in the title of this issue, it would be: 'Conv_0_quant'.

Hi, I tried this while trying to quantize a Whisper model (seq2seq) (see here); however, I am getting the following error:

NotImplemented: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for ConvInteger(10) node with name '/conv1/Conv_quant'

I tried to use nodes_to_exclude in my AutoQuantizationConfig to exclude the node, but the error is still the same:

qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False, nodes_to_exclude=['/conv1/Conv_quant'])

Any help would be appreciated! 🙏

joelleoqiyi avatar May 22 '24 03:05 joelleoqiyi

I had to better describe the operators to quantize and the following worked for me:

dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
dqconfig.nodes_to_exclude = ['Conv_quant']
dqconfig.operators_to_quantize = ['MatMul', 'Attention', 'LSTM', 'Gather', 'Transpose', 'EmbedLayerNormalization']

Hope this helps, Thomas

tommate avatar May 22 '24 06:05 tommate

still only works with QUInt8 for me

moritzsur avatar May 24 '24 15:05 moritzsur

@ogencoglu

Were you able to solve the issue? I am facing the same problem with quantization, and I also see longer inference time with ONNX (can be seen here); any help will be appreciated.

Fazankabir avatar Jun 20 '24 11:06 Fazankabir

@fazankabir No solution found so far.

ogencoglu avatar Jun 20 '24 12:06 ogencoglu

I am able to do quantization with:

model_fp32 = 'model_Segformer.onnx'  
model_quant = "model_dynamic_quant.onnx"
quantized_model = quantize_dynamic(model_fp32, model_quant, weight_type=QuantType.QUInt8) 

instead of QuantType.QInt8

but inference takes even more time. Do you also see longer inference time after exporting the SegFormer to ONNX, @ogencoglu?

Fazankabir avatar Jun 20 '24 12:06 Fazankabir

I haven't tested that (ONNX is a must for my case), but the quantized ONNX model has a longer inference time than the full-precision one, so I ditched SegFormer altogether. @Fazankabir

ogencoglu avatar Jun 20 '24 14:06 ogencoglu

I am able to do quantization with:

model_fp32 = 'model_Segformer.onnx'  
model_quant = "model_dynamic_quant.onnx"
quantized_model = quantize_dynamic(model_fp32, model_quant, weight_type=QuantType.QUInt8) 

instead of QuantType.QInt8

but inference takes even more time. Do you also see longer inference time after exporting the SegFormer to ONNX, @ogencoglu?

Yeah, you're right, a lot of operators in ONNX run on the CPU instead of the GPU after being quantized (maybe they aren't supported yet), so it might take more time to run some quantized models.
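One way to check where each node actually runs is to turn on verbose session logging; in recent onnxruntime builds the session-creation log then lists the node-to-provider placements. A rough sketch (the model path is a placeholder):

import onnxruntime as ort

so = ort.SessionOptions()
so.log_severity_level = 0  # verbose; the initialization log should show which provider each node was assigned to
sess = ort.InferenceSession(
    "model_dynamic_quant.onnx",
    sess_options=so,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

Quantized integer ops that have no CUDA kernel get assigned to the CPU provider, which matches the slowdown described above.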

mattam301 avatar Jun 20 '24 14:06 mattam301

Any news on this topic? I have the same issue with this code:

import onnx
import onnxruntime as ort
import numpy as np

model = onnx.parser.parse_model(
    """
    <ir_version: 5, opset_import: [ "" : 18 ]>
    agraph (int8[N, 3, 32, 32] X) => (int32[N, 16, ?, ?] Y){
        Y = ConvInteger(X, W)
    }
    """
)
model.graph.initializer.extend(
    [onnx.numpy_helper.from_array(np.ones((16, 3, 3, 3), "int8"), "W")]
)

# Inference
feed = {"X": np.ones((1, 3, 32, 32), "int8")}
sess = ort.InferenceSession(model.SerializeToString(), providers=["CPUExecutionProvider"])

# Error is raised here
y = sess.run(None, feed)[0]

Johansmm avatar Oct 02 '24 20:10 Johansmm

  1. Run the dynamic quantization and try to create an inference session:
quantized_model = quantize_dynamic(
    model_path,
    quantized_model_path,
    weight_type=QuantType.QInt8,
    nodes_to_exclude=[],
)
import onnxruntime as ort

ort_session = ort.InferenceSession(
    quantized_model_path, providers=["CPUExecutionProvider"]
)

Loading the model fails, and the error names the offending node:

NotImplemented: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED :
 Could not find an implementation for ConvInteger(10) node with name '/model/patch_embed/proj/Conv_quant'

  2. Add that node to nodes_to_exclude.

Remove the _quant suffix from the reported node name:

/model/patch_embed/proj/Conv_quant => /model/patch_embed/proj/Conv

quantized_model = quantize_dynamic(
    model_path,
    quantized_model_path,
    weight_type=QuantType.QInt8,
    nodes_to_exclude=[
    "/model/patch_embed/proj/Conv"  # /model/patch_embed/proj/Conv_quant => /model/patch_embed/proj/Conv
    ],
)
import onnxruntime as ort

ort_session = ort.InferenceSession(
    quantized_model_path, providers=["CPUExecutionProvider"]
)

openmynet avatar Oct 19 '24 07:10 openmynet

I have a ResNet-based model that I cannot quantize to any format except QuantType.QUInt8; I would like to quantize it to other formats as well, but without excluding the Conv nodes. Also, has anyone managed to run the quantized model on the GPU using CUDAExecutionProvider? Even if I list the CUDA provider first, it does the inference using the CPU provider, which takes more time than the unquantized model... I would really appreciate a way to solve this.
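As a quick sanity check for the CUDA question, you can ask the session which providers it actually registered; a minimal sketch (the model path is a placeholder):

import onnxruntime as ort

sess = ort.InferenceSession(
    "model.quant.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
# If this prints only ['CPUExecutionProvider'], the CUDA provider was never registered
# (e.g. onnxruntime-gpu is not installed or the CUDA/cuDNN versions do not match).
print(sess.get_providers())

Even when CUDAExecutionProvider is registered, nodes without a CUDA kernel (including many of the quantized integer ops) still fall back to the CPU provider.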

JulArr22 avatar Oct 30 '24 13:10 JulArr22

I fixed it by excluding the failing nodes. Most of the failures were in the ConvInteger nodes.

import onnx

model_fp32_path = "./models/tusimple_18_V1_fp32.onnx"
model = onnx.load(model_fp32_path)

# Collect the names of every Conv node so they can be excluded from quantization
conv_nodes = []
for node in model.graph.node:
    if node.op_type == "Conv":
        conv_nodes.append(node.name)
conv_nodes

# ['/model/conv1/Conv',
#  '/model/layer1/layer1.0/conv1/Conv',
#  '/model/layer1/layer1.0/conv2/Conv',
#  '/model/layer1/layer1.1/conv1/Conv',
#  '/model/layer1/layer1.1/conv2/Conv',
#  '/model/layer2/layer2.0/conv1/Conv',
#  '/model/layer2/layer2.0/conv2/Conv',
#  '/model/layer2/layer2.0/downsample/downsample.0/Conv',
#  '/model/layer2/layer2.1/conv1/Conv',
#  '/model/layer2/layer2.1/conv2/Conv',
#  '/model/layer3/layer3.0/conv1/Conv',
#  '/model/layer3/layer3.0/conv2/Conv',
#  '/model/layer3/layer3.0/downsample/downsample.0/Conv',
#  '/model/layer3/layer3.1/conv1/Conv',
#  '/model/layer3/layer3.1/conv2/Conv',
#  '/model/layer4/layer4.0/conv1/Conv',
#  '/model/layer4/layer4.0/conv2/Conv',
#  '/model/layer4/layer4.0/downsample/downsample.0/Conv',
#  '/model/layer4/layer4.1/conv1/Conv',
#  '/model/layer4/layer4.1/conv2/Conv',
#  '/pool/Conv']

from onnxruntime.quantization import quantize_dynamic, QuantType

model_int8_path = "./models/model_int8_dynamic.onnx"
quantized_model = quantize_dynamic(
    model_fp32_path,  # input FP32 model
    model_int8_path,  # output quantized model
    weight_type=QuantType.QInt8,
    nodes_to_exclude=conv_nodes,  # skip every Conv node found above
)
print("Dynamic Quantization Complete!")

abaoxomtieu avatar Mar 09 '25 14:03 abaoxomtieu

And what is the inference speed, @abaoxomtieu? In my use case, partial quantization gets slower than the non-quantized model on CPU.

ogencoglu avatar Mar 09 '25 21:03 ogencoglu

And what is the inference speed, @abaoxomtieu? In my use case, partial quantization gets slower than the non-quantized model on CPU.

It's faster than float16 on my laptop

abaoxomtieu avatar Mar 11 '25 01:03 abaoxomtieu