NOT_IMPLEMENTED : Could not find an implementation for ConvInteger(10) node with name 'Conv_0_quant'
Describe the issue
Following the documentation, I dynamically quantized a ResNet-based model. The model is quantized and saved without error. However, when I try to create an inference session using the quantized model, the code crashes with the following error.
>>> ort_session = ort.InferenceSession(int8_path, providers=['CPUExecutionProvider'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/chinmay/anaconda3/envs/v_pytorch2/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 360, in __init__
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "/home/chinmay/anaconda3/envs/v_pytorch2/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 408, in _create_inference_session
sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.NotImplemented: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for ConvInteger(10) node with name 'Conv_0_quant'
This is a duplicate of #12558, which was closed a few months ago, so I assumed support had been added to onnxruntime, but I am still getting the same error.
To reproduce
import onnx
from onnxruntime.quantization import quantize_dynamic, QuantType
model_fp32 = 'path/to/the/model.onnx'
model_quant = 'path/to/the/model.quant.onnx'
quantized_model = quantize_dynamic(model_fp32, model_quant)
import onnxruntime as ort
ort_session = ort.InferenceSession(model_quant, providers=['CPUExecutionProvider'])
Urgency
Not very urgent, but not low priority either.
Platform
Linux
OS Version
Ubuntu 20.04
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.14.1
ONNX Runtime API
Python
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
No response
I am getting the same issue. I might try reverting to a previous build just to see if it was working before.
I just tried the previous releases; they didn't work. Upon inspecting the code, it looks as if the patch may never have made it into a release. Here and here
There should be a line that says class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 10, uint8_t_int8_t, ConvInteger);
Maybe @jchen351 knows what happened.
I'm having the same issue. Can I help get the patch applied?
Same issue here. Do we have a timeline for the patch?
On the similar issue #3130, there is a comment that may temporarily solve the issue. But when we change weight_type=QuantType.QInt8 to QuantType.QUInt8, the quantized ONNX model seems to perform slower, as also mentioned in that issue. That also happened to me.
Besides, I think these issues happen because ONNX does not support ConvInteger for the signed int8 data type, here.
In the ONNX Runtime docs, in the Method Selection subsection, it says:
In general, it is recommended to use dynamic quantization for RNNs and transformer-based models, and static quantization for CNN models.
So I tried to follow the end-to-end example given in the docs, and it worked; the weights are int8. Here are the steps I took (a minimal sketch follows the list):
- Convert FaceNet-InceptionResNet to an ONNX model.
- Create a CalibrationDataReader using some facial images.
- Execute quantize_static().
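For reference, here is a rough, self-contained sketch of those steps once the model is already exported to ONNX; the file names, input name, and image shape are placeholders, and the random arrays stand in for real calibration images:
import numpy as np
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static

class ImageCalibrationReader(CalibrationDataReader):
    # Feeds a small set of preprocessed images to the calibrator, one dict per call.
    def __init__(self, images, input_name):
        self._iter = iter([{input_name: img} for img in images])

    def get_next(self):
        return next(self._iter, None)  # None tells the calibrator we are done

# Placeholder calibration data: replace with real preprocessed face crops.
calib_images = [np.random.rand(1, 3, 160, 160).astype(np.float32) for _ in range(8)]
reader = ImageCalibrationReader(calib_images, input_name="input")  # "input" is assumed

quantize_static(
    "facenet_inception_resnet.onnx",       # FP32 model exported to ONNX (placeholder name)
    "facenet_inception_resnet.int8.onnx",  # output path (placeholder name)
    calibration_data_reader=reader,
    weight_type=QuantType.QInt8,
)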
I have the same problem. Unfortunately, I am finding that onnxruntime does not support the ConvInteger operator, which means dynamic quantization does not work in onnxruntime if the initial model contains CNNs. Very sad!
I found a workaround to solve this, by setting:
quantize_dynamic(input_model, output_model, weight_type=QuantType.QInt8, nodes_to_exclude=['/conv1/Conv'])
Here, nodes_to_exclude should be the list of Conv layer names in your model. You may find them in the error message when loading the model using InferenceSession.
For example, for the error message mentioned in the title of this issue, it would be: 'Conv_0_quant'.
Another workaround is to exclude all operators causing the issue. For example:
In:
quantize_dynamic(input_model, output_model, weight_type=QuantType.QInt8, op_types_to_quantize=['MatMul', 'Attention', 'LSTM', 'Gather', 'Transpose', 'EmbedLayerNormalization'])
"Conv" was removed from op_types_to_quantize.
I'm also facing the same error and have tried solving this problem.
Same issue for segformer model
NotImplemented: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for ConvInteger(10) node with name '/segformer/encoder/patch_embeddings.0/proj/Conv_quant'
I just hit this issue too. I will skip quantizing Conv operators for now.
Same error; it only works with QUInt8 but not QInt8.
I found a workaround to solve this, by setting:
quantize_dynamic(input_model, output_model, weight_type=QuantType.QInt8, nodes_to_exclude=['/conv1/Conv'])
Here, nodes_to_exclude should be the list of Conv layer names in your model. You may find them in the error message when loading the model using InferenceSession. For example, for the error message mentioned in the title of this issue, it would be: 'Conv_0_quant'.
Hi, I tried this while trying to quantize a Whisper model (seq2seq) (see here); however, I am getting the following error:
NotImplemented: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for ConvInteger(10) node with name '/conv1/Conv_quant'
I tried to use nodes_to_exclude in my AutoQuantizationConfig to exclude the node but the error is still the same
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False, nodes_to_exclude=['/conv1/Conv_quant'])
Any help would be appreciated! 🙏
I had to better describe the operators to quantize and the following worked for me:
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
dqconfig.nodes_to_exclude = ['Conv_quant']
dqconfig.operators_to_quantize = ['MatMul', 'Attention', 'LSTM', 'Gather', 'Transpose', 'EmbedLayerNormalization']
Hope this helps, Thomas
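To make that config runnable end to end, here is a rough sketch of wiring it into Optimum's ORTQuantizer; the model directory, file name, and save path are placeholders, and the exact API may vary between Optimum versions:
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
dqconfig.nodes_to_exclude = ['Conv_quant']
dqconfig.operators_to_quantize = ['MatMul', 'Attention', 'LSTM', 'Gather', 'Transpose', 'EmbedLayerNormalization']

# "onnx_model_dir" is a placeholder for a directory containing the exported model.onnx.
quantizer = ORTQuantizer.from_pretrained('onnx_model_dir', file_name='model.onnx')
quantizer.quantize(save_dir='onnx_model_dir_quantized', quantization_config=dqconfig)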
still only works with QUInt8 for me
@ogencoglu were you able to solve the issue? I am facing the same problem with quantization, and I also get higher inference time with ONNX (as can be seen here). Any help will be appreciated.
@fazankabir No solution found so far.
I am able to do quantization with:
model_fp32 = 'model_Segformer.onnx'
model_quant = "model_dynamic_quant.onnx"
quantized_model = quantize_dynamic(model_fp32, model_quant, weight_type=QuantType.QUInt8)
instead of QuantType.QInt8
But while doing inference it takes even more time. Do you also see higher inference time after exporting the SegFormer to ONNX, @ogencoglu?
I haven't tested that (ONNX is a must for my case), but the quantized ONNX model has a longer inference time than the full-precision one. So I ditched SegFormer altogether. @Fazankabir
Yeah, you're right: a lot of quantized operators in ONNX run on the CPU instead of the GPU (maybe they aren't supported yet), so it can take more time to run some quantized models.
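One way to check where nodes actually end up is to enable verbose session logging and look at the graph-partitioning / node-placement messages; a rough sketch with a placeholder model path, assuming the onnxruntime-gpu package is installed:
import onnxruntime as ort

so = ort.SessionOptions()
so.log_severity_level = 0  # VERBOSE: logs graph partitioning / node placement details

sess = ort.InferenceSession(
    'model_dynamic_quant.onnx',  # placeholder path
    sess_options=so,
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider'],
)
print(sess.get_providers())  # providers actually registered for this session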
Any news about this topic? I have the same issue with this code:
import onnx
import onnxruntime as ort
import numpy as np
model = onnx.parser.parse_model(
"""
<ir_version: 5, opset_import: [ "" : 18 ]>
agraph (int8[N, 3, 32, 32] X) => (int32[N, 16, ?, ?] Y){
Y = ConvInteger(X, W)
}
"""
)
model.graph.initializer.extend(
[onnx.numpy_helper.from_array(np.ones((16, 3, 3, 3), "int8"), "W")]
)
# Inference
feed = {"X": np.ones((1, 3, 32, 32), "int8")}
sess = ort.InferenceSession(model.SerializeToString(), providers=["CPUExecutionProvider"])
# Error is raised here
y = sess.run(None, feed)[0]
- Run quantize_dynamic with an empty nodes_to_exclude and try to create an InferenceSession:
quantized_model = quantize_dynamic(
model_path,
quantized_model_path,
weight_type=QuantType.QInt8,
nodes_to_exclude=[],
)
import onnxruntime as ort
ort_session = ort.InferenceSession(
quantized_model_path, providers=["CPUExecutionProvider"]
)
- Find the failing node in the error message:
NotImplemented: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for ConvInteger(10) node with name '/model/patch_embed/proj/Conv_quant'
- Add that node to nodes_to_exclude, deleting the _quant suffix (/model/patch_embed/proj/Conv_quant => /model/patch_embed/proj/Conv), and quantize again:
quantized_model = quantize_dynamic(
model_path,
quantized_model_path,
weight_type=QuantType.QInt8,
nodes_to_exclude=[
"/model/patch_embed/proj/Conv" # /model/patch_embed/proj/Conv_quant => /model/patch_embed/proj/Conv
],
)
import onnxruntime as ort
ort_session = ort.InferenceSession(
quantized_model_path, providers=["CPUExecutionProvider"]
)
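The same procedure can be looped until the session loads; here is a rough sketch (the helper name is made up, and the regex parses the node name out of the error text shown above):
import re
import onnxruntime as ort
from onnxruntime.quantization import QuantType, quantize_dynamic

def quantize_with_auto_exclude(model_path, quantized_model_path, max_rounds=100):
    excluded = []
    for _ in range(max_rounds):
        quantize_dynamic(
            model_path,
            quantized_model_path,
            weight_type=QuantType.QInt8,
            nodes_to_exclude=excluded,
        )
        try:
            ort.InferenceSession(quantized_model_path, providers=['CPUExecutionProvider'])
            return excluded  # the session loads: done
        except Exception as err:  # NOT_IMPLEMENTED errors carry the node name
            match = re.search(r"node with name '([^']+)'", str(err))
            if match is None:
                raise
            name = match.group(1)
            if name.endswith('_quant'):
                name = name[:-len('_quant')]  # Conv_quant => Conv
            excluded.append(name)
    raise RuntimeError('Gave up after excluding %d nodes' % len(excluded))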
I have a ResNet-based model that I cannot quantize to any format except QuantType.QUInt8; I would like to quantize it to other formats too, but without excluding the Conv nodes. Also, has anyone managed to run the quantized model on GPU using CUDAExecutionProvider? Even if I list the CUDA provider first, it does the inference with the CPU provider, which takes more time than the unquantized model... I would really appreciate a way to solve this.
I fixed it by excluding the failing nodes. Most of the failures were in the ConvInteger nodes.
import onnx
model_fp32_path = "./models/tusimple_18_V1_fp32.onnx"
model = onnx.load(model_fp32_path)
conv_nodes = []
for node in model.graph.node:
if node.op_type == "Conv":
conv_nodes.append(node.name)
conv_nodes
# ['/model/conv1/Conv',
# '/model/layer1/layer1.0/conv1/Conv',
# '/model/layer1/layer1.0/conv2/Conv',
# '/model/layer1/layer1.1/conv1/Conv',
# '/model/layer1/layer1.1/conv2/Conv',
# '/model/layer2/layer2.0/conv1/Conv',
# '/model/layer2/layer2.0/conv2/Conv',
# '/model/layer2/layer2.0/downsample/downsample.0/Conv',
# '/model/layer2/layer2.1/conv1/Conv',
# '/model/layer2/layer2.1/conv2/Conv',
# '/model/layer3/layer3.0/conv1/Conv',
# '/model/layer3/layer3.0/conv2/Conv',
# '/model/layer3/layer3.0/downsample/downsample.0/Conv',
# '/model/layer3/layer3.1/conv1/Conv',
# '/model/layer3/layer3.1/conv2/Conv',
# '/model/layer4/layer4.0/conv1/Conv',
# '/model/layer4/layer4.0/conv2/Conv',
# '/model/layer4/layer4.0/downsample/downsample.0/Conv',
# '/model/layer4/layer4.1/conv1/Conv',
# '/model/layer4/layer4.1/conv2/Conv',
# '/pool/Conv']
from onnxruntime.quantization import quantize_dynamic, QuantType
model_int8_path = "./models/model_int8_dynamic.onnx"
quantized_model = quantize_dynamic(
model_fp32_path, # Input model (FP32)
model_int8_path, # Output model
weight_type=QuantType.QInt8,
nodes_to_exclude=conv_nodes,
)
print("Dynamic Quantization Complete!")
And what is the inference speed, @abaoxomtieu? In my use case, partial quantization is slower than the non-quantized model on CPU.
It's faster than float16 on my laptop
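For anyone comparing speeds, here is a minimal timing sketch; the paths reuse the file names from the snippet above, and the input shape/dtype are placeholders to adjust to your model:
import time
import numpy as np
import onnxruntime as ort

def mean_latency(model_path, input_shape=(1, 3, 224, 224), runs=50):
    # Assumes a single float32 input; adjust shape/dtype to your network.
    sess = ort.InferenceSession(model_path, providers=['CPUExecutionProvider'])
    name = sess.get_inputs()[0].name
    x = np.random.rand(*input_shape).astype(np.float32)
    sess.run(None, {name: x})  # warm-up
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, {name: x})
    return (time.perf_counter() - start) / runs

print('fp32:', mean_latency('./models/tusimple_18_V1_fp32.onnx'))
print('int8:', mean_latency('./models/model_int8_dynamic.onnx'))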