
Support u8 x u8 non range-reduced operands for MatMulInteger

Open robertknight opened this issue 11 months ago • 1 comment

If a model is quantized using dynamic quantization, and quantization is enabled for MatMul operators whose RHS is not a constant (see MatMulConstBOnly in the ONNX Runtime quantization tools), the resulting model will have MatMulInteger operators with u8 x u8 operands, which is not currently supported. Unlike MatMulInteger with a constant RHS, the range of values in the RHS is not restricted to a 7-bit range ([-64, 63] or [0, 127]). This means the current AVX2 kernel for u8 x i8 may encounter saturation. ORT handles this by using a different, less efficient kernel for the u8 x u8 case under AVX2.
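To illustrate the saturation risk, here is a minimal scalar model of the saturating i16 step in AVX2's VPMADDUBSW instruction, which u8 x i8 dot-product kernels of this kind are typically built around (an assumption about the kernel's internals; this is a sketch of the mechanism, not rten's actual kernel code):

```python
# Scalar model of VPMADDUBSW: multiply two adjacent u8 x i8 pairs and add
# the products with *saturating* i16 arithmetic.
def maddubs(a0: int, a1: int, b0: int, b1: int) -> int:
    total = a0 * b0 + a1 * b1
    return max(-32768, min(32767, total))  # clamp to i16 range

# Range-reduced RHS ([-64, 63]): worst case 255 * -64 * 2 = -32640 still
# fits in i16, so the result is exact.
assert maddubs(255, 255, -64, -64) == -32640

# Full-range i8 RHS: the exact sum 255 * -128 * 2 = -65280 overflows i16,
# so the result saturates and the accumulated product is wrong.
assert maddubs(255, 255, -128, -128) == -32768
```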

Example model: https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2/blob/main/onnx/model_qint8_arm64.onnx

A workaround is to quantize the model with MatMulConstBOnly=True.
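For reference, a sketch of that workaround using ONNX Runtime's dynamic quantization API (the model paths here are placeholders):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Restrict quantization to MatMuls whose RHS is a constant initializer,
# so the resulting MatMulInteger ops keep u8 x i8 (range-reduced) operands.
quantize_dynamic(
    "model.onnx",        # placeholder input path
    "model_quant.onnx",  # placeholder output path
    weight_type=QuantType.QInt8,
    extra_options={"MatMulConstBOnly": True},
)
```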

robertknight • Feb 12 '25

u8 x u8 support is also needed for models that have been dynamically quantized with activations=u8, weights=u8 in order to work around ORT's ConvInteger operator not supporting the default activations=u8, weights=i8 quantization. For example, see the model_quantized.onnx file in https://huggingface.co/onnx-community/dinov2-large/tree/main/onnx.
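A sketch of producing such a u8/u8 model with the same API (paths are placeholders; my understanding is that dynamic quantization in ORT quantizes activations to u8 at runtime via DynamicQuantizeLinear, and QuantType.QUInt8 selects u8 weights):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# u8 weights; activations become u8 dynamically, giving MatMulInteger
# operators with u8 x u8 operands.
quantize_dynamic(
    "model.onnx",             # placeholder input path
    "model_u8_weights.onnx",  # placeholder output path
    weight_type=QuantType.QUInt8,
)
```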

robertknight • Mar 29 '25