DynamicQuantizeLinear opset 20 and float 8
Description
DynamicQuantizeLinear only supports uint8. This PR adds support for int8 and float 8.
Motivation and Context
The operator is used to dynamically quantize an input.
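For reference, here is a minimal NumPy sketch of what the operator computes today for uint8 (following the existing specification) together with a plausible int8 extension. The int8 branch is an assumption shown only for illustration, not necessarily the formula adopted by this PR; float 8 scale estimation is discussed further below.

```python
import numpy as np

def dynamic_quantize_linear(x: np.ndarray, to: str = "uint8"):
    """Sketch of DynamicQuantizeLinear: returns (y, scale, zero_point)."""
    if to == "uint8":
        # The spec extends the observed range to include 0 so that zero is
        # exactly representable, then quantizes to [0, 255].
        x_min = min(float(x.min()), 0.0)
        x_max = max(float(x.max()), 0.0)
        scale = (x_max - x_min) / 255.0 or 1.0  # guard against an all-zero input
        zero_point = int(np.clip(round(-x_min / scale), 0, 255))
        y = np.clip(np.rint(x / scale) + zero_point, 0, 255).astype(np.uint8)
        return y, np.float32(scale), np.uint8(zero_point)
    if to == "int8":
        # Assumption: symmetric quantization over [-127, 127] with a zero point of 0.
        amax = float(np.abs(x).max()) or 1.0
        scale = amax / 127.0
        y = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
        return y, np.float32(scale), np.int8(0)
    raise ValueError(f"Unsupported target type {to!r}")
```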
We also need to add fp8 support for MatMulInteger to support dynamic quantization for fp8.
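To illustrate the pattern involved, here is a small onnx.helper sketch of how dynamic quantization is typically expressed today with uint8: DynamicQuantizeLinear feeds MatMulInteger, and the int32 result is rescaled. The weight names and shapes are placeholders; with fp8 inputs this graph is not yet valid, which is why MatMulInteger would need the extension mentioned above.

```python
from onnx import TensorProto, checker, helper

# W_q / W_scale / W_zp stand for a pre-quantized weight and its quantization
# parameters; they are placeholder inputs for this sketch.
nodes = [
    helper.make_node("DynamicQuantizeLinear", ["X"], ["X_q", "X_scale", "X_zp"]),
    helper.make_node("MatMulInteger", ["X_q", "W_q", "X_zp", "W_zp"], ["Y_i32"]),
    # The int32 accumulator is rescaled with the activation and weight scales.
    helper.make_node("Cast", ["Y_i32"], ["Y_f"], to=TensorProto.FLOAT),
    helper.make_node("Mul", ["Y_f", "X_scale"], ["Y_s"]),
    helper.make_node("Mul", ["Y_s", "W_scale"], ["Y"]),
]
graph = helper.make_graph(
    nodes,
    "dynamic_quant_matmul",
    inputs=[
        helper.make_tensor_value_info("X", TensorProto.FLOAT, ["M", "K"]),
        helper.make_tensor_value_info("W_q", TensorProto.UINT8, ["K", "N"]),
        helper.make_tensor_value_info("W_scale", TensorProto.FLOAT, []),
        helper.make_tensor_value_info("W_zp", TensorProto.UINT8, []),
    ],
    outputs=[helper.make_tensor_value_info("Y", TensorProto.FLOAT, ["M", "N"])],
)
model = helper.make_model(graph)
checker.check_model(model)
```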
The CUDA function cublasLtMatMul allows more than one output type for the same input types. Since there is no scale for the output, the output type could be float32, float16, or bfloat16. I started to modify QLinearMatMul in PR #5473, which can be seen as a more generic version of MatMulInteger. There is also the transposition to take out of the equation: with float 8, cublasLtMatMul only supports A.T @ B (and column-major order). Zero point is not used for float 8 types, and the name MatMulInteger also includes Integer in it. Is it possible to modify the quantization tools to use QLinearMatMul instead?
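For comparison, QLinearMatMul already carries explicit scales and zero points for both operands and for the output, which is what makes it the more generic operator; the sketch below just shows its input signature (a float 8 variant would simply not use the zero points, per the discussion above).

```python
from onnx import helper

# QLinearMatMul input order per the ONNX spec:
# a, a_scale, a_zero_point, b, b_scale, b_zero_point, y_scale, y_zero_point.
qlinear_matmul = helper.make_node(
    "QLinearMatMul",
    inputs=[
        "A", "A_scale", "A_zero_point",
        "B", "B_scale", "B_zero_point",
        "Y_scale", "Y_zero_point",
    ],
    outputs=["Y"],
)
```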
Nit: "convertion" -> "conversion"
Is this ready for reviews?
The only thing that would require broader consensus is the method I used to estimate the scale for float 8. Models are usually trained with float 8, and the scale estimation is part of the training; that is different from what I came up with.
Cc @gramalingam
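For concreteness, two simple ways to pick a float 8 scale are sketched below, purely as illustrations; neither is necessarily the method this PR implements, and the function names and the target_std parameter are assumptions made for the example.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8e4m3fn

def scale_from_amax(x: np.ndarray) -> np.float32:
    # Map the largest observed magnitude onto the largest finite float 8 value.
    return np.float32(np.abs(x).max() / FP8_E4M3_MAX)

def scale_from_std(x: np.ndarray, target_std: float = 1.0) -> np.float32:
    # Pick the scale so the scaled values have a chosen standard deviation
    # (target_std is an arbitrary illustrative parameter, not a spec value).
    return np.float32(x.std() / target_std)
```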