DynamicQuantizeLinear opset 20 and float 8
Description
DynamicQuantizeLinear only supports uint8. This PR adds support for int8 and float 8.
Motivation and Context
The operator is used to dynamically quantize an input.
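For reference, here is a minimal NumPy sketch of what the operator computes today for uint8 (following the existing specification) together with a plausible int8 extension. The int8 branch is an assumption shown only for illustration, not necessarily the formula adopted by this PR; float 8 scale estimation is discussed further below.

```python
import numpy as np

def dynamic_quantize_linear(x: np.ndarray, to: str = "uint8"):
    """Sketch of DynamicQuantizeLinear: returns (y, scale, zero_point)."""
    if to == "uint8":
        # The spec extends the observed range to include 0 so that zero is
        # exactly representable, then quantizes to [0, 255].
        x_min = min(float(x.min()), 0.0)
        x_max = max(float(x.max()), 0.0)
        scale = (x_max - x_min) / 255.0 or 1.0  # guard against an all-zero input
        zero_point = int(np.clip(round(-x_min / scale), 0, 255))
        y = np.clip(np.rint(x / scale) + zero_point, 0, 255).astype(np.uint8)
        return y, np.float32(scale), np.uint8(zero_point)
    if to == "int8":
        # Assumption: symmetric quantization over [-127, 127] with a zero point of 0.
        amax = float(np.abs(x).max()) or 1.0
        scale = amax / 127.0
        y = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
        return y, np.float32(scale), np.int8(0)
    raise ValueError(f"Unsupported target type {to!r}")
```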
We also need to add fp8 support for MatMulInteger to support dynamic quantization for fp8.
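To illustrate the pattern involved, here is a small onnx.helper sketch of how dynamic quantization is typically expressed today with uint8: DynamicQuantizeLinear feeds MatMulInteger, and the int32 result is rescaled. The weight names and shapes are placeholders; with fp8 inputs this graph is not yet valid, which is why MatMulInteger would need the extension mentioned above.

```python
from onnx import TensorProto, checker, helper

# W_q / W_scale / W_zp stand for a pre-quantized weight and its quantization
# parameters; they are placeholder inputs for this sketch.
nodes = [
    helper.make_node("DynamicQuantizeLinear", ["X"], ["X_q", "X_scale", "X_zp"]),
    helper.make_node("MatMulInteger", ["X_q", "W_q", "X_zp", "W_zp"], ["Y_i32"]),
    # The int32 accumulator is rescaled with the activation and weight scales.
    helper.make_node("Cast", ["Y_i32"], ["Y_f"], to=TensorProto.FLOAT),
    helper.make_node("Mul", ["Y_f", "X_scale"], ["Y_s"]),
    helper.make_node("Mul", ["Y_s", "W_scale"], ["Y"]),
]
graph = helper.make_graph(
    nodes,
    "dynamic_quant_matmul",
    inputs=[
        helper.make_tensor_value_info("X", TensorProto.FLOAT, ["M", "K"]),
        helper.make_tensor_value_info("W_q", TensorProto.UINT8, ["K", "N"]),
        helper.make_tensor_value_info("W_scale", TensorProto.FLOAT, []),
        helper.make_tensor_value_info("W_zp", TensorProto.UINT8, []),
    ],
    outputs=[helper.make_tensor_value_info("Y", TensorProto.FLOAT, ["M", "N"])],
)
model = helper.make_model(graph)
checker.check_model(model)
```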
The CUDA function cublasLtMatMul allows more than one output type for the same input types. Since there is no scale for the output, the output type could be float32, float16, or bfloat16. I started to modify QLinearMatMul in PR #5473, which can be seen as a more generic version of MatMulInteger. There is also the transposition to take out of the equation: with float 8, cublasLtMatMul only supports A.T @ B (and column-major order). Zero point is not used for float 8 types, and the name MatMulInteger also includes Integer in it. Is it possible to modify the quantization tools to use QLinearMatMul instead?
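For comparison, QLinearMatMul already carries explicit scales and zero points for both operands and for the output, which is what makes it the more generic operator; the sketch below just shows its input signature (a float 8 variant would simply not use the zero points, per the discussion above).

```python
from onnx import helper

# QLinearMatMul input order per the ONNX spec:
# a, a_scale, a_zero_point, b, b_scale, b_zero_point, y_scale, y_zero_point.
qlinear_matmul = helper.make_node(
    "QLinearMatMul",
    inputs=[
        "A", "A_scale", "A_zero_point",
        "B", "B_scale", "B_zero_point",
        "Y_scale", "Y_zero_point",
    ],
    outputs=["Y"],
)
```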
Nit: "convertion" -> "conversion"
Is this ready for reviews?
The only thing that would require broader consensus is the method I used to estimate the scale for float 8. Models are usually trained with float 8, and the scale estimation is part of the training; that is different from what I came up with.
Cc @gramalingam
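For concreteness, two simple ways to pick a float 8 scale are sketched below, purely as illustrations; neither is necessarily the method this PR implements, and the function names and the target_std parameter are assumptions made for the example.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8e4m3fn

def scale_from_amax(x: np.ndarray) -> np.float32:
    # Map the largest observed magnitude onto the largest finite float 8 value.
    return np.float32(np.abs(x).max() / FP8_E4M3_MAX)

def scale_from_std(x: np.ndarray, target_std: float = 1.0) -> np.float32:
    # Pick the scale so the scaled values have a chosen standard deviation
    # (target_std is an arbitrary illustrative parameter, not a spec value).
    return np.float32(x.std() / target_std)
```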