lishicheng1996

2 issues by lishicheng1996

Following the logic of MMHA_FP8_SCALE_Q_INSTEAD_OF_K and MMHA_FP8_SCALE_P_INSTEAD_OF_V, I implemented an INT8 version. It is theoretically equivalent to the original compute logic, with no numerical accuracy degradation. I tested the speed...
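A minimal sketch of the algebra behind the "scale Q instead of K" trick, using numpy: since the dequantization scale is a scalar factor in the dot product, folding it into Q once is mathematically the same as dequantizing every K element. The variable names and shapes here are illustrative assumptions, not the actual TensorRT-LLM MMHA kernel code.

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.standard_normal(64).astype(np.float32)              # query vector
k_int8 = rng.integers(-128, 128, (16, 64)).astype(np.int8)  # quantized key cache
k_scale = np.float32(0.05)                                  # per-tensor dequant scale

# Reference path: dequantize K, then compute K @ q
ref = k_int8.astype(np.float32) * k_scale @ q

# Folded path: scale q instead of K; one scalar multiply instead of
# one per K element, and the result is numerically equivalent
folded = k_int8.astype(np.float32) @ (q * k_scale)

assert np.allclose(ref, folded)
```

The same reasoning applies to scaling P instead of V in the second GEMM, which is why the INT8 variant can follow the FP8 macros' logic unchanged.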

### Describe the feature request ONNX Runtime INT8 quantization could generate an INT8 calibration cache file that stores the scales or tensor ranges, just like TensorRT does, to avoid redoing calibration with...

feature request
quantization
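A hedged sketch of what such a cache could look like, modeled loosely on the TensorRT-style format of one `tensor_name: <hex of the float32 scale's IEEE-754 bits>` line per tensor. The header string, function names, and exact layout here are assumptions for illustration; verify against the format your TensorRT version actually emits before relying on it.

```python
import struct

def write_calibration_cache(scales: dict, path: str) -> None:
    """Serialize per-tensor scales to a TensorRT-style text cache (sketch)."""
    with open(path, "w") as f:
        f.write("TRT-XXXX-EntropyCalibration2\n")  # illustrative header line
        for name, scale in scales.items():
            # big-endian IEEE-754 float32 bits, hex-encoded
            f.write(f"{name}: {struct.pack('>f', scale).hex()}\n")

def read_calibration_cache(path: str) -> dict:
    """Parse the cache back into {tensor_name: scale} (sketch)."""
    scales = {}
    with open(path) as f:
        next(f)  # skip header line
        for line in f:
            name, hex_bits = line.rsplit(": ", 1)
            scales[name] = struct.unpack(">f", bytes.fromhex(hex_bits.strip()))[0]
    return scales
```

With a cache like this, a second quantization run could load the stored scales instead of re-running calibration over the dataset.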