Results: 2 issues of lishicheng1996
Following the logic of MMHA_FP8_SCALE_Q_INSTEAD_OF_K and MMHA_FP8_SCALE_P_INSTEAD_OF_V, I implemented the INT8 version. It is theoretically equivalent to the original compute logic, with no numerical accuracy degradation. I tested the speed...
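The equivalence claimed above rests on a simple algebraic fact: a per-tensor dequantization scale is a scalar factor of the Q·Kᵀ product, so it can be folded into Q instead of being applied to K. A minimal NumPy sketch (an illustration of the idea, not the TensorRT-LLM kernel; all names here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8)).astype(np.float32)          # query, already FP
k_int8 = rng.integers(-128, 127, size=(4, 8), dtype=np.int8)  # quantized key
k_scale = np.float32(0.05)  # assumed per-tensor dequant scale for K

# Baseline: dequantize K first, then compute the attention logits.
logits_ref = q @ (k_int8.astype(np.float32) * k_scale).T

# Folded: scale Q instead, leaving K untouched for the GEMM.
logits_folded = (q * k_scale) @ k_int8.astype(np.float32).T

# The two orderings agree up to float32 rounding.
assert np.allclose(logits_ref, logits_folded, rtol=1e-4, atol=1e-5)
```

The same reasoning applies to folding P's scale into V rather than dequantizing V, which is what MMHA_FP8_SCALE_P_INSTEAD_OF_V exploits.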
### Describe the feature request ONNX Runtime INT8 quantization could generate an INT8 calibration cache file to store the scales or tensor ranges, just like TRT, to avoid redoing calibration with...
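The requested workflow is a round trip: run calibration once, persist the per-tensor scales, and reload them on later runs instead of recalibrating. A hedged sketch of that round trip (the plain-JSON layout and function names here are assumptions for illustration, not ONNX Runtime's or TensorRT's actual cache format):

```python
import json
import os
import tempfile

def save_calibration_cache(path, scales):
    """Persist per-tensor scales so calibration need not be repeated."""
    with open(path, "w") as f:
        json.dump(scales, f, indent=2)

def load_calibration_cache(path):
    """Return cached scales, or None to signal that calibration must run."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)

# Hypothetical tensor ranges produced by a calibration pass.
scales = {"input": 0.021, "conv1_out": 0.13}
path = os.path.join(tempfile.mkdtemp(), "calib_cache.json")
save_calibration_cache(path, scales)
assert load_calibration_cache(path) == scales
```

TensorRT's own `IInt8Calibrator` interface exposes the analogous pair of hooks (`readCalibrationCache` / `writeCalibrationCache`), which is the behavior this feature request asks ONNX Runtime to mirror.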
feature request
quantization