Performance: FP8 vs. SmoothQuant INT8
Reference: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/quantization-in-TRT-LLM.md#performance
| Model | Batch Size | Speedup (FP8 vs. FP16) | Speedup (INT8 SQ vs. FP16) |
|---|---|---|---|
| GPT-J | 1 | 1.40x | 1.40x |
| GPT-J | 8 | 1.44x | 1.30x |
| LLaMA-v2-7B | 1 | 1.51x | 1.47x |
| LLaMA-v2-7B | 8 | 1.40x | 1.32x |
My question is: why is the FP8 speedup better than the INT8 SmoothQuant speedup, given that FP8 and INT8 Tensor Core TFLOPS are the same on H100?
INT8 SmoothQuant has an extra quantize/dequantize cost.
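A minimal sketch of where that cost shows up, assuming a SmoothQuant-style per-channel smoothing factor; the function names and scale values below are illustrative, not TensorRT-LLM internals. The smoothing and quantize steps are standalone elementwise passes over the activations, whereas the FP8 path only needs a scale-and-cast that can typically be fused into a neighboring kernel's epilogue:

```python
import torch

def smoothquant_int8_linear(x, w_int8, s_act, s_w, smooth):
    """Simulated SmoothQuant INT8 linear. Everything outside the GEMM
    (smoothing, quantize, dequantize) is extra elementwise work."""
    # 1. Smoothing: divide activations by a per-channel factor. If this
    #    cannot be fused into the preceding LayerNorm, it is an extra kernel.
    x = x / smooth
    # 2. Quantize activations to INT8 (round + clamp, another elementwise pass).
    x_int8 = torch.clamp(torch.round(x / s_act), -128, 127).to(torch.int8)
    # 3. INT8 GEMM. Emulated in float32 here for portability; the real
    #    kernels run on INT8 Tensor Cores and accumulate in INT32.
    acc = x_int8.to(torch.float32) @ w_int8.to(torch.float32).t()
    # 4. Dequantize the accumulator back to the model's compute dtype.
    return acc * (s_act * s_w)

x = torch.randn(8, 4096)
w_int8 = torch.randint(-128, 128, (4096, 4096), dtype=torch.int8)
y = smoothquant_int8_linear(x, w_int8, s_act=0.05, s_w=0.01,
                            smooth=torch.ones(4096))
```

So even at identical Tensor Core throughput, the INT8 path can pay more per-layer elementwise overhead than the FP8 path.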
@enozhu FP8 uses per-tensor mode; did you also benchmark INT8 in per-tensor mode? If so, there are a few possibilities:
- There are two smoothing operations per layer that cannot be merged into the LayerNorm, which can add overhead (see the sketch after this list).
- The INT8 path uses CUTLASS, and kernel parameters such as the thread-block swizzle may not be tuned optimally.
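To make the first point concrete, here is a hedged sketch of the fusion argument. When a GEMM input comes straight out of a LayerNorm with an elementwise affine, the smoothing division can be folded into the LN parameters for free; for GEMM inputs that are not produced by a LN, smoothing needs its own elementwise kernel. Function names here are hypothetical, for illustration only:

```python
import torch
import torch.nn as nn

def fold_smoothing_into_ln(ln: nn.LayerNorm, smooth: torch.Tensor) -> None:
    """Fold x -> x / smooth into the LN affine: LN(x)*g + b becomes
    LN(x)*(g/smooth) + (b/smooth), so no extra kernel is launched."""
    with torch.no_grad():
        ln.weight.div_(smooth)
        ln.bias.div_(smooth)

def standalone_smoothing(x: torch.Tensor, smooth: torch.Tensor) -> torch.Tensor:
    """For GEMM inputs not produced by a LN (e.g. the input of an output
    projection), smoothing is a separate elementwise pass."""
    return x / smooth

# Quick check that the folding is exact.
ln = nn.LayerNorm(4096)
smooth = torch.rand(4096) + 0.5
x = torch.randn(2, 4096)
ref = ln(x) / smooth
fold_smoothing_into_ln(ln, smooth)
assert torch.allclose(ln(x), ref, atol=1e-6)
```

The GEMMs whose inputs cannot be traced back to a LN are the ones that force the standalone smoothing kernels mentioned above.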