
performance: fp8 vs smoothquant int8

Open enozhu opened this issue 1 year ago • 1 comment

reference: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/quantization-in-TRT-LLM.md#performance

| Model | Batch Size | Speedup (FP8 vs. FP16) | Speedup (INT8 SQ vs. FP16) |
|---|---|---|---|
| GPT-J | 1 | 1.40x | 1.40x |
| GPT-J | 8 | 1.44x | 1.30x |
| LLaMA-v2-7B | 1 | 1.51x | 1.47x |
| LLaMA-v2-7B | 8 | 1.40x | 1.32x |

My question is: why is the FP8 speedup better than INT8 SmoothQuant, given that FP8 and INT8 Tensor Core TFLOPS are the same on H100?

enozhu avatar Aug 01 '24 04:08 enozhu

INT8 SmoothQuant has quantize/dequantize cost.
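
To make the overhead concrete, here is a minimal NumPy sketch of the extra elementwise work an INT8 SmoothQuant path pays around each GEMM. This is illustrative only, not TensorRT-LLM code; the function and variable names are made up, and a real kernel would fuse most of these steps.

```python
import numpy as np

def int8_sq_gemm(x_fp16, w_fp16, smooth_scale):
    """Illustrative SmoothQuant INT8 GEMM: x_fp16 is (M, K), w_fp16 is (K, N),
    smooth_scale is a per-channel (K,) smoothing factor."""
    # SmoothQuant migrates activation outliers into the weights via a
    # per-channel smoothing scale: (x / s) @ (s * w) == x @ w.
    # If this multiply is not fused elsewhere, it is an extra elementwise pass.
    x_s = x_fp16 / smooth_scale
    w_s = w_fp16 * smooth_scale[:, None]

    # Per-tensor quantization of activations and weights (more elementwise work).
    sx = np.abs(x_s).max() / 127.0
    sw = np.abs(w_s).max() / 127.0
    xq = np.clip(np.round(x_s / sx), -127, 127).astype(np.int8)
    wq = np.clip(np.round(w_s / sw), -127, 127).astype(np.int8)

    # INT8 GEMM accumulates in INT32, then dequantizes back to FP16.
    acc = xq.astype(np.int32) @ wq.astype(np.int32)
    return (acc * (sx * sw)).astype(np.float16)
```

An FP8 path with per-tensor scales can typically fold the scaling into the GEMM epilogue, so it avoids most of these separate passes even though the Tensor Core throughput is the same.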

renjie0 avatar Aug 08 '24 05:08 renjie0

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] avatar Sep 08 '24 02:09 github-actions[bot]

@enozhu FP8 uses per-tensor mode; did you compare INT8 in per-tensor mode as well? If yes, there are a few possibilities:

  1. There are two smooth operations per layer that cannot be merged with the LayerNorm. This could add overhead (see the sketch after this list).
  2. INT8 uses CUTLASS; parameters such as the thread block swizzle may not be tuned optimally.
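
To illustrate point 1: when a smoothing scale sits right after a LayerNorm, it can be folded into the LayerNorm's affine parameters offline, so it is free at runtime. A smoothing scale whose input comes from something else (e.g. an activation function) has nothing to fold into and stays a separate elementwise pass. A minimal sketch, with made-up names and a standard LayerNorm formulation:

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    # Standard LayerNorm over the last dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def fold_smooth_into_layernorm(gamma, beta, smooth_scale):
    # Dividing the LN output by a per-channel smoothing scale is equivalent
    # to rescaling gamma and beta once, offline: zero runtime cost.
    return gamma / smooth_scale, beta / smooth_scale

# Quick equivalence check.
x = np.random.randn(4, 8).astype(np.float32)
gamma, beta = np.ones(8, np.float32), np.zeros(8, np.float32)
s = np.random.uniform(0.5, 2.0, 8).astype(np.float32)
g2, b2 = fold_smooth_into_layernorm(gamma, beta, s)
assert np.allclose(layernorm(x, gamma, beta) / s, layernorm(x, g2, b2), atol=1e-5)
```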

Tracin avatar Sep 11 '24 08:09 Tracin