
performance: fp8 vs smoothquant int8

Open enozhu opened this issue 1 year ago • 1 comment

reference: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/quantization-in-TRT-LLM.md#performance

| Model | Batch Size | Speedup (FP8 vs. FP16) | Speedup (INT8 SQ vs. FP16) |
|---|---|---|---|
| GPT-J | 1 | 1.40x | 1.40x |
| GPT-J | 8 | 1.44x | 1.30x |
| LLaMA-v2-7B | 1 | 1.51x | 1.47x |
| LLaMA-v2-7B | 8 | 1.40x | 1.32x |

My question is: why is the FP8 speedup better than INT8 SmoothQuant, given that FP8 and INT8 Tensor Core TFLOPS are the same on H100?

enozhu avatar Aug 01 '24 04:08 enozhu

INT8 SmoothQuant has quantize/dequantize cost.
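
To make the overhead concrete, here is a minimal NumPy sketch of the extra elementwise work an INT8 SmoothQuant path pays around each GEMM. This is illustrative only, not TensorRT-LLM code; the function and variable names are made up, and a real kernel would fuse most of these steps.

```python
import numpy as np

def int8_sq_gemm(x_fp16, w_fp16, smooth_scale):
    """Illustrative SmoothQuant INT8 GEMM: x_fp16 is (M, K), w_fp16 is (K, N),
    smooth_scale is a per-channel (K,) smoothing factor."""
    # SmoothQuant migrates activation outliers into the weights via a
    # per-channel smoothing scale: (x / s) @ (s * w) == x @ w.
    # If this multiply is not fused elsewhere, it is an extra elementwise pass.
    x_s = x_fp16 / smooth_scale
    w_s = w_fp16 * smooth_scale[:, None]

    # Per-tensor quantization of activations and weights (more elementwise work).
    sx = np.abs(x_s).max() / 127.0
    sw = np.abs(w_s).max() / 127.0
    xq = np.clip(np.round(x_s / sx), -127, 127).astype(np.int8)
    wq = np.clip(np.round(w_s / sw), -127, 127).astype(np.int8)

    # INT8 GEMM accumulates in INT32, then dequantizes back to FP16.
    acc = xq.astype(np.int32) @ wq.astype(np.int32)
    return (acc * (sx * sw)).astype(np.float16)
```

An FP8 path with per-tensor scales can typically fold the scaling into the GEMM epilogue, so it avoids most of these separate passes even though the Tensor Core throughput is the same.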

renjie0 avatar Aug 08 '24 05:08 renjie0

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] avatar Sep 08 '24 02:09 github-actions[bot]

@enozhu FP8 uses per-tensor mode; did you compare INT8 in per-tensor mode as well? If yes, there are a few possibilities:

  1. There are two smooth operations per layer that cannot be merged with the LayerNorm. This could add overhead (see the sketch after this list).
  2. INT8 uses CUTLASS; parameters such as the thread block swizzle may not be tuned optimally.
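
To illustrate point 1: when a smoothing scale sits right after a LayerNorm, it can be folded into the LayerNorm's affine parameters offline, so it is free at runtime. A smoothing scale whose input comes from something else (e.g. an activation function) has nothing to fold into and stays a separate elementwise pass. A minimal sketch, with made-up names and a standard LayerNorm formulation:

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    # Standard LayerNorm over the last dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def fold_smooth_into_layernorm(gamma, beta, smooth_scale):
    # Dividing the LN output by a per-channel smoothing scale is equivalent
    # to rescaling gamma and beta once, offline: zero runtime cost.
    return gamma / smooth_scale, beta / smooth_scale

# Quick equivalence check.
x = np.random.randn(4, 8).astype(np.float32)
gamma, beta = np.ones(8, np.float32), np.zeros(8, np.float32)
s = np.random.uniform(0.5, 2.0, 8).astype(np.float32)
g2, b2 = fold_smooth_into_layernorm(gamma, beta, s)
assert np.allclose(layernorm(x, gamma, beta) / s, layernorm(x, g2, b2), atol=1e-5)
```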

Tracin avatar Sep 11 '24 08:09 Tracin