
Calibration for NVFP4?

Open mklachko opened this issue 9 months ago • 3 comments

The README says:

nvfp4: Weights are quantized to NVFP4 block-wise with size 16. Activation global scale are calibrated.
fp8: Weights are quantized to FP8 tensor wise. Activation ranges are calibrated tensor wise.

So if I want to quantize both weights and activations to NVFP4, do I need to calibrate the activation range? If so, will the activations also have group size 16? And if so, will each group have its own range, or some globally calibrated range? What does "global scale" mean here?

mklachko avatar Mar 24 '25 18:03 mklachko

Hi @mklachko, NVFP4 uses a two-level quantization scheme: the first (top) level is a per-tensor quantization scaling factor, and the second level is a fine-grained block-wise quantization scaling factor. And yes, each group has its own range and scaling factor.
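For illustration, here is a rough NumPy sketch of my understanding of this two-level scheme. The block size of 16, the FP4 E2M1 max of 6.0, and the FP8 E4M3 max of 448.0 are assumptions from the NVFP4 format description; the helper names are made up and are not TensorRT-LLM or ModelOpt APIs:

```python
import numpy as np

BLOCK = 16
FP4_MAX = 6.0    # largest magnitude representable in FP4 E2M1
FP8_MAX = 448.0  # largest magnitude representable in FP8 E4M3
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def snap_to_fp4(v):
    # Round each scaled value to the nearest representable E2M1 magnitude, keeping the sign.
    idx = np.abs(np.abs(v)[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(v) * FP4_GRID[idx]

def quantize_nvfp4_sketch(x):
    blocks = x.reshape(-1, BLOCK)
    # Level 1: a single per-tensor "global" scale (this is what calibration fixes for activations).
    global_scale = np.abs(x).max() / (FP4_MAX * FP8_MAX)
    # Level 2: one fine-grained scale per 16-element block, stored relative to the global scale.
    block_amax = np.abs(blocks).max(axis=1, keepdims=True)
    block_scale_fp8 = (block_amax / FP4_MAX) / global_scale   # would be cast to E4M3 in the real format
    effective_scale = block_scale_fp8 * global_scale
    q = snap_to_fp4(blocks / effective_scale)                  # FP4 codes
    return q, block_scale_fp8, global_scale

x = np.random.randn(128).astype(np.float32)
q, bs, gs = quantize_nvfp4_sketch(x)
x_hat = (q * bs * gs).reshape(x.shape)
print("max abs reconstruction error:", np.abs(x - x_hat).max())
```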

@Tracin @RalphMao, please chime in to provide more details.

Thanks June

juney-nvidia avatar Mar 24 '25 23:03 juney-nvidia

@juney-nvidia thanks! Why do we need the two levels of scaling factors? Is it because of the limited range of FP8? Can you please point me to the relevant code (or documentation explaining it) where these two levels of scaling factors are being applied?

To clarify: only the per-Tensor scaling factor is being calibrated during calibration, correct? Are fine-grained block scaling factors being recomputed dynamically for every input, or are they also static?

mklachko avatar Mar 24 '25 23:03 mklachko

Yes, the two levels of scaling factors are needed for accuracy: with only 4-bit values, a single per-tensor scale cannot track the local dynamic range of the data, so the fine-grained block scales are used on top of it. For activations, the per-tensor scaling factor is computed offline during calibration, while the block-wise scaling factors are computed online for each input.
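To make the offline/online split concrete, here is a rough NumPy sketch under the same assumptions as the snippet above (block size 16, FP4 E2M1 max 6.0, FP8 E4M3 max 448.0); the function names are hypothetical and not TensorRT-LLM or ModelOpt APIs:

```python
import numpy as np

BLOCK, FP4_MAX, FP8_MAX = 16, 6.0, 448.0

# Offline: calibration fixes a single per-tensor global scale from observed activations.
def calibrate_global_scale(calib_batches):
    running_amax = 0.0
    for act in calib_batches:
        running_amax = max(running_amax, float(np.abs(act).max()))
    return running_amax / (FP4_MAX * FP8_MAX)   # static once calibration is done

# Online: block-wise scales are recomputed from each incoming activation tensor.
def quantize_activation(act, global_scale):
    blocks = act.reshape(-1, BLOCK)
    block_amax = np.abs(blocks).max(axis=1, keepdims=True)
    block_scale_fp8 = (block_amax / FP4_MAX) / global_scale   # stored alongside the data in FP8
    scaled = blocks / (block_scale_fp8 * global_scale)         # then snapped to the FP4 grid by the kernel
    return scaled, block_scale_fp8

calib_batches = [np.random.randn(4, 64).astype(np.float32) for _ in range(8)]
global_scale = calibrate_global_scale(calib_batches)           # done once, offline
scaled, block_scales = quantize_activation(np.random.randn(4, 64), global_scale)  # done per input, online
```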

juney-nvidia avatar Mar 25 '25 08:03 juney-nvidia