
Do NVIDIA L20 GPUs support FP8 quantization?

Open jinweida opened this issue 1 year ago • 9 comments

System Info

CPU architecture: x86_64
Host RAM: 1 TB
GPU: 2x L20 SXM
Container: manually built with TRT 9.3 (Dockerfile.trt_llm_backend)
TensorRT-LLM version: 0.12.0.dev2024070200
Driver Version: 550.54.15
CUDA Version: 12.4
OS: Ubuntu 22.04

[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: Fp8 FMHA cannot be enabled on pre-Hopper Arch.

CUDA_VISIBLE_DEVICES=0,1 python ../quantization/quantize.py \
    --model_dir /nvme0/hub/modelscope/baichuan-inc/Baichuan2-7B-Chat \
    --dtype bfloat16 \
    --qformat fp8 \
    --calib_dataset /nvme0/ai/fp8/TensorRT-LLM/cnn_dailymail \
    --output_dir ./quantized_fp8 \
    --calib_size 256

CUDA_VISIBLE_DEVICES=0,1 trtllm-build --checkpoint_dir ./quantized_fp8/ \
    --output_dir ./quantized_fp8-1-gpu/ \
    --use_fp8_context_fmha enable \
    --gemm_plugin bfloat16

ERROR: (screenshot of the FP8 FMHA assertion failure quoted above)

Who can help?

No response

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

CUDA_VISIBLE_DEVICES=0,1 python ../quantization/quantize.py \
    --model_dir /nvme0/hub/modelscope/baichuan-inc/Baichuan2-7B-Chat \
    --dtype bfloat16 \
    --qformat fp8 \
    --calib_dataset /nvme0/ai/fp8/TensorRT-LLM/cnn_dailymail \
    --output_dir ./quantized_fp8 \
    --calib_size 256

CUDA_VISIBLE_DEVICES=0,1 trtllm-build --checkpoint_dir ./quantized_fp8/ \
    --output_dir ./quantized_fp8-1-gpu/ \
    --use_fp8_context_fmha enable \
    --gemm_plugin bfloat16

jinweida avatar Jul 08 '24 10:07 jinweida

@Tracin could you please have a look? Thanks

QiJune avatar Jul 08 '24 12:07 QiJune

Why do you use gemm_plugin with bfloat16 and not fp8?

Also, the docs mention disabling it: https://nvidia.github.io/TensorRT-LLM/performance/perf-best-practices.html#gemm-plugin

Also, where did you see that the NVIDIA L20 GPU supports FP8?

geraldstanje1 avatar Jul 08 '24 19:07 geraldstanje1

@jinweida It looks like FP8 FMHA is not supported on L20; please remove --use_fp8_context_fmha enable from your command.

Tracin avatar Jul 09 '24 02:07 Tracin
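For reference, a sketch of the rebuild with the FP8 context FMHA flag removed, reusing the checkpoint path, output path, and GEMM plugin setting from the commands above (an illustration of the suggested change, not verified on L20 here):

CUDA_VISIBLE_DEVICES=0,1 trtllm-build --checkpoint_dir ./quantized_fp8/ \
    --output_dir ./quantized_fp8-1-gpu/ \
    --gemm_plugin bfloat16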

@Tracin Where did you see that the NVIDIA L20 GPU supports FP8? He also uses --qformat fp8.

Spec: (screenshot of the L20 spec sheet)

Edit: just saw that it does list FP8...

geraldstanje1 avatar Jul 09 '24 03:07 geraldstanje1

@Tracin The dealer says the L20 GPU supports FP8.

jinweida avatar Jul 09 '24 06:07 jinweida

@jinweida Yeah, I mean you can still use FP8 GEMM on L20 if you remove --use_fp8_context_fmha enable. FP8 FMHA is a new feature and does not cover L20 for now.

Tracin avatar Jul 09 '24 07:07 Tracin

How do I accelerate FP8 with the L20? @Tracin

jinweida avatar Jul 09 '24 07:07 jinweida

How do I accelerate FP8 with the L20? @Tracin

If you mean accelerating an LLM on L20 with FP8 GEMM, you are doing it the correct way.

python ../quantization/quantize.py \
    --model_dir /nvme0/hub/modelscope/baichuan-inc/Baichuan2-7B-Chat \
    --dtype bfloat16 \
    --qformat fp8 \
    --calib_dataset /nvme0/ai/fp8/TensorRT-LLM/cnn_dailymail \
    --output_dir ./quantized_fp8 \
    --calib_size 256

CUDA_VISIBLE_DEVICES=0,1 trtllm-build --checkpoint_dir ./quantized_fp8/ \
    --output_dir ./quantized_fp8-1-gpu/

Tracin avatar Jul 09 '24 07:07 Tracin

FP8 FMHA support on SM89 (L20) is ongoing. So, you can only enable FP8 GEMM on L20 for now.

byshiue avatar Jul 17 '24 08:07 byshiue
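As a side note, a quick way to confirm which architecture a card reports is to query its compute capability with nvidia-smi (this query field needs a reasonably recent driver; the 550 series mentioned above is fine). Ada cards such as the L20 report 8.9, while Hopper reports 9.0, which is why the FP8 FMHA path is rejected here. This is only an illustrative check, not a command from the thread:

# Print the name and compute capability of each visible GPU
# (8.9 = Ada, e.g. L20/L40S; 9.0 = Hopper, e.g. H100/H200)
nvidia-smi --query-gpu=name,compute_cap --format=csv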

FP8 FMHA support on SM89 (L20) is ongoing. So, you can only enable FP8 GEMM on L20 for now.

Do you have an estimate of when the implementation will be ready? I ran benchmarks for FP8 Flash Attention V2 on the L40S with a Triton kernel, and the performance was very impressive. I'm really looking forward to it. @byshiue

pjs102793 avatar Oct 18 '24 11:10 pjs102793

FP8 FMHA support on SM89 (L20) is ongoing. So, you can only enable FP8 GEMM on L20 for now.

I'd like to know whether this feature is ready yet.

wujinzhong avatar May 29 '25 02:05 wujinzhong

Hi, what NVIDIA GPUs besides the H100 and H200 have FP8 support? And does FP8 also work with a ModernBERT model using TensorRT?

geraldstanje1 avatar May 29 '25 05:05 geraldstanje1