Does the NVIDIA L20 GPU support FP8 quantization?
System Info
- CPU architecture: x86_64
- Host RAM: 1 TB
- GPU: 2x L20 SXM
- Container: manually built container with TRT 9.3 (Dockerfile.trt_llm_backend)
- TensorRT-LLM version: 0.12.0.dev2024070200
- Driver Version: 550.54.15
- CUDA Version: 12.4
- OS: Ubuntu 22.04
```
CUDA_VISIBLE_DEVICES=0,1 python ../quantization/quantize.py \
    --model_dir /nvme0/hub/modelscope/baichuan-inc/Baichuan2-7B-Chat \
    --dtype bfloat16 \
    --qformat fp8 \
    --calib_dataset /nvme0/ai/fp8/TensorRT-LLM/cnn_dailymail \
    --output_dir ./quantized_fp8 \
    --calib_size 256

CUDA_VISIBLE_DEVICES=0,1 trtllm-build --checkpoint_dir ./quantized_fp8/ \
    --output_dir ./quantized_fp8-1-gpu/ \
    --use_fp8_context_fmha enable \
    --gemm_plugin bfloat16
```

ERROR:

```
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: FP8 FMHA cannot be enabled on pre-Hopper Arch.
```
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
```
CUDA_VISIBLE_DEVICES=0,1 python ../quantization/quantize.py \
    --model_dir /nvme0/hub/modelscope/baichuan-inc/Baichuan2-7B-Chat \
    --dtype bfloat16 \
    --qformat fp8 \
    --calib_dataset /nvme0/ai/fp8/TensorRT-LLM/cnn_dailymail \
    --output_dir ./quantized_fp8 \
    --calib_size 256

CUDA_VISIBLE_DEVICES=0,1 trtllm-build --checkpoint_dir ./quantized_fp8/ \
    --output_dir ./quantized_fp8-1-gpu/ \
    --use_fp8_context_fmha enable \
    --gemm_plugin bfloat16
```
@Tracin could you please have a look? Thanks
Why do you use gemm_plugin with bfloat16 and not fp8?
Also, they recommend disabling it: https://nvidia.github.io/TensorRT-LLM/performance/perf-best-practices.html#gemm-plugin
Also, where did you see that the NVIDIA L20 GPU supports FP8?
@jinweida It looks like FP8 FMHA is not supported on L20; please remove --use_fp8_context_fmha enable from your command.
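For reference, a minimal sketch of the adjusted build command (identical to the original invocation, just without the FP8 context FMHA flag; whether to keep the bfloat16 GEMM plugin is a separate question raised above):

```
# Build with FP8 GEMM only; FP8 context FMHA is left out because it is not
# supported on L20 (SM89) in this TensorRT-LLM version.
CUDA_VISIBLE_DEVICES=0,1 trtllm-build --checkpoint_dir ./quantized_fp8/ \
    --output_dir ./quantized_fp8-1-gpu/ \
    --gemm_plugin bfloat16
```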
@Tracin Where do you see that the NVIDIA L20 GPU supports FP8? He also uses --qformat fp8.
spec:
Edit: just saw there is FP8 listed...
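As a quick sanity check (assuming a driver recent enough to expose the compute_cap query), the compute capability can be read from the command line; FP8 tensor cores are present on SM 8.9 (Ada Lovelace, which includes the L20) and SM 9.0 (Hopper):

```
# Print GPU name and compute capability; an L20 should report 8.9
nvidia-smi --query-gpu=name,compute_cap --format=csv
```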
@Tracin The vendor says the L20 GPU supports FP8.
@jinweida Yeah, I mean you can still use FP8 GEMM on L20 if you remove --use_fp8_context_fmha enable. FP8 FMHA is a new feature and does not cover L20 for now.
How do I accelerate FP8 on the L20? @Tracin
If you mean accelerating an LLM on L20 with FP8 GEMM, you are already doing it the correct way.
```
python ../quantization/quantize.py \
    --model_dir /nvme0/hub/modelscope/baichuan-inc/Baichuan2-7B-Chat \
    --dtype bfloat16 \
    --qformat fp8 \
    --calib_dataset /nvme0/ai/fp8/TensorRT-LLM/cnn_dailymail \
    --output_dir ./quantized_fp8 \
    --calib_size 256

CUDA_VISIBLE_DEVICES=0,1 trtllm-build --checkpoint_dir ./quantized_fp8/ \
    --output_dir ./quantized_fp8-1-gpu/
```
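For completeness, a hypothetical smoke test of the resulting engine using the example runner shipped with TensorRT-LLM (the run.py path and flags are assumptions here and may differ between versions):

```
# Generate a few tokens from the FP8 engine to verify the build works end to end
CUDA_VISIBLE_DEVICES=0 python ../run.py \
    --engine_dir ./quantized_fp8-1-gpu/ \
    --tokenizer_dir /nvme0/hub/modelscope/baichuan-inc/Baichuan2-7B-Chat \
    --input_text "Hello" \
    --max_output_len 32
```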
The FP8 FMHA on SM89 (L20) is still in progress, so you can only enable FP8 GEMM on L20 for now.
> The FP8 FMHA on SM89 (L20) is still in progress, so you can only enable FP8 GEMM on L20 for now.
Do you have an estimate for when the implementation will be ready? I ran benchmarks for FP8 Flash Attention V2 on the L40S with Triton Kernel, and the performance was very impressive. I’m really looking forward to it. @byshiue
> The FP8 FMHA on SM89 (L20) is still in progress, so you can only enable FP8 GEMM on L20 for now.
I'd like to know whether this feature is ready yet.
Hi, which NVIDIA GPUs besides the H100 and H200 have FP8 support? And does FP8 also work with a ModernBERT model using TensorRT?