TensorRT-LLM
Performance for Llama-2-7b in fp8 is lower than benchmark
System Info
- CPU architecture: x86_64
- GPU: NVIDIA 4090 24GB
- TensorRT-LLM: 0.10.0.dev2024042300
- Triton Inference Server: r24.02
- OS: Ubuntu 22.04
Who can help?
Quantization: @Tracin Performance: @kaiyux
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I hope to use w8a8 to speed up the prefill inference time of the LLM. I obtained satisfactory results in the benchmark phase, but performance decreased after using a separately built engine. Test case: input_len=1024, batch=1,2,4,8, output_len=1. Reproduce steps:
1. run benchmark:
python benchmark.py -m llama_7b --mode plugin --quantization fp8 --max_input_len 1024 --max_batch_size 8 --batch_size "1;2;4;8" --input_output_len "1024,1"
2. run benchmark with engine:
python quantize.py --model_dir ${WORK_HF_DIR} \
    --dtype float16 \
    --qformat fp8 \
    --output_dir ${WORK_CKPT_DIR} \
    --calib_size 512
trtllm-build --checkpoint_dir ${WORK_CKPT_DIR} --output_dir ${WORK_ENGINE_DIR} --gemm_plugin float16 --gpt_attention_plugin float16 --max_input_len=1024 --max_batch_size=8
python benchmark.py -m llama_7b --engine_dir ${WORK_ENGINE_DIR} --mode plugin --max_input_len 1024 --max_batch_size 8 --batch_size "1;2;4;8" --input_output_len "1024,1"
Expected behavior
Can I achieve benchmark-level performance, and do any parameters need to be adjusted? Please confirm which performance result is correct. I also want to know how much improvement w8a8 can achieve compared to fp16 in this scenario.
Actual behavior
For fp16, there is not much difference in performance whether or not a separately built engine is used. For fp8, however, latency with the separately built engine increased by about 1.6x.
Additional notes
I hope to use w8a8 to improve the performance of the prefill stage. Looking forward to any suggestions.
Might you try adding --kv_cache_dtype fp8 when you convert the checkpoint for the with-engine case?
@byshiue Thanks for your reply. I tried adding the --kv_cache_dtype fp8 parameter, but performance didn't seem to improve.
Original command (without the fp8 KV cache):
python quantize.py --model_dir ${WORK_HF_DIR} \
    --dtype float16 \
    --qformat fp8 \
    --output_dir ${WORK_CKPT_DIR} \
    --calib_size 512
With --kv_cache_dtype fp8 added:
python quantize.py --model_dir ${WORK_HF_DIR} \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir ${WORK_CKPT_DIR} \
    --calib_size 512
config.json
Hi @SidaZh, we can disable the paged KV cache when building the engine; TensorRT-LLM can then choose faster fp8 GEMM kernels with a better opt_shape and achieve performance similar to benchmark.py.
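For reference, a minimal sketch of that rebuild, assuming the 0.10-era trtllm-build interface where --paged_kv_cache accepts enable/disable (please double-check against trtllm-build --help), reusing the paths and benchmark settings from the commands above:
# Rebuild the engine from the already-quantized fp8 checkpoint, with the paged KV cache disabled
trtllm-build --checkpoint_dir ${WORK_CKPT_DIR} \
    --output_dir ${WORK_ENGINE_DIR} \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --paged_kv_cache disable \
    --max_input_len=1024 \
    --max_batch_size=8
# Re-run the same measurement against the rebuilt engine
python benchmark.py -m llama_7b --engine_dir ${WORK_ENGINE_DIR} --mode plugin --max_input_len 1024 --max_batch_size 8 --batch_size "1;2;4;8" --input_output_len "1024,1"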
It works. Thanks for your suggestion.