TensorRT-LLM
Performance for Llama-2-7b in fp8 is lower than benchmark
System Info
- CPU architecture: x86_64
- GPU: NVIDIA 4090 24GB
- TensorRT-LLM: 0.10.0.dev2024042300
- Triton Inference Server: r24.02
- OS: Ubuntu 22.04
Who can help?
Quantization: @Tracin Performance: @kaiyux
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I hope to use w8a8 to speed up the prefill inference time of the LLM. I obtained satisfactory results in the benchmark phase, but performance decreased after using a separately built engine. Test case: input_len=1024, batch=1,2,4,8, output_len=1. Reproduce steps:
1. run benchmark:
python benchmark.py -m llama_7b --mode plugin --quantization fp8 --max_input_len 1024 --max_batch_size 8 --batch_size "1;2;4;8" --input_output_len "1024,1"
2. run benchmark with engine:
python quantize.py --model_dir ${WORK_HF_DIR} \
    --dtype float16 \
    --qformat fp8 \
    --output_dir ${WORK_CKPT_DIR} \
    --calib_size 512
trtllm-build --checkpoint_dir ${WORK_CKPT_DIR} --output_dir ${WORK_ENGINE_DIR} --gemm_plugin float16 --gpt_attention_plugin float16 --max_input_len=1024 --max_batch_size=8
python benchmark.py -m llama_7b --engine_dir ${WORK_ENGINE_DIR} --mode plugin --max_input_len 1024 --max_batch_size 8 --batch_size "1;2;4;8" --input_output_len "1024,1"
Expected behavior
Can I achieve benchmark-level performance, and do any parameters need to be adjusted? Please confirm which performance result is correct. I also want to know how much improvement w8a8 can achieve compared to fp16 in this scenario.
Actual behavior
For fp16, there is not much difference in performance whether or not a separately built engine is used. For fp8, however, latency with the separately built engine increased by about 1.6x.
Additional notes
I hope to use w8a8 to improve the performance of the prefill stage. Looking forward to any suggestions.
Might you try adding --kv_cache_dtype fp8 when you convert the checkpoint for the with-engine case?
@byshiue Thanks for your reply. I tried adding the --kv_cache_dtype fp8 parameter, but performance didn't seem to improve.
Original command (without the fp8 KV cache):
python quantize.py --model_dir ${WORK_HF_DIR} \
    --dtype float16 \
    --qformat fp8 \
    --output_dir ${WORK_CKPT_DIR} \
    --calib_size 512
With --kv_cache_dtype fp8 added:
python quantize.py --model_dir ${WORK_HF_DIR} \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir ${WORK_CKPT_DIR} \
    --calib_size 512
config.json
Hi @SidaZh, we can disable the paged KV cache when building the engine; TensorRT-LLM can then choose faster fp8 GEMM kernels with a better opt_shape and achieve performance similar to benchmark.py.
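For reference, a minimal sketch of that rebuild, assuming the 0.10-era trtllm-build interface where --paged_kv_cache accepts enable/disable (please double-check against trtllm-build --help), reusing the paths and benchmark settings from the commands above:
# Rebuild the engine from the already-quantized fp8 checkpoint, with the paged KV cache disabled
trtllm-build --checkpoint_dir ${WORK_CKPT_DIR} \
    --output_dir ${WORK_ENGINE_DIR} \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --paged_kv_cache disable \
    --max_input_len=1024 \
    --max_batch_size=8
# Re-run the same measurement against the rebuilt engine
python benchmark.py -m llama_7b --engine_dir ${WORK_ENGINE_DIR} --mode plugin --max_input_len 1024 --max_batch_size 8 --batch_size "1;2;4;8" --input_output_len "1024,1"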
It works. Thanks for your suggestion.