int8 lower performance than fp16
System Info
- CPU architecture: x86_64
- Host RAM: 1 TB
- GPU: 8x H100 SXM
- Container: manually built from Dockerfile.trt_llm_backend with TRT 9.3
- TensorRT-LLM: v0.9 main branch (https://github.com/NVIDIA/TensorRT-LLM/commit/850b6fa1e710d25769f2b560d897d2bd424a645e)
- Driver Version: 535.161.07
- CUDA Version: 12.2
- OS: Ubuntu 22.04
Who can help?
@byshiue
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Build the same model twice, once in fp16 and once with int8 weight-only quantization, and run both with `enable_chunked_context=True`. The int8 commands are shown below; a sketch of the fp16 baseline conversion follows them.
```bash
python3 convert_checkpoint.py \
    --model_dir ./llama-70b \
    --output_dir ./llama-70b_tp4 \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --tp_size 4
```
```bash
trtllm-build \
    --checkpoint_dir ./llama-70b_tp4 \
    --output_dir engines/llama-70b-2 \
    --gemm_plugin float16 \
    --max_batch_size 192 \
    --max_input_len 2048 \
    --max_output_len 384 \
    --gpt_attention_plugin float16 \
    --paged_kv_cache enable \
    --remove_input_padding enable \
    --multi_block_mode disable \
    --max_num_tokens 393216 \
    --context_fmha enable \
    --use_paged_context_fmha enable \
    --enable_xqa enable \
    --workers 4 \
    --use_custom_all_reduce enable \
    --opt_num_tokens 192
```
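For reference, the fp16 baseline is the same checkpoint converted without the weight-only flags; a minimal sketch, assuming the same model path and an illustrative output directory (the `trtllm-build` command above is reused unchanged apart from `--checkpoint_dir`/`--output_dir`):

```bash
# fp16 baseline conversion (output directory name is a placeholder)
python3 convert_checkpoint.py \
    --model_dir ./llama-70b \
    --output_dir ./llama-70b_tp4_fp16 \
    --dtype float16 \
    --tp_size 4
```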
Expected behavior
int8 weight-only quantization should perform better than (or at least on par with) fp16.
Actual behavior
int8 latency is 3x higher than fp16
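One quick way to sanity-check the gap outside the serving stack is to time `examples/run.py` against each engine directory. The sketch below is illustrative only: the fp16 engine directory name, the relative path to `run.py`, the prompt, and the output length are placeholders, and this Python-runtime path does not exercise chunked context.

```bash
# Rough single-request latency comparison; warm up once per engine before trusting the numbers.
for ENGINE_DIR in engines/llama-70b-2 engines/llama-70b-fp16; do
  echo "=== ${ENGINE_DIR} ==="
  time mpirun -n 4 --allow-run-as-root \
    python3 ../run.py \
      --engine_dir "${ENGINE_DIR}" \
      --tokenizer_dir ./llama-70b \
      --max_output_len 128 \
      --input_text "Summarize the history of GPUs in a few sentences."
done
```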
Additional notes
The issue may be related to enabling chunked context.
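For completeness, here is how chunked context was presumably switched on. Since the container comes from Dockerfile.trt_llm_backend, the engines are assumed to be served through the Triton tensorrtllm_backend; the snippet below is only a sketch that assumes the backend's config.pbtxt template exposes an `enable_chunked_context` placeholder (repository path and the other values are illustrative, not copied from the actual deployment):

```bash
# Hypothetical: fill in the tensorrt_llm model config before launching Triton
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    "engine_dir:engines/llama-70b-2,enable_chunked_context:true,batching_strategy:inflight_fused_batching"
```

Note that chunked context requires the engine to be built with `--use_paged_context_fmha enable`, which matches the build command above.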