
int8 lower performance than fp16

siddhatiwari opened this issue 10 months ago · 2 comments

System Info

  • CPU architecture: x86_64
  • Host RAM: 1 TB
  • GPU: 8x H100 SXM
  • Container: manually built from Dockerfile.trt_llm_backend with TRT 9.3
  • TensorRT-LLM: v0.9 main branch (https://github.com/NVIDIA/TensorRT-LLM/commit/850b6fa1e710d25769f2b560d897d2bd424a645e)
  • Driver Version: 535.161.07
  • CUDA Version: 12.2
  • OS: Ubuntu 22.04

Who can help?

@byshiue

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

Build the same model twice, once in fp16 and once with int8 weight-only quantization, and serve both with enable_chunked_context=True. The int8 commands are below; the fp16 baseline conversion is sketched after the build command.

python3 convert_checkpoint.py \
  --model_dir ./llama-70b \
  --output_dir ./llama-70b_tp4 \
  --dtype float16 \
  --use_weight_only \
  --weight_only_precision int8 \
  --tp_size 4


trtllm-build \
  --checkpoint_dir ./llama-70b_tp4  \
  --output_dir engines/llama-70b-2 \
  --gemm_plugin float16 \
  --max_batch_size 192 \
  --max_input_len 2048 \
  --max_output_len 384 \
  --gpt_attention_plugin float16 \
  --paged_kv_cache enable \
  --remove_input_padding enable \
  --multi_block_mode disable \
  --max_num_tokens 393216 \
  --context_fmha enable \
  --use_paged_context_fmha enable \
  --enable_xqa enable \
  --workers 4 \
  --use_custom_all_reduce enable \
  --opt_num_tokens 192
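
For reference, the fp16 baseline checkpoint is produced the same way, just without the weight-only flags (the output directory name below is only a placeholder from my setup); the trtllm-build invocation is then identical to the one above apart from --checkpoint_dir and --output_dir:

python3 convert_checkpoint.py \
  --model_dir ./llama-70b \
  --output_dir ./llama-70b_fp16_tp4 \
  --dtype float16 \
  --tp_size 4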

Expected behavior

The int8 weight-only engine should have lower latency than the fp16 engine, or at least match it.

actual behavior

int8 latency is 3x higher than fp16

additional notes

The issue may be related to enabling chunked context
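
To check this, my plan is to A/B against an int8 engine built with paged context FMHA disabled (chunked context requires use_paged_context_fmha, so this keeps it off) and then serve that engine with enable_chunked_context=False. This is only a sketch of the control build I intend to try; the output directory is a placeholder:

trtllm-build \
  --checkpoint_dir ./llama-70b_tp4 \
  --output_dir engines/llama-70b-no-chunked \
  --gemm_plugin float16 \
  --max_batch_size 192 \
  --max_input_len 2048 \
  --max_output_len 384 \
  --gpt_attention_plugin float16 \
  --paged_kv_cache enable \
  --remove_input_padding enable \
  --multi_block_mode disable \
  --max_num_tokens 393216 \
  --context_fmha enable \
  --use_paged_context_fmha disable \
  --enable_xqa enable \
  --workers 4 \
  --use_custom_all_reduce enable \
  --opt_num_tokens 192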

siddhatiwari · Mar 29 '24