
multi_block_mode enable runtime crash

Open siddhatiwari opened this issue 10 months ago • 2 comments

System Info

  • CPU architecture: x86_64
  • Host RAM: 1 TB
  • GPU: 8x H100 SXM
  • Container: manually built with TRT 9.3 via Dockerfile.trt_llm_backend (nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3 doesn't work for the TRT-LLM main branch?)
  • TRT-LLM: v0.9 main branch (https://github.com/NVIDIA/TensorRT-LLM/commit/850b6fa1e710d25769f2b560d897d2bd424a645e)
  • Driver Version: 535.161.07
  • CUDA Version: 12.2
  • OS: Ubuntu 22.04

Who can help?

@byshiue

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

Build the engine with the commands below, then run it with the runtime options enable_chunked_context=True and enable_kv_cache_reuse=True.

python3 convert_checkpoint.py \
  --model_dir ./llama-70b \
  --output_dir ./llama-70b_tp4 \
  --dtype float16 \
  --use_weight_only \
  --weight_only_precision int8 \
  --tp_size 4


trtllm-build \
  --checkpoint_dir ./llama-70b_tp4  \
  --output_dir engines/llama-70b-2 \
  --gemm_plugin float16 \
  --max_batch_size 192 \
  --max_input_len 2048 \
  --max_output_len 384 \
  --gpt_attention_plugin float16 \
  --paged_kv_cache enable \
  --remove_input_padding enable \
  --multi_block_mode enable \
  --max_num_tokens 393216 \
  --context_fmha enable \
  --use_paged_context_fmha enable \
  --enable_xqa enable \
  --workers 4 \
  --use_custom_all_reduce enable \
  --opt_num_tokens 192
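
For reference, the two runtime options mentioned above are typically passed to the Triton TensorRT-LLM backend through the model's config.pbtxt rather than at build time. A minimal sketch, assuming the standard tensorrtllm_backend parameter names (exact key names may differ across versions):

```
# Fragment of triton_model_repo/tensorrt_llm/config.pbtxt (assumed parameter keys)
parameters: {
  key: "enable_kv_cache_reuse"
  value: { string_value: "true" }
}
parameters: {
  key: "enable_chunked_context"
  value: { string_value: "true" }
}
```

Both options require the engine to be built with --use_paged_context_fmha enable, as in the build command above.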

Expected behavior

The engine should run normally and not crash at runtime.

actual behavior

The engine crashes at runtime after serving a few requests.

additional notes

Rebuilding with multi_block_mode=disable fixes the issue. The crash may also be related to chunked context being enabled.
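
The workaround amounts to rebuilding the engine with the same command as in the reproduction, changing only the multi-block flag:

```shell
# Same trtllm-build invocation as above, with multi-block attention disabled
trtllm-build \
  --checkpoint_dir ./llama-70b_tp4 \
  --output_dir engines/llama-70b-2 \
  --gemm_plugin float16 \
  --max_batch_size 192 \
  --max_input_len 2048 \
  --max_output_len 384 \
  --gpt_attention_plugin float16 \
  --paged_kv_cache enable \
  --remove_input_padding enable \
  --multi_block_mode disable \
  --max_num_tokens 393216 \
  --context_fmha enable \
  --use_paged_context_fmha enable \
  --enable_xqa enable \
  --workers 4 \
  --use_custom_all_reduce enable \
  --opt_num_tokens 192
```

This avoids the crash at the cost of losing multi-block attention, which mainly helps decode throughput for long sequences at small batch sizes.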

siddhatiwari · Mar 29 '24 16:03