TensorRT-LLM
Runtime crash when multi_block_mode is enabled
System Info
- CPU architecture: x86_64
- Host RAM: 1 TB
- GPU: 8x H100 SXM
- Container: manually built with TRT 9.3 from Dockerfile.trt_llm_backend (nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3 doesn't seem to work with the TRT-LLM main branch?)
- TensorRT-LLM: v0.9 main branch (https://github.com/NVIDIA/TensorRT-LLM/commit/850b6fa1e710d25769f2b560d897d2bd424a645e)
- Driver Version: 535.161.07
- CUDA Version: 12.2
- OS: Ubuntu 22.04
Who can help?
@byshiue
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Build the engine with the commands below and run it with `enable_chunked_context=True` and `enable_kv_cache_reuse=True` (see the config sketch after the build commands).
```bash
python3 convert_checkpoint.py \
    --model_dir ./llama-70b \
    --output_dir ./llama-70b_tp4 \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --tp_size 4
```
```bash
trtllm-build \
    --checkpoint_dir ./llama-70b_tp4 \
    --output_dir engines/llama-70b-2 \
    --gemm_plugin float16 \
    --max_batch_size 192 \
    --max_input_len 2048 \
    --max_output_len 384 \
    --gpt_attention_plugin float16 \
    --paged_kv_cache enable \
    --remove_input_padding enable \
    --multi_block_mode enable \
    --max_num_tokens 393216 \
    --context_fmha enable \
    --use_paged_context_fmha enable \
    --enable_xqa enable \
    --workers 4 \
    --use_custom_all_reduce enable \
    --opt_num_tokens 192
```
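The two runtime options mentioned above are not build flags; assuming the engine is served through the Triton tensorrtllm backend (as the container suggests), they would be set in the `tensorrt_llm` model's `config.pbtxt`. The excerpt below is only a sketch: the parameter keys are assumed from the tensorrtllm_backend config template, and all other required parameters are omitted.

```
parameters: {
  key: "enable_chunked_context"
  value: { string_value: "true" }
}
parameters: {
  key: "enable_kv_cache_reuse"
  value: { string_value: "true" }
}
```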
Expected behavior
The engine should run normally and not crash at runtime.
Actual behavior
The engine crashes at runtime after serving a few requests.
Additional notes
Building the engine with `--multi_block_mode disable` fixes the issue (see the workaround sketch below). The crash may also be related to enabling chunked context.
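For reference, the workaround is the same trtllm-build invocation with only the multi-block flag flipped; a sketch is below (the output directory name is just a placeholder, every other flag is unchanged from the command above):

```bash
# Same build as above, but with multi_block_mode disabled as a workaround.
trtllm-build \
    --checkpoint_dir ./llama-70b_tp4 \
    --output_dir engines/llama-70b-2-no-mbm \
    --gemm_plugin float16 \
    --max_batch_size 192 \
    --max_input_len 2048 \
    --max_output_len 384 \
    --gpt_attention_plugin float16 \
    --paged_kv_cache enable \
    --remove_input_padding enable \
    --multi_block_mode disable \
    --max_num_tokens 393216 \
    --context_fmha enable \
    --use_paged_context_fmha enable \
    --enable_xqa enable \
    --workers 4 \
    --use_custom_all_reduce enable \
    --opt_num_tokens 192
```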