multi_block_mode enable runtime crash
System Info
- CPU architecture: x86_64
- Host RAM: 1 TB
- GPU: 8x H100 SXM
- Container: manually built from `Dockerfile.trt_llm_backend` with TRT 9.3 (`nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3` doesn't seem to work with the TRT-LLM main branch)
- TRT-LLM: v0.9 main branch (https://github.com/NVIDIA/TensorRT-LLM/commit/850b6fa1e710d25769f2b560d897d2bd424a645e)
- Driver Version: 535.161.07
- CUDA Version: 12.2
- OS: Ubuntu 22.04
Who can help?
@byshiue
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Build the engine with the commands below, then run it with the runtime options `enable_chunked_context=True` and `enable_kv_cache_reuse=True`.
```shell
python3 convert_checkpoint.py \
    --model_dir ./llama-70b \
    --output_dir ./llama-70b_tp4 \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --tp_size 4
```

```shell
trtllm-build \
    --checkpoint_dir ./llama-70b_tp4 \
    --output_dir engines/llama-70b-2 \
    --gemm_plugin float16 \
    --max_batch_size 192 \
    --max_input_len 2048 \
    --max_output_len 384 \
    --gpt_attention_plugin float16 \
    --paged_kv_cache enable \
    --remove_input_padding enable \
    --multi_block_mode enable \
    --max_num_tokens 393216 \
    --context_fmha enable \
    --use_paged_context_fmha enable \
    --enable_xqa enable \
    --workers 4 \
    --use_custom_all_reduce enable \
    --opt_num_tokens 192
```
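For reference, the runtime options mentioned above are set in the Triton TRT-LLM backend's `config.pbtxt` for the `tensorrt_llm` model, roughly as follows (a sketch based on the tensorrtllm_backend parameter template; exact key names may differ across versions):

```
parameters: {
  key: "enable_chunked_context"
  value: { string_value: "true" }
}
parameters: {
  key: "enable_kv_cache_reuse"
  value: { string_value: "true" }
}
```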
Expected behavior
The engine should run normally and not crash at runtime.
actual behavior
The engine crashes at runtime after serving a few requests.
additional notes
Rebuilding with `--multi_block_mode disable` fixes the issue. The crash may also be related to enabling chunked context.
@siddhatiwari this is a known issue caused by the XQA kernels, and we are working on a fix.
This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 15 days.
This issue was closed because it has been stalled for 15 days with no activity.