TensorRT-LLM
Chunked context incomplete outputs
System Info
- CPU architecture: x86_64
- Host RAM: 1 TB
- GPU: 8x H100 SXM
- Container: manually built with TRT 9.3 from Dockerfile.trt_llm_backend (nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3 doesn't work for the TRT-LLM main branch?)
- TensorRT-LLM: v0.9 main branch (https://github.com/NVIDIA/TensorRT-LLM/commit/850b6fa1e710d25769f2b560d897d2bd424a645e)
- Driver Version: 535.161.07
- CUDA Version: 12.2
- OS: Ubuntu 22.04
Who can help?
@byshiue @Shixiaowei02
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Running the engine at a high rate of queries per second causes errors and incomplete output. With a very low max_num_tokens, outputs are incomplete even at a very low query rate.

Build a Llama engine with use_paged_context_fmha=enable and run it with enable_chunked_context=True. The issue occurs with Llama models of different sizes and with different max_num_tokens values.
python3 convert_checkpoint.py \
--model_dir ./llama-70b \
--output_dir ./llama-70b_tp2 \
--dtype float16 \
--tp_size 2
trtllm-build \
--checkpoint_dir ./llama-70b_tp2 \
--output_dir engines/llama-70b-1 \
--gemm_plugin float16 \
--max_batch_size 256 \
--max_input_len 2048 \
--max_output_len 512 \
--gpt_attention_plugin float16 \
--paged_kv_cache enable \
--remove_input_padding enable \
--multi_block_mode disable \
--max_num_tokens 8192 \
--context_fmha enable \
--use_paged_context_fmha enable \
--context_fmha_fp32_acc enable \
--use_fused_mlp \
--enable_xqa enable \
--use_custom_all_reduce enable
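For the runtime side, here is a minimal sketch of how chunked context is typically switched on when serving through the Triton tensorrtllm_backend, using its tools/fill_template.py helper. The engine path, the values, and the exact set of template variables (including enable_chunked_context and enable_kv_cache_reuse) are illustrative assumptions and may differ between backend versions; check the config.pbtxt template shipped with your version.

```bash
# Illustrative sketch: fill the tensorrt_llm model's config.pbtxt in place.
# Paths and variable names are assumptions based on the stock
# all_models/inflight_batcher_llm template; adjust to your repo layout.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
  "triton_max_batch_size:256,decoupled_mode:True,engine_dir:/engines/llama-70b-1,batching_strategy:inflight_fused_batching,enable_chunked_context:True,enable_kv_cache_reuse:False,kv_cache_free_gpu_mem_fraction:0.9"
```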
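To reproduce the high-QPS condition, something like the following concurrent curl loop against Triton's generate endpoint is enough; this is a hypothetical load sketch, and the endpoint, model name ("ensemble"), and field names (text_input, max_tokens, stream) assume the default tensorrtllm_backend ensemble, so adjust them to match your deployment.

```bash
# Hypothetical smoke test: fire 200 concurrent requests at the server and
# watch for truncated outputs / "slice ... exceeds buffer size" errors.
for i in $(seq 1 200); do
  curl -s -X POST localhost:8000/v2/models/ensemble/generate \
    -H "Content-Type: application/json" \
    -d '{"text_input": "Summarize: <long prompt here>", "max_tokens": 256, "stream": false}' \
    > /dev/null &
done
wait
```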
Expected behavior
All outputs should be complete/not truncated, even under high load. Completion latency should increase under high load, but outputs shouldn't be affected.
Actual behavior
Errors for many requests during inference, which return incomplete/truncated outputs:
[TensorRT-LLM][ERROR] Encountered error for requestId 7692: Encountered an error in forward function: slice 2044 exceeds buffer size 420
{"asctime": "2024-03-27 17:28:07,904", "levelname": "ERROR", "message": "Exception while reading stream response: {\"status\": \"error\", \"message\": \"in ensemble 'ensemble', Encountered error for requestId 7657: Encountered an error in forward function: slice 2044 exceeds buffer size 420\"}", "exc_info": "Traceback (most recent call last):\n File \"/app/model_wrapper.py\", line 257, in write_response_to_queue\n async for chunk in generator:\n File \"/app/model/model.py\", line 116, in generate\n async for i in result_iterator:\n File \"/packages/client.py\", line 181, in infer\n raise Exception(error_message)\nException: {\"status\": \"error\", \"message\": \"in ensemble 'ensemble', Encountered error for requestId 7657: Encountered an error in forward function: slice 2044 exceeds buffer size 420\"}"}
Additional notes
I tried this with multiple Llama-based models and got the same error. Setting enable_kv_cache_reuse=True seems to make the errors happen more frequently.
Thank you for finding the issue. I have started troubleshooting.
Any updates on this? I am also facing the same buffer size issue.
I traced the crash to tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TokenPtr tensorrt_llm::batch_manager::TrtGptModelInflightBatching::decoderStepAsync(tensorrt_llm::batch_manager::RequestTable&, const ReqIdsVec&, const ReqIdsVec&), but that code may be closed source.
Any updates on this? It would be great to see the full speedup from this feature: https://github.com/NVIDIA/TensorRT-LLM/issues/317#issuecomment-1810841752
Could you try this guide, which uses chunked context to run long-context inputs?