tensorrtllm_backend
Inference server stalling
System Info
- tensorrtllm_backend built using Dockerfile.trt_llm_backend
- main branch TensorRT-LLM (0.13.0.dev20240813000)
- 8xH100 SXM
- Driver Version: 535.129.03
- CUDA Version: 12.5
After roughly 30 seconds of inference requests, the inference server stalls and stops responding to any further requests. There are no errors or crashes visible in the logs. The server is running in decoupled mode with dynamic_batching.
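For context, decoupled mode and batching were set up through the backend's config templates, roughly as below. This is only a sketch following the repo's fill_template.py flow; the model repo path, engine path, and exact values are placeholders and may differ from this deployment.
# Fill the tensorrt_llm model config: decoupled streaming on, inflight batching,
# batch size matching the engine build. Paths and values are placeholders.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    "triton_max_batch_size:160,decoupled_mode:True,batching_strategy:inflight_fused_batching,engine_dir:/engines/llama2-70b-eng,max_queue_delay_microseconds:0"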
These are the parameters for the engine used:
python3 ../quantization/quantize.py --model_dir ./llama2-70b \
--dtype float16 \
--qformat fp8 \
--kv_cache_dtype fp8 \
--output_dir ./llama2-70b-out \
--calib_size 512 \
--tp_size 2
CUDA_VISIBLE_DEVICES=6,7 trtllm-build --checkpoint_dir ./llama2-70b-out \
--output_dir ./llama2-70b-eng \
--gemm_plugin float16 \
--max_batch_size 160 \
--max_input_len 2048 \
--max_seq_len 2560 \
--context_fmha enable \
--gpt_attention_plugin float16 \
--paged_kv_cache enable \
--remove_input_padding enable \
--max_num_tokens 65536 \
--enable_xqa enable \
--bert_context_fmha_fp32_acc enable \
--workers 2 \
--multiple_profiles enable \
--use_fp8_context_fmha enable
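The server was then started along the lines of the repo's standard launch script (a sketch; the model repo path is a placeholder, and world_size matches the tp_size of 2 used above):
# Launch Triton serving the TP=2 engine; --world_size must match the engine's tp_size.
python3 scripts/launch_triton_server.py \
    --world_size 2 \
    --model_repo /path/to/triton_model_repo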
Who can help?
No response
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
- Build the inference server docker image
- Build the llama 70b engine
- Start the server serving the engine
- Send a high rate of requests per second with ~2k context length (see the sketch below)
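A rough way to generate that kind of load, assuming the standard ensemble model name and Triton's HTTP generate_stream endpoint (prompt contents, concurrency level, and model name are placeholders):
# Fire 64 concurrent requests at the streaming generate endpoint with ~2k-token prompts.
# The "ensemble" model name and request fields follow the repo's curl examples and may
# need adjusting for this deployment.
PROMPT=$(python3 -c "print('word ' * 2000)")
for i in $(seq 1 64); do
  curl -s -X POST localhost:8000/v2/models/ensemble/generate_stream \
    -d "{\"text_input\": \"${PROMPT}\", \"max_tokens\": 256}" > /dev/null &
done
wait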
Expected behavior
Inference server doesn't stall
Actual behavior
The inference server stalls and stops responding to requests.
Additional notes
Initial requests complete successfully, so it is unclear why the server stalls afterwards.