tensorrtllm_backend
Inference server stalling
System Info
- tensorrtllm_backend built using Dockerfile.trt_llm_backend
- main branch TensorRT-LLM (0.13.0.dev20240813000)
- 8xH100 SXM
- Driver Version: 535.129.03
- CUDA Version: 12.5
After roughly 30 seconds of inference requests, the inference server stalls and stops responding to any further requests. There are no errors or crashes visible in the logs. The server is running in decoupled mode with dynamic_batching.
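For context, decoupled mode and batching were set up through the backend's config templates, roughly as below. This is only a sketch following the repo's fill_template.py flow; the model repo path, engine path, and exact values are placeholders and may differ from this deployment.
# Fill the tensorrt_llm model config: decoupled streaming on, inflight batching,
# batch size matching the engine build. Paths and values are placeholders.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    "triton_max_batch_size:160,decoupled_mode:True,batching_strategy:inflight_fused_batching,engine_dir:/engines/llama2-70b-eng,max_queue_delay_microseconds:0"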
These are the parameters for the engine used:
python3 ../quantization/quantize.py --model_dir ./llama2-70b \
--dtype float16 \
--qformat fp8 \
--kv_cache_dtype fp8 \
--output_dir ./llama2-70b-out \
--calib_size 512 \
--tp_size 2
CUDA_VISIBLE_DEVICES=6,7 trtllm-build --checkpoint_dir ./llama2-70b-out \
--output_dir ./llama2-70b-eng \
--gemm_plugin float16 \
--max_batch_size 160 \
--max_input_len 2048 \
--max_seq_len 2560 \
--context_fmha enable \
--gpt_attention_plugin float16 \
--paged_kv_cache enable \
--remove_input_padding enable \
--max_num_tokens 65536 \
--enable_xqa enable \
--bert_context_fmha_fp32_acc enable \
--workers 2 \
--multiple_profiles enable \
--use_fp8_context_fmha enable
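The server was then started along the lines of the repo's standard launch script (a sketch; the model repo path is a placeholder, and world_size matches the tp_size of 2 used above):
# Launch Triton serving the TP=2 engine; --world_size must match the engine's tp_size.
python3 scripts/launch_triton_server.py \
    --world_size 2 \
    --model_repo /path/to/triton_model_repo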
Who can help?
No response
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
- Build the inference server docker image
- Build the llama 70b engine
- Start the server serving the engine
- Send a high rate of requests per second with ~2k context length (see the sketch below)
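A rough way to generate that kind of load, assuming the standard ensemble model name and Triton's HTTP generate_stream endpoint (prompt contents, concurrency level, and model name are placeholders):
# Fire 64 concurrent requests at the streaming generate endpoint with ~2k-token prompts.
# The "ensemble" model name and request fields follow the repo's curl examples and may
# need adjusting for this deployment.
PROMPT=$(python3 -c "print('word ' * 2000)")
for i in $(seq 1 64); do
  curl -s -X POST localhost:8000/v2/models/ensemble/generate_stream \
    -d "{\"text_input\": \"${PROMPT}\", \"max_tokens\": 256}" > /dev/null &
done
wait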
Expected behavior
Inference server doesn't stall
Actual behavior
The inference server stalls and stops responding to requests.
Additional notes
Initial requests complete successfully, so it is unclear why the server stalls afterwards.