
Inference server stalling

Open siddhatiwari opened this issue 6 months ago • 4 comments

System Info

  • tensorrtllm_backend built using Dockerfile.trt_llm_backend
  • TensorRT-LLM main branch (0.13.0.dev20240813000)
  • 8xH100 SXM
  • Driver Version: 535.129.03
  • CUDA Version: 12.5

After roughly 30 seconds of serving inference requests, the server stalls and stops responding to all requests. No errors or crashes appear in the logs. The server runs in decoupled mode with dynamic_batching enabled.

These are the commands used to build the engine:

python3 ../quantization/quantize.py --model_dir ./llama2-70b \
                                   --dtype float16 \
                                   --qformat fp8 \
                                   --kv_cache_dtype fp8 \
                                   --output_dir ./llama2-70b-out \
                                   --calib_size 512 \
                                   --tp_size 2

CUDA_VISIBLE_DEVICES=6,7 trtllm-build --checkpoint_dir ./llama2-70b-out  \
             --output_dir ./llama2-70b-eng \
             --gemm_plugin float16 \
             --max_batch_size 160 \
             --max_input_len 2048 \
             --max_seq_len 2560 \
             --context_fmha enable \
             --gpt_attention_plugin float16 \
             --paged_kv_cache enable \
             --remove_input_padding enable \
             --max_num_tokens 65536 \
             --enable_xqa enable \
             --bert_context_fmha_fp32_acc enable \
             --workers 2 \
             --multiple_profiles enable \
             --use_fp8_context_fmha enable

Who can help?

No response

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

  1. Build the inference server Docker image using Dockerfile.trt_llm_backend
  2. Build the Llama 2 70B engine with the commands above
  3. Start the server serving the engine
  4. Send requests at a high rate with ~2k-token context lengths (a rough client sketch follows this list)
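
A rough load-generation sketch for step 4. The model name "ensemble", the input tensor names (text_input, max_tokens, stream), and the gRPC endpoint localhost:8001 are assumptions based on the stock tensorrtllm_backend model repository, not taken from the report; since the model is decoupled, requests have to go through the gRPC streaming API:

# Rough load-generation sketch (assumptions noted above: stock "ensemble"
# model, default input tensor names, gRPC on localhost:8001).
import queue
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient


def callback(responses, result, error):
    # Decoupled mode streams many responses per request; collect them all.
    responses.put(error if error is not None else result)


def send_request(client, prompt, request_id):
    inputs = [
        grpcclient.InferInput("text_input", [1, 1], "BYTES"),
        grpcclient.InferInput("max_tokens", [1, 1], "INT32"),
        grpcclient.InferInput("stream", [1, 1], "BOOL"),
    ]
    inputs[0].set_data_from_numpy(np.array([[prompt]], dtype=object))
    inputs[1].set_data_from_numpy(np.array([[256]], dtype=np.int32))
    inputs[2].set_data_from_numpy(np.array([[True]], dtype=bool))
    client.async_stream_infer("ensemble", inputs, request_id=request_id)


responses = queue.Queue()
prompt = "lorem ipsum " * 1000  # roughly 2k-token context
client = grpcclient.InferenceServerClient("localhost:8001")
client.start_stream(callback=partial(callback, responses))
for i in range(500):  # sustain a high request rate
    send_request(client, prompt, str(i))
client.stop_stream()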

Expected behavior

Inference server doesn't stall

actual behavior

Inference server stalls

additional notes

Initial requests complete successfully, so it's unclear why the server stalls afterwards.
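
A minimal probe like the one below (again assuming the default gRPC port 8001 and the stock "ensemble" model name) can help confirm whether the Triton process stays live and keeps answering health checks while inference requests go unanswered:

# Minimal liveness/statistics probe. Port, model name, and polling interval
# are assumptions for illustration, not taken from the original report.
import time

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:8001")
while True:
    print("server live:", client.is_server_live())
    print("model ready:", client.is_model_ready("ensemble"))
    # Per-model counters: if the success count stops increasing while requests
    # are still being sent, the scheduler is stalled rather than the process dead.
    print(client.get_inference_statistics(model_name="ensemble", as_json=True))
    time.sleep(5)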

siddhatiwari avatar Aug 17 '24 06:08 siddhatiwari