
Queue size increases indefinitely

Open QLutz opened this issue 7 months ago • 2 comments

System Info

OS version: Linux
Model being used (curl 127.0.0.1:8080/info | jq): TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-AWQ
Hardware used (GPUs, how many, on which cloud) (nvidia-smi): 1x L40S
The current version being used: 2.0.4

Information

  • [X] Docker
  • [ ] The CLI directly

Tasks

  • [X] An officially supported command
  • [ ] My own modifications

Reproduction

Launch TGI with max_total_tokens=max_batch_prefill_tokens=16384, max_input_length=16383, and quantize=awq. After a few hundred requests, the pod starts returning empty responses, and only several seconds after each request has been made. Monitoring shows that tgi_queue_size increases steadily and never goes back down.
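For reference, a launch command matching this configuration might look like the following (a sketch: the image tag, shard count, and port mapping are assumptions, and the metrics check presumes the default Prometheus endpoint exposed by TGI on the same port):

```shell
# Assumed reproduction setup: TGI 2.0.4 in Docker on a single L40S.
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v "$PWD/data:/data" \
    ghcr.io/huggingface/text-generation-inference:2.0.4 \
    --model-id TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-AWQ \
    --quantize awq \
    --max-input-length 16383 \
    --max-total-tokens 16384 \
    --max-batch-prefill-tokens 16384

# Watch the queue depth via the Prometheus metrics endpoint;
# under the reported bug this value climbs and never drops.
curl -s 127.0.0.1:8080/metrics | grep tgi_queue_size
```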

Expected behavior

No stutters.

QLutz avatar Jul 05 '24 12:07 QLutz