Queue size increases indefinitely
System Info
OS version: Linux
Model being used (`curl 127.0.0.1:8080/info | jq`): TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-AWQ
Hardware used (GPUs, how many, on which cloud) (`nvidia-smi`): 1x L40S
The current version being used: 2.0.4
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
Launch TGI with max_total_tokens=16384, max_batch_prefill_tokens=16384, max_input_length=16383, and quantize=awq (see the launch sketch below).
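For reference, a launch command along these lines matches that configuration; the image tag, port mapping, shm-size, and volume mount are assumptions on my part rather than copied from the actual deployment:

```shell
# Assumed Docker launch; only the model and the token/quantization flags
# come from the report above, the rest are typical defaults.
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:2.0.4 \
    --model-id TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-AWQ \
    --quantize awq \
    --max-input-length 16383 \
    --max-total-tokens 16384 \
    --max-batch-prefill-tokens 16384
```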
After a few hundred requests have been made, the pod starts returning empty packets a few seconds after each request. The requests are plain /generate calls (example below).
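For completeness, this is roughly the kind of request being sent; the prompt and max_new_tokens are placeholders:

```shell
# Placeholder request; the actual prompts and parameters vary.
curl 127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 256}}'
```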
Monitoring shows that tgi_queue_size increases steadily and never goes back down (checked as sketched below).
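The metric was read from the Prometheus endpoint; the polling interval here is arbitrary:

```shell
# Poll the queue-size gauge every 5 seconds; it keeps climbing and never
# returns to 0, even when no new requests are being sent.
watch -n 5 'curl -s 127.0.0.1:8080/metrics | grep tgi_queue_size'
```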
Expected behavior
No stutters: completed requests should leave the queue, so tgi_queue_size should drop back toward zero instead of growing indefinitely.