Queue size increases indefinitely
System Info
OS version: Linux
Model being used (`curl 127.0.0.1:8080/info | jq`): TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-AWQ
Hardware used (GPUs, how many, on which cloud) (`nvidia-smi`): 1x L40S
The current version being used: 2.0.4
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
Launch TGI with max_total_tokens=16384, max_batch_prefill_tokens=16384, max_input_length=16383, and quantize=awq (see the launch sketch below).
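For reference, a launch command along these lines matches that configuration; the image tag, port mapping, shm-size, and volume mount are assumptions on my part rather than copied from the actual deployment:

```shell
# Assumed Docker launch; only the model and the token/quantization flags
# come from the report above, the rest are typical defaults.
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:2.0.4 \
    --model-id TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-AWQ \
    --quantize awq \
    --max-input-length 16383 \
    --max-total-tokens 16384 \
    --max-batch-prefill-tokens 16384
```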
After a few hundred requests have been made, the pod starts returning empty packets a few seconds after each request. The requests are plain /generate calls (example below).
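For completeness, this is roughly the kind of request being sent; the prompt and max_new_tokens are placeholders:

```shell
# Placeholder request; the actual prompts and parameters vary.
curl 127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 256}}'
```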
Monitoring shows that tgi_queue_size increases steadily and never goes back down (checked as sketched below).
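The metric was read from the Prometheus endpoint; the polling interval here is arbitrary:

```shell
# Poll the queue-size gauge every 5 seconds; it keeps climbing and never
# returns to 0, even when no new requests are being sent.
watch -n 5 'curl -s 127.0.0.1:8080/metrics | grep tgi_queue_size'
```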
Expected behavior
No stutters: completed requests should leave the queue, so tgi_queue_size should drop back toward zero instead of growing indefinitely.