Continuous Batching Causes some Generations to Cut Short
System Info
- text-generation-inference: 0.6.0
- Model: Vicuna 13B
- OS: Bottlerocket OS
- Hardware: A10 GPU on EKS
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
We have a large number of concurrent users, and the continuous batching algorithm seems to cut some generations short during high traffic. I am trying to diagnose the cause: from my analysis, no stop tokens are present (the generation stops abruptly in the middle of a sentence). I am wondering whether something in the token budgeting algorithm might cause some generations to receive fewer generated tokens than others. A rough load-generation sketch is below.
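
For reference, a minimal sketch of the kind of concurrent load that surfaces the truncation, assuming a standard TGI `/generate` endpoint at `http://localhost:8080`; the prompt text, concurrency level, and `max_new_tokens` value are placeholders, and the `details` field names are taken from the TGI response schema as I understand it:

```python
import concurrent.futures
import requests

# Assumed TGI endpoint; adjust host/port for your deployment.
TGI_URL = "http://localhost:8080/generate"
MAX_NEW_TOKENS = 512
CONCURRENCY = 64


def generate(prompt: str) -> dict:
    # "details": True asks TGI to include finish_reason and generated_tokens.
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": MAX_NEW_TOKENS, "details": True},
    }
    resp = requests.post(TGI_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    prompts = [f"Write a long story about request {i}." for i in range(CONCURRENCY)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        results = list(pool.map(generate, prompts))
    for i, r in enumerate(results):
        details = r.get("details", {})
        print(i, details.get("finish_reason"), details.get("generated_tokens"))
```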
Expected behavior
Generations continue until an EOS token is generated or `max_new_tokens` is reached.
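
A quick sanity check against the responses collected above (field names again assumed from the TGI `details` payload) would flag any generation that stopped early without hitting EOS or the token budget:

```python
def is_suspicious(details: dict, max_new_tokens: int) -> bool:
    # Expected behavior: either the model emitted EOS, or it used the full budget.
    finish_reason = details.get("finish_reason")
    generated = details.get("generated_tokens", 0)
    if finish_reason == "eos_token":
        return False
    return generated < max_new_tokens
```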