Continuous Batching Causes some Generations to Cut Short
System Info
- text-generation-inference: 0.6.0
- Model: Vicuna 13B
- OS: Bottlerocket OS
- Hardware: A10 GPU on EKS
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
We have a large number of concurrent users, and the continuous batching algorithm seems to cut some generations short during high traffic. I am trying to diagnose the cause: from my analysis, no stop tokens are present (the generation stops abruptly in the middle of a sentence). I am wondering whether something in the token budgeting algorithm might cause some generations to receive fewer generated tokens than others. A rough load-generation sketch is below.
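
For reference, a minimal sketch of the kind of concurrent load that surfaces the truncation, assuming a standard TGI `/generate` endpoint at `http://localhost:8080`; the prompt text, concurrency level, and `max_new_tokens` value are placeholders, and the `details` field names are taken from the TGI response schema as I understand it:

```python
import concurrent.futures
import requests

# Assumed TGI endpoint; adjust host/port for your deployment.
TGI_URL = "http://localhost:8080/generate"
MAX_NEW_TOKENS = 512
CONCURRENCY = 64


def generate(prompt: str) -> dict:
    # "details": True asks TGI to include finish_reason and generated_tokens.
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": MAX_NEW_TOKENS, "details": True},
    }
    resp = requests.post(TGI_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    prompts = [f"Write a long story about request {i}." for i in range(CONCURRENCY)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        results = list(pool.map(generate, prompts))
    for i, r in enumerate(results):
        details = r.get("details", {})
        print(i, details.get("finish_reason"), details.get("generated_tokens"))
```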
Expected behavior
Generations continue until an EOS token is generated or `max_new_tokens` is reached.
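
A quick sanity check against the responses collected above (field names again assumed from the TGI `details` payload) would flag any generation that stopped early without hitting EOS or the token budget:

```python
def is_suspicious(details: dict, max_new_tokens: int) -> bool:
    # Expected behavior: either the model emitted EOS, or it used the full budget.
    finish_reason = details.get("finish_reason")
    generated = details.get("generated_tokens", 0)
    if finish_reason == "eos_token":
        return False
    return generated < max_new_tokens
```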