Max tokens limits responses on a given request_id

Open AlexCheema opened this issue 1 year ago • 0 comments

Right now, the `max_generate_tokens` option limits the total number of tokens a given request (identified by its `request_id`) can return. The desired behaviour is for it to limit the number of tokens in a given generation instead, so that a generation runs until we hit either the length limit or the generation limit.
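A minimal sketch of what a per-generation check could look like. Everything here is an assumption for illustration: `GenerationBudget`, `max_sequence_length` and `prompt_length` are hypothetical names and not part of the existing code. The sketch also assumes that "length limit" means a cap on prompt plus generated tokens, and that "generation limit" means `max_generate_tokens`.

```python
from dataclasses import dataclass


@dataclass
class GenerationBudget:
    """Hypothetical per-generation counter; not the project's actual API."""

    max_generate_tokens: int   # cap on tokens produced by this generation
    max_sequence_length: int   # assumed cap on prompt + generated tokens
    prompt_length: int         # number of tokens already in the prompt
    generated: int = 0         # tokens produced so far in this generation

    def should_stop(self) -> bool:
        # Stop as soon as either limit is reached.
        hit_generation_limit = self.generated >= self.max_generate_tokens
        hit_length_limit = (
            self.prompt_length + self.generated >= self.max_sequence_length
        )
        return hit_generation_limit or hit_length_limit

    def record_token(self) -> bool:
        """Count one newly generated token; return True if we should stop."""
        self.generated += 1
        return self.should_stop()
```

Because the counter belongs to a single generation rather than to the `request_id`, a fresh `GenerationBudget` would be created each time a generation starts.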

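We should refactor the buffered token output broadcasts to send only the deltas (the newly generated tokens) rather than the whole buffer. This also fixes the n^2 bandwidth problem, where resending everything generated so far on each update makes total traffic grow quadratically with the number of tokens, and it should make the per-generation limit easier to implement.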
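To illustrate the bandwidth difference, here is a small sketch comparing the two broadcast strategies. It assumes one broadcast per generated token, and all function and class names are hypothetical rather than taken from the codebase.

```python
from typing import List


def full_buffer_broadcasts(tokens: List[int]) -> List[List[int]]:
    """Each broadcast resends every token generated so far."""
    return [tokens[: i + 1] for i in range(len(tokens))]


def delta_broadcasts(tokens: List[int]) -> List[List[int]]:
    """Each broadcast carries only the token added since the previous one."""
    return [[tok] for tok in tokens]


def tokens_sent(broadcasts: List[List[int]]) -> int:
    """Total number of tokens put on the wire across all broadcasts."""
    return sum(len(b) for b in broadcasts)


class DeltaReceiver:
    """Hypothetical receiver that rebuilds the full output from deltas."""

    def __init__(self) -> None:
        self.buffer: List[int] = []

    def on_delta(self, delta: List[int]) -> None:
        self.buffer.extend(delta)


tokens = list(range(10))
full_cost = tokens_sent(full_buffer_broadcasts(tokens))   # 1 + 2 + ... + 10
delta_cost = tokens_sent(delta_broadcasts(tokens))        # one token per broadcast

receiver = DeltaReceiver()
for delta in delta_broadcasts(tokens):
    receiver.on_delta(delta)
```

For n tokens the full-buffer strategy sends n(n+1)/2 tokens in total, while the delta strategy sends n, and the receiver can still reconstruct the complete output by appending each delta. With deltas, each node also sees exactly how many new tokens arrived, which is the count a per-generation limit needs.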

Aug 07 '24 11:08 AlexCheema