exo
Max tokens limits responses on a given request_id
Right now, the max_generate_tokens option limits the total number of tokens that can be returned for a given request_id.
The desired behaviour is for it to limit the number of tokens produced in a single generation, so that a generation runs until it hits either the length limit or the generation limit.
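As a rough illustration of the desired behaviour, here is a minimal sketch of a per-generation token budget. Only max_generate_tokens comes from this issue; the GenerationBudget class, max_sequence_length, prompt_length and the stop-reason strings are hypothetical names made up for the example, not exo's actual API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GenerationBudget:
    """Hypothetical per-generation counter; create a fresh one for each generation."""
    max_generate_tokens: int   # cap on tokens produced by this generation
    max_sequence_length: int   # assumed overall length limit (prompt + output)
    prompt_length: int         # tokens already in the sequence before generating
    generated: int = 0         # tokens produced so far in this generation

    def record_token(self) -> None:
        # Call once for every token the generation emits.
        self.generated += 1

    def stop_reason(self) -> Optional[str]:
        # Stop on whichever limit is reached first; None means keep going.
        if self.generated >= self.max_generate_tokens:
            return "generation_limit"
        if self.prompt_length + self.generated >= self.max_sequence_length:
            return "length_limit"
        return None


# Usage: the counter belongs to one generation, not to the request_id as a whole.
budget = GenerationBudget(max_generate_tokens=3, max_sequence_length=100, prompt_length=10)
while budget.stop_reason() is None:
    budget.record_token()
```

Because the counter lives on the generation, a later generation for the same request_id would start again from zero instead of inheriting the earlier count.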
We should refactor the buffered token output broadcasts to send only the deltas rather than the whole buffer each time. This also fixes the n^2 bandwidth problem and should make the change above easier.
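To make the delta idea concrete, here is a minimal sketch, assuming the node keeps a growing token buffer per request_id. DeltaBroadcaster, full_buffer_cost and delta_cost are hypothetical names for illustration only and are not taken from the exo codebase.

```python
from typing import Dict, List


class DeltaBroadcaster:
    """Hypothetical helper that remembers how many tokens were already sent per request_id."""

    def __init__(self) -> None:
        self._sent: Dict[str, int] = {}

    def delta(self, request_id: str, buffered_tokens: List[int]) -> List[int]:
        # Return only the tokens appended since the previous broadcast.
        already_sent = self._sent.get(request_id, 0)
        new_tokens = buffered_tokens[already_sent:]
        self._sent[request_id] = len(buffered_tokens)
        return new_tokens


def full_buffer_cost(n: int) -> int:
    # Re-sending the whole buffer after each token sends 1 + 2 + ... + n tokens.
    return n * (n + 1) // 2


def delta_cost(n: int) -> int:
    # Sending only deltas sends each token exactly once.
    return n


broadcaster = DeltaBroadcaster()
first = broadcaster.delta("req-1", [11])
second = broadcaster.delta("req-1", [11, 12, 13])
third = broadcaster.delta("req-1", [11, 12, 13])
```

With deltas, the receiver can also count tokens per generation by summing the delta lengths, which is one way this refactor could make the limit change easier.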