Limiting tokens per minute (TPM) doesn't work
I want to cap tokens per minute (TPM) at under 30,000 to avoid errors like:
"litellm.exceptions.RateLimitError: litellm.RateLimitError: RateLimitError: OpenAIException - Request too large for gpt-5 in organization org-123456 on tokens per min (TPM): Limit 30000, Requested 36055. The input or output tokens must be reduced in order to run successfully."
I set the input and output TPM limits for the utility and chat models as low as 1,000 tokens each, but I am still running into the error. Is this a bug? Or should this be a feature request: how can TPM and RPM limits be adjusted for an API key?
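For reference, this is roughly the kind of client-side guard I'm imagining: estimate the prompt size before sending and trim old history until it fits under the provider cap. It's only a sketch; the tiktoken count is approximate for newer models, and the `trim_history` helper is my own illustration, not something the app exposes.

```python
# Rough sketch only: cl100k_base is an approximation for newer models, and this
# ignores the small per-message formatting overhead the API adds.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(messages):
    """Approximate token count for a list of chat messages."""
    return sum(len(enc.encode(m["content"])) for m in messages)

def trim_history(messages, budget=27_000):
    """Keep the system prompt (first message) and drop the oldest turns
    until the estimated prompt fits under the budget
    (30,000 TPM cap minus ~10% headroom)."""
    system, history = messages[:1], messages[1:]
    while history and count_tokens(system + history) > budget:
        history.pop(0)
    return system + history
```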
I believe the devs are aware of some of this; a lot depends on each individual setup.

What is the context length set to? How much space is reserved for chat history? What are the requests per minute?

The math behind tokens per minute, context window, and the percentage of the window allocated leaves room for confusion, and there are times when it just doesn't add up.

And not just over by a few tokens, either (yes, it is a good idea to set your limit 5-10 percent lower than the real cap).

A requests-per-minute limit can help until this gets fixed.

The per-minute calculations don't take everything into account, but a requests-per-minute limit combined with the context length and context window settings lets you break the problem down and keep it under control. It's duct tape for the time being; the rough math is sketched below.
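Something like this is the duct-tape math I mean. All the numbers here are made up, plug in your own settings:

```python
# Back-of-the-envelope numbers, not anything measured:
TPM_LIMIT = 30_000           # the provider cap from the error message
HEADROOM = 0.90              # stay ~10% under it, since the counters never line up exactly
PER_REQUEST_TOKENS = 8_000   # context length you allow + expected output, per request

safe_rpm = int(TPM_LIMIT * HEADROOM) // PER_REQUEST_TOKENS
print(safe_rpm)  # 3 requests per minute at these example settings
```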
Important question: how large is your system prompt?

Open a new chat, say hi, and wait for the response. Click to open the context window. How many tokens does it show in a fresh conversation?

Subtract that number from your budget to get your usable space, then do the math.
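For example (numbers invented, just to show the subtraction; use whatever your fresh chat actually reports):

```python
# Made-up numbers -- replace with what your own context-window popup shows.
TPM_LIMIT = 30_000          # provider cap
HEADROOM = 0.90             # keep ~10% of slack
FRESH_CHAT_TOKENS = 2_400   # what the popup reports right after a bare "hi"

usable = int(TPM_LIMIT * HEADROOM) - FRESH_CHAT_TOKENS
print(usable)  # 24600 tokens left for chat history, your message, and the reply
```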