
[Bug]: Router not respecting TPM limits in concurrent async calls

Open whitead opened this issue 1 year ago • 29 comments

What happened?

I'm trying to test whether Routers respect TPM limits on models when called async, and it doesn't seem to be working. Here are the steps to reproduce:

  1. Make an OpenAI project and set its RPM/TPM to match what is in the script below. I used 500 RPM / 30,000 TPM to represent a tier 1 account, but you can lower them to reduce the cost of the repro.
  2. Make a key for that project and set it as OPENAI_API_KEY.
  3. Execute the script below:
import asyncio
import random

from litellm import Router

pre_fill = """


Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ut finibus massa. Quisque a magna magna. Quisque neque diam, varius sit amet tellus eu, elementum fermentum sapien. Integer ut erat eget arcu rutrum blandit. Morbi a metus purus. Nulla porta, urna at finibus malesuada, velit ante suscipit orci, vitae laoreet dui ligula ut augue. Cras elementum pretium dui, nec luctus nulla aliquet ut. Nam faucibus, diam nec semper interdum, nisl nisi viverra nulla, vitae sodales elit ex a purus. Donec tristique malesuada lobortis. Donec posuere iaculis nisl, vitae accumsan libero dignissim dignissim. Suspendisse finibus leo et ex mattis tempor. Praesent at nisl vitae quam egestas lacinia. Donec in justo non erat aliquam accumsan sed vitae ex. Vivamus gravida diam vel ipsum tincidunt dignissim.

Cras vitae efficitur tortor. Curabitur vel erat mollis, euismod diam quis, consequat nibh. Ut vel est eu nulla euismod finibus. Aliquam euismod at risus quis dignissim. Integer non auctor massa. Nullam vitae aliquet mauris. Etiam risus enim, dignissim ut volutpat eget, pulvinar ac augue. Mauris elit est, ultricies vel convallis at, rhoncus nec elit. Aenean ornare maximus orci, ut maximus felis cursus venenatis. Nulla facilisi.

Maecenas aliquet ante massa, at ullamcorper nibh dictum quis. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Quisque id egestas justo. Suspendisse fringilla in massa in consectetur. Quisque scelerisque egestas lacus at posuere. Vestibulum dui sem, bibendum vehicula ultricies vel, blandit id nisi. Curabitur ullamcorper semper metus, vitae commodo magna. Nulla mi metus, suscipit in neque vitae, porttitor pharetra erat. Vestibulum libero velit, congue in diam non, efficitur suscipit diam. Integer arcu velit, fermentum vel tortor sit amet, venenatis rutrum felis. Donec ultricies enim sit amet iaculis mattis.

Integer at purus posuere, malesuada tortor vitae, mattis nibh. Mauris ex quam, tincidunt et fermentum vitae, iaculis non elit. Nullam dapibus non nisl ac sagittis. Duis lacinia eros iaculis lectus consectetur vehicula. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Interdum et malesuada fames ac ante ipsum primis in faucibus. Ut cursus semper est, vel interdum turpis ultrices dictum. Suspendisse posuere lorem et accumsan ultrices. Duis sagittis bibendum consequat. Ut convallis vestibulum enim, non dapibus est porttitor et. Quisque suscipit pulvinar turpis, varius tempor turpis. Vestibulum semper dui nunc, vel vulputate elit convallis quis. Fusce aliquam enim nulla, eu congue nunc tempus eu.

Nam vitae finibus eros, eu eleifend erat. Maecenas hendrerit magna quis molestie dictum. Ut consequat quam eu massa auctor pulvinar. Pellentesque vitae eros ornare urna accumsan tempor. Maecenas porta id quam at sodales. Donec quis accumsan leo, vel viverra nibh. Vestibulum congue blandit nulla, sed rhoncus libero eleifend ac. In risus lorem, rutrum et tincidunt a, interdum a lectus. Pellentesque aliquet pulvinar mauris, ut ultrices nibh ultricies nec. Mauris mi mauris, facilisis nec metus non, egestas luctus ligula. Quisque ac ligula at felis mollis blandit id nec risus. Nam sollicitudin lacus sed sapien fringilla ullamcorper. Etiam dui quam, posuere sit amet velit id, aliquet molestie ante. Integer cursus eget sapien fringilla elementum. Integer molestie, mi ac scelerisque ultrices, nunc purus condimentum est, in posuere quam nibh vitae velit.
"""


async def test(router):
    # random is to break caching
    completion = await router.acompletion(
        "gpt-4o-2024-08-06",
        [
            {
                "role": "user",
                "content": f"{pre_fill * 3}\n\nRecite the Declaration of independence at a speed of {random.random() * 100} words per minute.",
            }
        ],
        stream=True,
        temperature=0.0,
        stream_options={"include_usage": True},
    )

    async for chunk in completion:
        pass
    print("done", chunk)


async def main():
    router = Router(
        model_list=[
            {
                "model_name": "gpt-4o-2024-08-06",
                "litellm_params": {"model": "gpt-4o-2024-08-06", "temperature": 0.0},
                "rpm": 500,
                "tpm": 30000,
            }
        ],
    )
    await asyncio.gather(*[test(router) for _ in range(16)])


if __name__ == "__main__":
    asyncio.run(main())

From my understanding of the documentation, the Router should respect the tpm/rpm limits even when multiple async calls are in flight.

Relevant log output

Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm.set_verbose=True'.
(the two lines above were repeated five times in the output)

Traceback (most recent call last):
     ....truncated...
    raise self._make_status_error_from_response(err.response) from None
openai.RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in project proj_jzNCwszQ4Zq0tro9C0lBYucG organization org-RS0043BOXejyTcsf1iSYXVXC on tokens per min (TPM): Limit 30000, Used 29281, Requested 2664. Please try again in 3.89s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}

whitead avatar Sep 19 '24 06:09 whitead

I am also running into this issue.

derspotter avatar Sep 20 '24 00:09 derspotter

hi @whitead @derspotter - please use the following in your router settings if you want the litellm router to enforce tpm/rpm checks. Doc: https://docs.litellm.ai/docs/routing#advanced---routing-strategies-%EF%B8%8F

 routing_strategy="usage-based-routing-v2" # 👈 KEY CHANGE
 enable_pre_call_check=True, # enables router rate limits for concurrent calls
import os

from litellm import Router


model_list = [{ # list of model deployments 
    "model_name": "gpt-3.5-turbo", # model alias 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "azure/chatgpt-v-2", # actual model name
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }, 
    "tpm": 100000,
    "rpm": 10000,
}, {
    "model_name": "gpt-3.5-turbo", 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "azure/chatgpt-functioncalling", 
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    },
    "tpm": 100000,
    "rpm": 1000,
}, {
    "model_name": "gpt-3.5-turbo", 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "gpt-3.5-turbo", 
        "api_key": os.getenv("OPENAI_API_KEY"),
    },
    "tpm": 100000,
    "rpm": 1000,
}]
router = Router(model_list=model_list, 
                redis_host=os.environ["REDIS_HOST"], 
                redis_password=os.environ["REDIS_PASSWORD"], 
                redis_port=os.environ["REDIS_PORT"], 
                routing_strategy="usage-based-routing-v2", # 👈 KEY CHANGE
                enable_pre_call_check=True, # enables router rate limits for concurrent calls
                )

response = await router.acompletion(model="gpt-3.5-turbo", 
                messages=[{"role": "user", "content": "Hey, how's it going?"}])

print(response)

ishaan-jaff avatar Sep 20 '24 00:09 ishaan-jaff

Hi @ishaan-jaff - I tried this variation:

    router = Router(
        routing_strategy="usage-based-routing-v2",
        enable_pre_call_checks=True, 
        model_list=[
            {
                "model_name": "gpt-4o-2024-08-06",
                "litellm_params": {"model": "gpt-4o-2024-08-06", "temperature": 0.0},
                "rpm": 500,
                "tpm": 30000,
            }
        ],
    )

and it gave the same behavior (failed due to rate limit errors). Do note that the docs need to change the name of that argument - it is enable_pre_call_checks not enable_pre_call_check.

Is it the case that I need to have a redis instance to use rate limiting?

whitead avatar Sep 20 '24 03:09 whitead

@whitead redis is only needed if you're across multiple instances

krrishdholakia avatar Sep 20 '24 16:09 krrishdholakia

Thanks @krrishdholakia - then can you confirm if what I'm seeing is a bug or have I misconfigured or misunderstood the Router behavior?

whitead avatar Sep 20 '24 17:09 whitead

👍 testing locally to see if i can repro the issue @whitead

krrishdholakia avatar Sep 20 '24 20:09 krrishdholakia

immediate issue:

  • your tpm is 30000, we convert the tpm to an estimated rpm (if no rpm is given) using the azure formula of tpm/6, so the 'allowed' max parallel requests for this instance would be tpm/6 = 5000 https://github.com/BerriAI/litellm/blob/3933fba41fc51c2495fb1f4ce4791405e2b13968/litellm/utils.py#L3983

  • you're testing with 16 queries, when the max allowed for the model based on your input is 500 (given the rpm you set)

krrishdholakia avatar Sep 20 '24 20:09 krrishdholakia

If max_parallel_requests is not set, then we use the rpm as the upper bound for concurrent requests; if tpm is given but no rpm, then we do tpm/6 (azure formula) to approximate the rpm for the model.

max_parallel_requests = max_parallel_requests or rpm or tpm/6 

Requests are kept within a semaphore, to make sure no more than the max allowed are made concurrently. Requests will only fail if they time out.
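
A minimal sketch of that precedence and the semaphore gating (illustrative only, with made-up helper names, not the Router's internal code):

import asyncio

# hypothetical sketch of the logic described above, not litellm internals
def derive_max_parallel_requests(max_parallel_requests=None, rpm=None, tpm=None):
    # precedence: explicit max_parallel_requests > rpm > tpm/6 (azure estimate)
    if max_parallel_requests is not None:
        return max_parallel_requests
    if rpm is not None:
        return rpm
    if tpm is not None:
        return tpm // 6
    return None


semaphore = asyncio.Semaphore(derive_max_parallel_requests(tpm=30000))  # -> 5000


async def guarded_call(make_request):
    # bounds concurrency only; it does not account for tokens per minute
    async with semaphore:
        return await make_request()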

krrishdholakia avatar Sep 20 '24 20:09 krrishdholakia

Example working code: (note how the requests are done in-order)

import asyncio
import os

#### What this tests ####
#    This tests caching on the router
import sys
import time
import traceback
from typing import Dict
from unittest.mock import MagicMock, PropertyMock, patch

import pytest
from openai.lib.azure import OpenAIError

sys.path.insert(
    0, os.path.abspath("../..")
)  # Adds the parent directory to the system path
import asyncio
import random

from litellm import Router

pre_fill = """


Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ut finibus massa. Quisque a magna magna. Quisque neque diam, varius sit amet tellus eu, elementum fermentum sapien. Integer ut erat eget arcu rutrum blandit. Morbi a metus purus. Nulla porta, urna at finibus malesuada, velit ante suscipit orci, vitae laoreet dui ligula ut augue. Cras elementum pretium dui, nec luctus nulla aliquet ut. Nam faucibus, diam nec semper interdum, nisl nisi viverra nulla, vitae sodales elit ex a purus. Donec tristique malesuada lobortis. Donec posuere iaculis nisl, vitae accumsan libero dignissim dignissim. Suspendisse finibus leo et ex mattis tempor. Praesent at nisl vitae quam egestas lacinia. Donec in justo non erat aliquam accumsan sed vitae ex. Vivamus gravida diam vel ipsum tincidunt dignissim.

Cras vitae efficitur tortor. Curabitur vel erat mollis, euismod diam quis, consequat nibh. Ut vel est eu nulla euismod finibus. Aliquam euismod at risus quis dignissim. Integer non auctor massa. Nullam vitae aliquet mauris. Etiam risus enim, dignissim ut volutpat eget, pulvinar ac augue. Mauris elit est, ultricies vel convallis at, rhoncus nec elit. Aenean ornare maximus orci, ut maximus felis cursus venenatis. Nulla facilisi.

Maecenas aliquet ante massa, at ullamcorper nibh dictum quis. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Quisque id egestas justo. Suspendisse fringilla in massa in consectetur. Quisque scelerisque egestas lacus at posuere. Vestibulum dui sem, bibendum vehicula ultricies vel, blandit id nisi. Curabitur ullamcorper semper metus, vitae commodo magna. Nulla mi metus, suscipit in neque vitae, porttitor pharetra erat. Vestibulum libero velit, congue in diam non, efficitur suscipit diam. Integer arcu velit, fermentum vel tortor sit amet, venenatis rutrum felis. Donec ultricies enim sit amet iaculis mattis.

Integer at purus posuere, malesuada tortor vitae, mattis nibh. Mauris ex quam, tincidunt et fermentum vitae, iaculis non elit. Nullam dapibus non nisl ac sagittis. Duis lacinia eros iaculis lectus consectetur vehicula. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Interdum et malesuada fames ac ante ipsum primis in faucibus. Ut cursus semper est, vel interdum turpis ultrices dictum. Suspendisse posuere lorem et accumsan ultrices. Duis sagittis bibendum consequat. Ut convallis vestibulum enim, non dapibus est porttitor et. Quisque suscipit pulvinar turpis, varius tempor turpis. Vestibulum semper dui nunc, vel vulputate elit convallis quis. Fusce aliquam enim nulla, eu congue nunc tempus eu.

Nam vitae finibus eros, eu eleifend erat. Maecenas hendrerit magna quis molestie dictum. Ut consequat quam eu massa auctor pulvinar. Pellentesque vitae eros ornare urna accumsan tempor. Maecenas porta id quam at sodales. Donec quis accumsan leo, vel viverra nibh. Vestibulum congue blandit nulla, sed rhoncus libero eleifend ac. In risus lorem, rutrum et tincidunt a, interdum a lectus. Pellentesque aliquet pulvinar mauris, ut ultrices nibh ultricies nec. Mauris mi mauris, facilisis nec metus non, egestas luctus ligula. Quisque ac ligula at felis mollis blandit id nec risus. Nam sollicitudin lacus sed sapien fringilla ullamcorper. Etiam dui quam, posuere sit amet velit id, aliquet molestie ante. Integer cursus eget sapien fringilla elementum. Integer molestie, mi ac scelerisque ultrices, nunc purus condimentum est, in posuere quam nibh vitae velit.
"""


async def test(router, idx: int):
    # random is to break caching
    completion = await router.acompletion(
        "gpt-4o-2024-08-06",
        [
            {
                "role": "user",
                "content": f"{pre_fill * 3}\n\nRecite the Declaration of independence at a speed of {random.random() * 100} words per minute.",
            }
        ],
        stream=True,
        temperature=0.0,
        stream_options={"include_usage": True},
    )

    async for chunk in completion:
        pass
    print("{}: done".format(idx))


async def main():
    router = Router(
        model_list=[
            {
                "model_name": "gpt-4o-2024-08-06",
                "litellm_params": {
                    "model": "gpt-4o-2024-08-06",
                    "temperature": 0.0,
                    "rpm": 500,
                    "tpm": 30000,
                    "max_parallel_requests": 1,
                },
            }
        ],
    )
    await asyncio.gather(*[test(router, idx=idx) for idx in range(16)])


if __name__ == "__main__":
    asyncio.run(main())

krrishdholakia avatar Sep 20 '24 20:09 krrishdholakia

Hi @krrishdholakia - thanks for taking the time! So, if I'm reading your response correctly - the tpm field is only used to create an rpm field if missing? That tpm doesn't result in actual accounting of the tokens when doing rate limiting?

whitead avatar Sep 20 '24 20:09 whitead

not correct @whitead

tpm is used for rate limiting. Your issue is with max concurrent limits - those use the max_parallel_requests param if set, else rpm, else an estimate of rpm.

krrishdholakia avatar Sep 20 '24 20:09 krrishdholakia

I'm not actually getting errors related to RPMs or concurrency. All my errors are about TPM usage. I tried restricting to one parallel call: here's the code I tried:

import asyncio
import random

from litellm import Router

pre_fill = """


Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ut finibus massa. Quisque a magna magna. Quisque neque diam, varius sit amet tellus eu, elementum fermentum sapien. Integer ut erat eget arcu rutrum blandit. Morbi a metus purus. Nulla porta, urna at finibus malesuada, velit ante suscipit orci, vitae laoreet dui ligula ut augue. Cras elementum pretium dui, nec luctus nulla aliquet ut. Nam faucibus, diam nec semper interdum, nisl nisi viverra nulla, vitae sodales elit ex a purus. Donec tristique malesuada lobortis. Donec posuere iaculis nisl, vitae accumsan libero dignissim dignissim. Suspendisse finibus leo et ex mattis tempor. Praesent at nisl vitae quam egestas lacinia. Donec in justo non erat aliquam accumsan sed vitae ex. Vivamus gravida diam vel ipsum tincidunt dignissim.

Cras vitae efficitur tortor. Curabitur vel erat mollis, euismod diam quis, consequat nibh. Ut vel est eu nulla euismod finibus. Aliquam euismod at risus quis dignissim. Integer non auctor massa. Nullam vitae aliquet mauris. Etiam risus enim, dignissim ut volutpat eget, pulvinar ac augue. Mauris elit est, ultricies vel convallis at, rhoncus nec elit. Aenean ornare maximus orci, ut maximus felis cursus venenatis. Nulla facilisi.

Maecenas aliquet ante massa, at ullamcorper nibh dictum quis. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Quisque id egestas justo. Suspendisse fringilla in massa in consectetur. Quisque scelerisque egestas lacus at posuere. Vestibulum dui sem, bibendum vehicula ultricies vel, blandit id nisi. Curabitur ullamcorper semper metus, vitae commodo magna. Nulla mi metus, suscipit in neque vitae, porttitor pharetra erat. Vestibulum libero velit, congue in diam non, efficitur suscipit diam. Integer arcu velit, fermentum vel tortor sit amet, venenatis rutrum felis. Donec ultricies enim sit amet iaculis mattis.

Integer at purus posuere, malesuada tortor vitae, mattis nibh. Mauris ex quam, tincidunt et fermentum vitae, iaculis non elit. Nullam dapibus non nisl ac sagittis. Duis lacinia eros iaculis lectus consectetur vehicula. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Interdum et malesuada fames ac ante ipsum primis in faucibus. Ut cursus semper est, vel interdum turpis ultrices dictum. Suspendisse posuere lorem et accumsan ultrices. Duis sagittis bibendum consequat. Ut convallis vestibulum enim, non dapibus est porttitor et. Quisque suscipit pulvinar turpis, varius tempor turpis. Vestibulum semper dui nunc, vel vulputate elit convallis quis. Fusce aliquam enim nulla, eu congue nunc tempus eu.

Nam vitae finibus eros, eu eleifend erat. Maecenas hendrerit magna quis molestie dictum. Ut consequat quam eu massa auctor pulvinar. Pellentesque vitae eros ornare urna accumsan tempor. Maecenas porta id quam at sodales. Donec quis accumsan leo, vel viverra nibh. Vestibulum congue blandit nulla, sed rhoncus libero eleifend ac. In risus lorem, rutrum et tincidunt a, interdum a lectus. Pellentesque aliquet pulvinar mauris, ut ultrices nibh ultricies nec. Mauris mi mauris, facilisis nec metus non, egestas luctus ligula. Quisque ac ligula at felis mollis blandit id nec risus. Nam sollicitudin lacus sed sapien fringilla ullamcorper. Etiam dui quam, posuere sit amet velit id, aliquet molestie ante. Integer cursus eget sapien fringilla elementum. Integer molestie, mi ac scelerisque ultrices, nunc purus condimentum est, in posuere quam nibh vitae velit.
"""


async def test(router):
    completion = await router.acompletion(
        "gpt-4o-2024-08-06",
        [
            {
                "role": "user",
                "content": f"{pre_fill * 3}\n\nRecite the Declaration of independence at a speed of {random.random() * 100} words per minute.",
            }
        ],
        stream=True,
        temperature=0.0,
        stream_options={"include_usage": True},
    )

    async for chunk in completion:
        pass
    print("done", chunk)


async def main():
    router = Router(
        routing_strategy="usage-based-routing-v2",
        enable_pre_call_checks=True, 
        model_list=[
            {
                "model_name": "gpt-4o-2024-08-06",
                "litellm_params": {"model": "gpt-4o-2024-08-06", "temperature": 0.0},
                "rpm": 500,
                "max_parallel_requests": 1,
                "should_pre_call_checks": True,
                "tpm": 30000,
            }
        ],
    )
    await asyncio.gather(*[test(router) for _ in range(16)])


if __name__ == "__main__":
    asyncio.run(main())

and I receive the following output:

litellm.exceptions.RateLimitError: litellm.RateLimitError: RateLimitError: OpenAIException - Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in project xxx organization org-xxxx on tokens per min (TPM): Limit 30000, Used 28635, Requested 2664. Please try again in 2.598s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}
Received Model Group=gpt-4o-2024-08-06
Available Model Group Fallbacks=None LiteLLM Retried: 1 times, LiteLLM Max Retries: 2

whitead avatar Sep 20 '24 21:09 whitead

@whitead your model is hitting its rate limits. It sounds like what you're looking for is something to keep checking if a model is now below its rate limits, and then make the request.

This should solve the issue - https://docs.litellm.ai/docs/scheduler#quick-start

Key changes:

  • define a polling_interval in your Router init
  • set a request priority (i can also enable a 'default_priority' to make this simpler)
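
A minimal sketch of those two changes, based on the linked scheduler docs (the polling_interval and priority parameter names come from those docs; treat this as an illustration rather than a verified snippet):

import asyncio

from litellm import Router

# sketch based on the scheduler quick-start linked above
router = Router(
    model_list=[
        {
            "model_name": "gpt-4o-2024-08-06",
            "litellm_params": {"model": "gpt-4o-2024-08-06"},
            "rpm": 500,
            "tpm": 30000,
        }
    ],
    polling_interval=0.03,  # seconds between checks of whether a queued request can run
)


async def main():
    # priority=0 is the highest priority; lower-priority requests wait in the queue
    response = await router.acompletion(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": "Hey, how's it going?"}],
        priority=0,
    )
    print(response)


asyncio.run(main())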

Can you let me know if this solves your problem?

krrishdholakia avatar Sep 21 '24 20:09 krrishdholakia

should_pre_call_checks

also not sure what this is

krrishdholakia avatar Sep 21 '24 20:09 krrishdholakia

Curious - would you expect the Router to just do this retry by default?

krrishdholakia avatar Sep 21 '24 21:09 krrishdholakia

On our end, we can:

  • check tpm limits (prev. was just rpm limits) for concurrent calls.
  • retry subsequent requests (waiting till model is below tpm limits)

krrishdholakia avatar Sep 21 '24 21:09 krrishdholakia

looks like we should have done the retry after 2.598s

krrishdholakia avatar Sep 21 '24 22:09 krrishdholakia

trying to simulate and confirm this should work as expected. fixed a bug where instant retries were happening even for single-model model groups

krrishdholakia avatar Sep 21 '24 22:09 krrishdholakia

For TPM, if we want to keep the usage-update logic in a parallel thread (so we don't slow down actual completion calls for this), then the next best approach is to read the retry time received from openai and use it correctly.

Done here - https://github.com/BerriAI/litellm/commit/bba003cd9a52d12490026ea5ab31240d02cd13ae
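
For illustration only (this is not the linked commit's implementation), a sketch of honoring the server-suggested retry time from a 429, assuming the value is exposed via the exception's response headers:

import asyncio

import openai


async def call_with_retry_after(make_request, max_retries=2):
    # make_request is a hypothetical zero-argument coroutine factory
    for attempt in range(max_retries + 1):
        try:
            return await make_request()
        except openai.RateLimitError as e:
            if attempt == max_retries:
                raise
            # sleep for the server-suggested Retry-After, defaulting to 1s
            header = e.response.headers.get("retry-after", "")
            delay = float(header) if header.replace(".", "", 1).isdigit() else 1.0
            await asyncio.sleep(delay)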

krrishdholakia avatar Sep 21 '24 23:09 krrishdholakia

Hi @krrishdholakia, thanks for your work here last night. Pulling in the latest LiteLLM (litellm==1.47.1, or the current main at https://github.com/BerriAI/litellm/tree/16c8549b773d3b4ffcb986101dc268fead84ae60) did not fix this issue, though - can you reopen it?

From https://platform.openai.com/docs/guides/rate-limits/tier-1-rate-limits:

Model  | RPM | TPM
gpt-4o | 500 | 30,000

So running the original post's script and specifying the Router like so:

    router = Router(
        routing_strategy="usage-based-routing-v2",
        enable_pre_call_checks=True,
        model_list=[
            {
                "model_name": "gpt-4o-2024-08-06",
                "litellm_params": {"model": "gpt-4o-2024-08-06", "temperature": 0.0},
                "rpm": 500,
                "max_parallel_requests": 1,
                "tpm": 30000,
            }
        ],
    )

We hit the rate limits:

Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm.set_verbose=True'.
(the two lines above were repeated five times in the output)

done ModelResponse(id='chatcmpl-abc123', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=None, role=None, function_call=None, tool_calls=None), logprobs=None)], created=1727033797, model='gpt-4o-2024-08-06', object='chat.completion.chunk', system_fingerprint='fp_5050236cbd', usage=Usage(completion_tokens=41, prompt_tokens=2456, total_tokens=2497, completion_tokens_details=None))
done ModelResponse(id='chatcmpl-abc123', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=None, role=None, function_call=None, tool_calls=None), logprobs=None)], created=1727033797, model='gpt-4o-2024-08-06', object='chat.completion.chunk', system_fingerprint='fp_5050236cbd', usage=Usage(completion_tokens=56, prompt_tokens=2456, total_tokens=2512, completion_tokens_details=None))
done ModelResponse(id='chatcmpl-abc123', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=None, role=None, function_call=None, tool_calls=None), logprobs=None)], created=1727033797, model='gpt-4o-2024-08-06', object='chat.completion.chunk', system_fingerprint='fp_5050236cbd', usage=Usage(completion_tokens=41, prompt_tokens=2456, total_tokens=2497, completion_tokens_details=None))
done ModelResponse(id='chatcmpl-abc123', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=None, role=None, function_call=None, tool_calls=None), logprobs=None)], created=1727033797, model='gpt-4o-2024-08-06', object='chat.completion.chunk', system_fingerprint='fp_5050236cbd', usage=Usage(completion_tokens=41, prompt_tokens=2456, total_tokens=2497, completion_tokens_details=None))
done ModelResponse(id='chatcmpl-abc123', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=None, role=None, function_call=None, tool_calls=None), logprobs=None)], created=1727033797, model='gpt-4o-2024-08-06', object='chat.completion.chunk', system_fingerprint='fp_5050236cbd', usage=Usage(completion_tokens=41, prompt_tokens=2456, total_tokens=2497, completion_tokens_details=None))
done ModelResponse(id='chatcmpl-abc123', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=None, role=None, function_call=None, tool_calls=None), logprobs=None)], created=1727033797, model='gpt-4o-2024-08-06', object='chat.completion.chunk', system_fingerprint='fp_5050236cbd', usage=Usage(completion_tokens=56, prompt_tokens=2456, total_tokens=2512, completion_tokens_details=None))
done ModelResponse(id='chatcmpl-abc123', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=None, role=None, function_call=None, tool_calls=None), logprobs=None)], created=1727033797, model='gpt-4o-2024-08-06', object='chat.completion.chunk', system_fingerprint='fp_5050236cbd', usage=Usage(completion_tokens=53, prompt_tokens=2456, total_tokens=2509, completion_tokens_details=None))
done ModelResponse(id='chatcmpl-abc123', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=None, role=None, function_call=None, tool_calls=None), logprobs=None)], created=1727033797, model='gpt-4o-2024-08-06', object='chat.completion.chunk', system_fingerprint='fp_5050236cbd', usage=Usage(completion_tokens=56, prompt_tokens=2456, total_tokens=2512, completion_tokens_details=None))
done ModelResponse(id='chatcmpl-abc123', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=None, role=None, function_call=None, tool_calls=None), logprobs=None)], created=1727033797, model='gpt-4o-2024-08-06', object='chat.completion.chunk', system_fingerprint='fp_5050236cbd', usage=Usage(completion_tokens=53, prompt_tokens=2456, total_tokens=2509, completion_tokens_details=None))
done ModelResponse(id='chatcmpl-abc123', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=None, role=None, function_call=None, tool_calls=None), logprobs=None)], created=1727033797, model='gpt-4o-2024-08-06', object='chat.completion.chunk', system_fingerprint='fp_5050236cbd', usage=Usage(completion_tokens=41, prompt_tokens=2456, total_tokens=2497, completion_tokens_details=None))
done ModelResponse(id='chatcmpl-abc123', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=None, role=None, function_call=None, tool_calls=None), logprobs=None)], created=1727033797, model='gpt-4o-2024-08-06', object='chat.completion.chunk', system_fingerprint='fp_5050236cbd', usage=Usage(completion_tokens=112, prompt_tokens=2456, total_tokens=2568, completion_tokens_details=None))
Traceback (most recent call last):
  File "/Users/user/code/repo/.venv/lib/python3.12/site-packages/litellm/llms/OpenAI/openai.py", line 1106, in async_streaming
    headers, response = await self.make_openai_chat_completion_request(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/code/repo/.venv/lib/python3.12/site-packages/litellm/llms/OpenAI/openai.py", line 658, in make_openai_chat_completion_request
    raise e
  File "/Users/user/code/repo/.venv/lib/python3.12/site-packages/litellm/llms/OpenAI/openai.py", line 646, in make_openai_chat_completion_request
    await openai_aclient.chat.completions.with_raw_response.create(
  File "/Users/user/code/repo/.venv/lib/python3.12/site-packages/openai/_legacy_response.py", line 370, in wrapped
    return cast(LegacyAPIResponse[R], await func(*args, **kwargs))
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/code/repo/.venv/lib/python3.12/site-packages/openai/resources/chat/completions.py", line 1412, in create
    return await self._post(
           ^^^^^^^^^^^^^^^^^
  File "/Users/user/code/repo/.venv/lib/python3.12/site-packages/openai/_base_client.py", line 1829, in post
    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/code/repo/.venv/lib/python3.12/site-packages/openai/_base_client.py", line 1523, in request
    return await self._request(
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/code/repo/.venv/lib/python3.12/site-packages/openai/_base_client.py", line 1624, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in project proj_abc123 organization org-def456 on tokens per min (TPM): Limit 30000, Used 29253, Requested 2661. Please try again in 3.827s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/user/code/repo/.venv/lib/python3.12/site-packages/litellm/main.py", line 428, in acompletion
    response = await init_response
               ^^^^^^^^^^^^^^^^^^^
  File "/Users/user/code/repo/.venv/lib/python3.12/site-packages/litellm/llms/OpenAI/openai.py", line 1171, in async_streaming
    raise OpenAIError(
litellm.llms.OpenAI.openai.OpenAIError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in project proj_abc123 organization org-def456 on tokens per min (TPM): Limit 30000, Used 29253, Requested 2661. Please try again in 3.827s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/user/code/repo/paperqa/play.py", line 54, in <module>
    asyncio.run(main())
  File "/Users/user/.pyenv/versions/3.12.5/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/Users/user/.pyenv/versions/3.12.5/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/.pyenv/versions/3.12.5/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/Users/user/code/repo/paperqa/play.py", line 50, in main
    await asyncio.gather(*[router_completion(router) for _ in range(16)])
  File "/Users/user/code/repo/paperqa/play.py", line 18, in router_completion
    completion = await router.acompletion(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/code/repo/.venv/lib/python3.12/site-packages/litellm/router.py", line 737, in acompletion
    raise e
  File "/Users/user/code/repo/.venv/lib/python3.12/site-packages/litellm/router.py", line 725, in acompletion
    response = await self.async_function_with_fallbacks(**kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/code/repo/.venv/lib/python3.12/site-packages/litellm/router.py", line 3041, in async_function_with_fallbacks
    raise original_exception
  File "/Users/user/code/repo/.venv/lib/python3.12/site-packages/litellm/router.py", line 2895, in async_function_with_fallbacks
    response = await self.async_function_with_retries(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/code/repo/.venv/lib/python3.12/site-packages/litellm/router.py", line 3170, in async_function_with_retries
    raise original_exception
  File "/Users/user/code/repo/.venv/lib/python3.12/site-packages/litellm/router.py", line 3085, in async_function_with_retries
    response = await original_function(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/code/repo/.venv/lib/python3.12/site-packages/litellm/router.py", line 876, in _acompletion
    raise e
  File "/Users/user/code/repo/.venv/lib/python3.12/site-packages/litellm/router.py", line 843, in _acompletion
    response = await _response
               ^^^^^^^^^^^^^^^
  File "/Users/user/code/repo/.venv/lib/python3.12/site-packages/litellm/utils.py", line 1589, in wrapper_async
    raise e
  File "/Users/user/code/repo/.venv/lib/python3.12/site-packages/litellm/utils.py", line 1409, in wrapper_async
    result = await original_function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/code/repo/.venv/lib/python3.12/site-packages/litellm/main.py", line 450, in acompletion
    raise exception_type(
          ^^^^^^^^^^^^^^^
  File "/Users/user/code/repo/.venv/lib/python3.12/site-packages/litellm/utils.py", line 8207, in exception_type
    raise e
  File "/Users/user/code/repo/.venv/lib/python3.12/site-packages/litellm/utils.py", line 6565, in exception_type
    raise RateLimitError(
litellm.exceptions.RateLimitError: litellm.RateLimitError: RateLimitError: OpenAIException - Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in project proj_abc123 organization org-def456 on tokens per min (TPM): Limit 30000, Used 29253, Requested 2661. Please try again in 3.827s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}
Received Model Group=gpt-4o-2024-08-06
Available Model Group Fallbacks=None LiteLLM Retried: 1 times, LiteLLM Max Retries: 2

You can see retrying taking place, but then we still hit TPM limits. So I think the LiteLLM Router still doesn't quite respect tokens per minute. Note the prompt specified here:

  • Involves a small number of requests (16 completions), so it passes rpm checks
  • Involves a huge number of tokens per minute, so it fails tpm checks

jamesbraza avatar Sep 22 '24 18:09 jamesbraza

Hi @jamesbraza can you share the prompt? I ran with a mock server, and could see the router retrying the request based on the given rate limit.

krrishdholakia avatar Sep 23 '24 13:09 krrishdholakia

More on this:

  • I'm not sure how we can respect (i.e. prevent exceeding) concurrent TPM limits
  • The RPM limits are respected by using a semaphore.

Current approach is to just make sure we respect the openai/azure retry-after header, and retry accordingly.

Open to suggestions @whitead @jamesbraza

krrishdholakia avatar Sep 23 '24 13:09 krrishdholakia

Thanks for reopening. Yeah, retrying alleviates intermittent tpm issues, but it is not a general solution. As an aside, if LiteLLM could also support retry-after for Anthropic (requested here), that would be useful to us.

To support tpm, I believe you'd need to have a system that:

  • Uses tiktoken (or similar packages) to compute tokens when applicable (e.g. model is an OpenAI model)
  • Otherwise, uses the max_tokens field of a request (when present). So if someone has max_tokens of 1000 with Anthropic, you can factor that into tpm calculations
  • Recalibrates using a response's token counts. For example, I believe Anthropic gives token counts back in its response
  • Otherwise, falls back on a simple heuristic, something like assuming one token per word, or a basic formula
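
For concreteness, a rough sketch of that kind of accounting (hand-rolled and illustrative, not LiteLLM code; the class and helper names are made up):

import asyncio
import time

try:
    import tiktoken  # optional, for OpenAI-style token counting
except ImportError:
    tiktoken = None


def estimate_tokens(text, model, max_tokens=None):
    # 1) tiktoken when applicable, 2) max_tokens if present, 3) ~1 token per word
    if tiktoken is not None and model.startswith("gpt"):
        return len(tiktoken.get_encoding("cl100k_base").encode(text))
    if max_tokens is not None:
        return max_tokens
    return len(text.split())


class TPMLimiter:
    def __init__(self, tpm):
        self.tpm = tpm
        self.window = []  # (timestamp, tokens) pairs within the last 60s
        self._lock = asyncio.Lock()

    async def acquire(self, estimated_tokens):
        # block until the estimated cost fits inside the 60s moving window
        while True:
            async with self._lock:
                now = time.monotonic()
                self.window = [(t, n) for t, n in self.window if now - t < 60]
                if sum(n for _, n in self.window) + estimated_tokens <= self.tpm:
                    self.window.append((now, estimated_tokens))
                    return
            await asyncio.sleep(1)

    async def recalibrate(self, estimated_tokens, actual_tokens):
        # adjust the window once the response reports real usage
        async with self._lock:
            self.window.append((time.monotonic(), actual_tokens - estimated_tokens))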

What do you think of that?

jamesbraza avatar Sep 23 '24 21:09 jamesbraza

Recalibrates using a response's token counts

This is why it doesn't work for concurrent requests @jamesbraza, since they happen at the same time.

I can investigate trying to create a lock which maybe uses a model group's max_tokens, but token counting with tiktoken for non-openai models can be pretty inconsistent.

Using rpm limits seems like a much safer way of implementing concurrency limits.

krrishdholakia avatar Sep 27 '24 16:09 krrishdholakia

Hi @krrishdholakia,

Thanks for being responsive on this. I understand where you're coming from here - it's complicated to respect TPMs, and the approach you took in LiteLLM is reasonable. Maybe, to help future developers, you could just note in the router documentation that the TPM limits are only a hint for enacting RPM limits.

We're not blocked on this anymore as we rolled our own rate limits using the limits package: https://github.com/Future-House/paper-qa/pull/520.
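
As an aside, a minimal sketch of TPM limiting with the limits package (this is not the paper-qa implementation linked above, and it assumes the package's async moving-window API and its cost parameter):

import asyncio

from limits import parse
from limits.aio.storage import MemoryStorage
from limits.aio.strategies import MovingWindowRateLimiter

limiter = MovingWindowRateLimiter(MemoryStorage())
tpm_limit = parse("30000/minute")


async def wait_for_token_budget(estimated_tokens: int) -> None:
    # block until the estimated token cost fits in the moving 1-minute window
    while not await limiter.hit(tpm_limit, "gpt-4o", cost=estimated_tokens):
        await asyncio.sleep(1)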

whitead avatar Oct 15 '24 01:10 whitead

Hi @whitead i'm trying to learn how you handled the tpm limits

Could you point me to what code file is handling this here? https://github.com/Future-House/paper-qa/pull/520

krrishdholakia avatar Oct 15 '24 17:10 krrishdholakia

Hey @jamesbraza i saw your PR mentioned race conditions with the router

What were they?

krrishdholakia avatar Oct 15 '24 18:10 krrishdholakia

Of course! https://github.com/Future-House/paper-qa/blob/main/paperqa/rate_limiter.py

whitead avatar Oct 15 '24 18:10 whitead

Hey @jamesbraza i saw https://github.com/Future-House/paper-qa/pull/563 mentioned race conditions with the router

What were they?

It's https://github.com/Future-House/paper-qa/pull/563 where LiteLLM randomly forgets previously set up deployment hashes

jamesbraza avatar Oct 15 '24 18:10 jamesbraza

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

github-actions[bot] avatar Jan 28 '25 02:01 github-actions[bot]