
[Bug]: Cannot get past 50 RPS

Open vutrung96 opened this issue 1 year ago • 5 comments

What happened?

I have OpenAI tier 5 usage, which should give me 30,000 RPM = 500 RPS with "gpt-4o-mini". However, I struggle to get past 50 RPS.

A minimal reproduction:

import asyncio

from litellm import acompletion

async def main():
    tasks = [
        acompletion(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You're an agent who answers yes or no"},
                {"role": "user", "content": "Is the sky blue?"},
            ],
        )
        for _ in range(2000)
    ]
    return await asyncio.gather(*tasks)

asyncio.run(main())

I only get 50 items/second as opposed to ~500 items/second when sending raw HTTP requests.

Relevant log output

 16%|█████████████████████▌                                                                                                                 | 320/2000 [00:09<00:40, 41.49it/s]


vutrung96 avatar Nov 05 '24 01:11 vutrung96

Hi @vutrung96, looking into this. How do you get the % complete log output?

ishaan-jaff avatar Nov 14 '24 16:11 ishaan-jaff

Hi @ishaan-jaff, I was just using tqdm.
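
Roughly something like this (a sketch using tqdm's asyncio helper, not my exact script):

import asyncio

from litellm import acompletion
from tqdm.asyncio import tqdm

async def main():
    tasks = [
        acompletion(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "Is the sky blue?"}],
        )
        for _ in range(2000)
    ]
    # tqdm.gather behaves like asyncio.gather but renders an it/s progress bar.
    return await tqdm.gather(*tasks)

asyncio.run(main())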

vutrung96 avatar Nov 16 '24 18:11 vutrung96

Hi @ishaan-jaff, any updates on this? I'm also facing this issue!

CharlieJCJ avatar Nov 19 '24 03:11 CharlieJCJ

Hi @vutrung96 @CharlieJCJ, do you see the issue with litellm.router too? https://docs.litellm.ai/docs/routing

It would help me if you could test with the litellm Router too.
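
Something like this would be a fair comparison (a rough sketch; the model_list entry is just an example):

import asyncio

from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-4o-mini",
            "litellm_params": {"model": "gpt-4o-mini"},
        }
    ]
)

async def main():
    # Same burst of requests as the original repro, but issued through the Router.
    tasks = [
        router.acompletion(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "Is the sky blue?"}],
        )
        for _ in range(2000)
    ]
    return await asyncio.gather(*tasks)

asyncio.run(main())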

ishaan-jaff avatar Nov 21 '24 16:11 ishaan-jaff

Hi @ishaan-jaff, we tracked down the root cause of the issue.

LiteLLM uses the official OpenAI Python client:

client: Optional[Union[OpenAI, AsyncOpenAI]] = None,

The official OpenAI client has performance issues with high numbers of concurrent requests due to issues in httpx:

  • https://github.com/openai/openai-python/issues/1596

The issues in httpx stem from several factors related to anyio vs. asyncio:

  • https://github.com/encode/httpx/issues/3215

These are addressed in the open PRs below:

  • https://github.com/encode/httpcore/pull/922
  • https://github.com/encode/httpcore/pull/928
  • https://github.com/encode/httpcore/pull/929
  • https://github.com/encode/httpcore/pull/930
  • https://github.com/encode/httpcore/pull/953

We ran into this when implementing litellm as the backend for our synthetic data engine:

  • https://github.com/bespokelabsai/curator/pull/141

When using our own OpenAI client (with aiohttp instead of httpx), we saturate the highest rate limits (30,000 requests per minute on gpt-4o-mini tier 5). When using litellm, the performance issues cap us well under the highest rate limit (around 200 requests per second, i.e. 12,000 requests per minute).
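
For context, the direct aiohttp path looks roughly like this (a simplified sketch, not our actual curator implementation; the pool limit of 500 is an arbitrary choice):

import asyncio
import os

import aiohttp

async def chat_completion(session: aiohttp.ClientSession, prompt: str) -> dict:
    # Plain POST to the OpenAI REST API; no SDK involved.
    async with session.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": prompt}],
        },
    ) as resp:
        resp.raise_for_status()
        return await resp.json()

async def main() -> None:
    # A generous connection pool so throughput is not capped by the client.
    connector = aiohttp.TCPConnector(limit=500)
    async with aiohttp.ClientSession(connector=connector) as session:
        results = await asyncio.gather(
            *(chat_completion(session, "Is the sky blue?") for _ in range(2000))
        )
    print(len(results), "responses")

asyncio.run(main())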

RyanMarten avatar Dec 04 '24 14:12 RyanMarten

@RyanMarten you are right! Just ran a load test to confirm. The right side (with aiohttp) shows about 10x more RPS.

[Screenshot: load test comparison, Jan 2, 2025]

ishaan-jaff avatar Jan 02 '25 23:01 ishaan-jaff

@RyanMarten started work on this

https://github.com/BerriAI/litellm/pull/7514

  • added a new custom_llm_provider=aiohttp_openai that uses aiohttp for calling logic

@RyanMarten, @vutrung96 and @CharlieJCJ, can y'all help us test this change as we start rolling it out?

As of now we have only added support for non-streaming. I can let you know once streaming support is added too.
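
For testing, the new route is selected via the model prefix, roughly like this (non-streaming only for now):

import asyncio

from litellm import acompletion

async def main():
    # The aiohttp_openai/ prefix selects the new aiohttp-based calling logic.
    return await acompletion(
        model="aiohttp_openai/gpt-4o-mini",
        messages=[{"role": "user", "content": "Is the sky blue?"}],
    )

asyncio.run(main())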

ishaan-jaff avatar Jan 03 '25 06:01 ishaan-jaff

@ishaan-jaff Thanks for creating a PR for this! We can certainly help test the change 😄. I'll run a benchmarking test with model=aiohttp_openai/gpt-4o-mini

Our use-case is non-streaming, so that shouldn't be a problem.

RyanMarten avatar Jan 07 '25 00:01 RyanMarten

Here is our benchmarking using the curator request processor and viewer (with different backends). I see that this was released in https://github.com/BerriAI/litellm/releases/tag/v1.56.8. I upgraded litellm to the latest with poetry add litellm@latest, which is 1.57.0.

from bespokelabs.curator import LLM
from datasets import Dataset

dataset = Dataset.from_dict({"prompt": ["write me a poem"] * 100_000})

(1) our own aiohttp backend

llm = LLM(
    prompt_func=lambda row: row["prompt"],
    model_name="gpt-4o-mini",
    backend="openai",
)
dataset = llm(dataset)
[Screenshot: benchmark results, curator aiohttp backend]

(2) default litellm backend

llm = LLM(
    prompt_func=lambda row: row["prompt"],
    model_name="gpt-4o-mini",
    backend="litellm",
)
dataset = llm(dataset)
[Screenshot: benchmark results, default litellm backend]

(3) litellm backend with aiohttp_openai

llm = LLM(
    prompt_func=lambda row: row["prompt"],
    model_name="aiohttp_openai/gpt-4o-mini",
    backend="litellm",
)
dataset = llm(dataset)
[Screenshot: benchmark results, litellm backend with aiohttp_openai]

For some reason I'm not seeing an improvement in performance.

RyanMarten avatar Jan 07 '25 01:01 RyanMarten

Hmm, that's odd - we see RPS going much higher in our testing.

Do you see anything off with our implementation (I know you mentioned you also implemented aiohttp)?

https://github.com/BerriAI/litellm/blob/main/litellm/llms/custom_httpx/aiohttp_handler.py#L30

ishaan-jaff avatar Jan 07 '25 01:01 ishaan-jaff

Ohh - I think I know the issue: it's still getting routed to the OpenAI SDK when you pass aiohttp_openai/gpt-4o-mini.

(We route to the OpenAI SDK if the model is recognized as an OpenAI model.)

In my testing I was using aiohttp_openai/mock_model.

Will update this thread to ensure aiohttp_openai/gpt-4o-mini uses aiohttp_openai.

ishaan-jaff avatar Jan 07 '25 01:01 ishaan-jaff

I'll take a look!

For reference here is our aiohttp implementation: https://github.com/bespokelabsai/curator/blob/0c7cf21a5af0a228904906de417d902fac5c2b5c/src/bespokelabs/curator/request_processor/online/openai_online_request_processor.py#L167

And here is how we are using litellm as a backend: https://github.com/bespokelabsai/curator/blob/0c7cf21a5af0a228904906de417d902fac5c2b5c/src/bespokelabs/curator/request_processor/online/litellm_online_request_processor.py#L210

RyanMarten avatar Jan 07 '25 01:01 RyanMarten

Ah yes, what you said about the routing makes sense!

When the fix is in, I'll try my benchmark again and post the results 👍

RyanMarten avatar Jan 07 '25 01:01 RyanMarten

Fixed here @RyanMarten: https://github.com/BerriAI/litellm/pull/7598

Could you test on our new release, v1.57.2? (It will be out in 12 hrs.)

ishaan-jaff avatar Jan 07 '25 05:01 ishaan-jaff

@ishaan-jaff - yes absolutely (looking out for the release)

RyanMarten avatar Jan 07 '25 18:01 RyanMarten

Sorry, CI/CD is causing issues; will update here once the new release is out.

ishaan-jaff avatar Jan 07 '25 18:01 ishaan-jaff

@ishaan-jaff Also curious, what software / visualization are you using for your load tests?

RyanMarten avatar Jan 08 '25 23:01 RyanMarten

@RyanMarten - can you help test this: https://github.com/BerriAI/litellm/releases/tag/v1.57.3

Also curious, what software / visualization are you using for your load tests?

I was using Locust.
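
Roughly this kind of locustfile (a sketch; the host, endpoint, and key below are placeholders, not my actual setup):

# Run with: locust -f locustfile.py --host http://localhost:4000
from locust import HttpUser, task, between

class ChatCompletionUser(HttpUser):
    wait_time = between(0, 0.1)  # near-continuous requests per simulated user

    @task
    def chat_completion(self):
        # POST to an OpenAI-compatible chat completions endpoint and let
        # Locust chart the RPS and latency.
        self.client.post(
            "/v1/chat/completions",
            json={
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": "Is the sky blue?"}],
            },
            headers={"Authorization": "Bearer sk-1234"},
        )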

ishaan-jaff avatar Jan 09 '25 02:01 ishaan-jaff

poetry add litellm@latest
Using version ^1.57.4 for litellm

from bespokelabs.curator import LLM
from datasets import Dataset

dataset = Dataset.from_dict({"prompt": ["write me a poem"] * 100_000})

llm = LLM(
    prompt_func=lambda row: row["prompt"],
    model_name="aiohttp_openai/gpt-4o-mini",
    backend="litellm",
)

dataset = llm(dataset)

I'm getting this error now, which I wasn't before. I think this is an issue on our side; let me test.

Traceback (most recent call last):
  File "/Users/ryan/curator/.venv/lib/python3.12/site-packages/litellm/llms/custom_httpx/aiohttp_handler.py", line 112, in _make_common_sync_call
    response = sync_httpx_client.post(
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ryan/curator/.venv/lib/python3.12/site-packages/litellm/llms/custom_httpx/http_handler.py", line 528, in post
    raise e
  File "/Users/ryan/curator/.venv/lib/python3.12/site-packages/litellm/llms/custom_httpx/http_handler.py", line 509, in post
    response.raise_for_status()
  File "/Users/ryan/curator/.venv/lib/python3.12/site-packages/httpx/_models.py", line 763, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '400 Bad Request' for url 'https://api.openai.com/v1/chat/completions'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/ryan/curator/.venv/lib/python3.12/site-packages/litellm/main.py", line 1501, in completion
    response = base_llm_aiohttp_handler.completion(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ryan/curator/.venv/lib/python3.12/site-packages/litellm/llms/custom_httpx/aiohttp_handler.py", line 302, in completion
    response = self._make_common_sync_call(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ryan/curator/.venv/lib/python3.12/site-packages/litellm/llms/custom_httpx/aiohttp_handler.py", line 132, in _make_common_sync_call
    raise self._handle_error(e=e, provider_config=provider_config)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ryan/curator/.venv/lib/python3.12/site-packages/litellm/llms/custom_httpx/aiohttp_handler.py", line 389, in _handle_error
    raise provider_config.get_error_class(
litellm.llms.openai.common_utils.OpenAIError: {
    "error": {
        "message": "you must provide a model parameter",
        "type": "invalid_request_error",
        "param": null,
        "code": null
    }
}


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/ryan/curator/../dcft_dump/SPEED_TEST.py", line 6, in <module>
    llm = LLM(
          ^^^^
  File "/Users/ryan/curator/src/bespokelabs/curator/llm/llm.py", line 111, in __init__
    self._request_processor = _RequestProcessorFactory.create(backend_params, batch=batch, response_format=response_format, backend=backend)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ryan/curator/src/bespokelabs/curator/request_processor/_factory.py", line 127, in create
    _request_processor = LiteLLMOnlineRequestProcessor(config)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ryan/curator/src/bespokelabs/curator/request_processor/online/litellm_online_request_processor.py", line 46, in __init__
    self.header_based_max_requests_per_minute, self.header_based_max_tokens_per_minute = self.get_header_based_rate_limits()
                                                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ryan/curator/src/bespokelabs/curator/request_processor/online/litellm_online_request_processor.py", line 154, in get_header_based_rate_limits
    headers = self.test_call()
              ^^^^^^^^^^^^^^^^
  File "/Users/ryan/curator/src/bespokelabs/curator/request_processor/online/litellm_online_request_processor.py", line 127, in test_call
    completion = litellm.completion(
                 ^^^^^^^^^^^^^^^^^^^
  File "/Users/ryan/curator/.venv/lib/python3.12/site-packages/litellm/utils.py", line 1022, in wrapper
    raise e
  File "/Users/ryan/curator/.venv/lib/python3.12/site-packages/litellm/utils.py", line 900, in wrapper
    result = original_function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ryan/curator/.venv/lib/python3.12/site-packages/litellm/main.py", line 2955, in completion
    raise exception_type(
          ^^^^^^^^^^^^^^^
  File "/Users/ryan/curator/.venv/lib/python3.12/site-packages/litellm/litellm_core_utils/exception_mapping_utils.py", line 2189, in exception_type
    raise e
  File "/Users/ryan/curator/.venv/lib/python3.12/site-packages/litellm/litellm_core_utils/exception_mapping_utils.py", line 2158, in exception_type
    raise APIConnectionError(
litellm.exceptions.APIConnectionError: litellm.APIConnectionError: Aiohttp_openaiException - {
    "error": {
        "message": "you must provide a model parameter",
        "type": "invalid_request_error",
        "param": null,
        "code": null
    }
}

RyanMarten avatar Jan 10 '25 00:01 RyanMarten

Ah, this is because we do a test call with completion instead of acompletion:

completion = litellm.completion(model="aiohttp_openai/gpt-4o-mini", messages=[{"role": "user", "content": "hi"}])

This fails with an unintuitive error message:

litellm.exceptions.APIConnectionError: litellm.APIConnectionError: Aiohttp_openaiException - {
    "error": {
        "message": "you must provide a model parameter",
        "type": "invalid_request_error",
        "param": null,
        "code": null
    }
}

What I can do is just switch this call to use acompletion as well.
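
Roughly (a sketch of the curator-side change, not litellm code):

import asyncio

import litellm

# Run the probe through the async path, which is what aiohttp_openai supports today.
completion = asyncio.run(
    litellm.acompletion(
        model="aiohttp_openai/gpt-4o-mini",
        messages=[{"role": "user", "content": "hi"}],
    )
)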

RyanMarten avatar Jan 10 '25 00:01 RyanMarten

OK, now I'm running into an issue in the main loop:

2025-01-09 16:26:13,066 - bespokelabs.curator.request_processor.online.base_online_request_processor - WARNING - Encountered 'APIConnectionError: litellm.APIConnectionError: Aiohttp_openaiException - Event loop is closed' during attempt 1 of 10 while processing request 0

@vutrung96 could you take a look at this since you wrote the custom event loop handling?

RyanMarten avatar Jan 10 '25 00:01 RyanMarten

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

github-actions[bot] avatar Apr 11 '25 00:04 github-actions[bot]