
[Bug]: NO FALLBACK when streaming and [Bug]: litellm.InternalServerError: AnthropicException - Overloaded. Handle with litellm.InternalServerError.

Open laol777 opened this issue 1 year ago • 2 comments

What happened?

No fallback when streaming. This is basically the same problem as https://github.com/BerriAI/litellm/issues/6532, with a very similar config:

router = Router(
    model_list=settings.LITELLM_MODEL_DEPLOYMENTS,
    num_retries=3,
    retry_after=5,  # waits a minimum of 5s before retrying a request
    timeout=290,
    allowed_fails=3,  # cooldown the model if it fails > 3 calls in a minute
    cooldown_time=10,  # cooldown the deployment for 10 seconds if num_fails > allowed_fails
    default_fallbacks=["claude-3-5-sonnet-aws"],
)
# model = claude 3.5 (Anthropic)
# fallback model = claude 3.5 (AWS Bedrock)

Relevant log output

No response

Twitter / LinkedIn details

No response

laol777 avatar Nov 28 '24 11:11 laol777

@krrishdholakia I found a way to reproduce this problem. Add these two lines of code:

type_chunk = "error"
chunk["error"] = {"message": "Overload test error"}

here https://github.com/BerriAI/litellm/blob/main/litellm/llms/anthropic/chat/handler.py#L561
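
For context, those two lines force the Anthropic chunk parser down its error branch on every streamed event, which simulates an "Overloaded" error arriving mid-stream. A rough sketch of the idea (the surrounding structure and names are assumed for illustration, not copied from handler.py):

# Hedged sketch only - the real logic lives in the streaming chunk parser in
# litellm/llms/anthropic/chat/handler.py; this just shows what the injection does.
def chunk_parser(chunk: dict):
    type_chunk = chunk.get("type", "")

    # injected for the repro: treat every chunk as an Anthropic error event
    type_chunk = "error"
    chunk["error"] = {"message": "Overload test error"}

    if type_chunk == "error":
        # the handler surfaces this as an exception while the stream is being iterated,
        # which is the point at which the router should (but does not) fall back
        raise ValueError(f"Anthropic stream error: {chunk['error']}")
    # ... normal handling for content_block_delta, message_start, etc. ...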

then call with this configuration:

import asyncio

from litellm import Router

model_deployments = [
    {
        "model_name": "claude-3-5-sonnet",
        "litellm_params": {
            "model": "claude-3-5-sonnet-20241022",
            "api_key": "key",
        },
        "rpm": 4000,
    },
    {
        "model_name": "claude-3-5-sonnet-aws",
        "litellm_params": {
            "model": "bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0",
            "aws_access_key_id": "key",
            "aws_secret_access_key": "key",
            "aws_region_name": "us-east-1",
        },
        "rpm": 50,
    },
]

messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a helpful assistant",
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
    {"role": "user", "content": "hello"},
]

stop_sequences = None
_router = Router(
    model_list=model_deployments,
    default_fallbacks=["claude-3-5-sonnet-aws"],
)


async def async_completion():
    response = await _router.acompletion(
        model="claude-3-5-sonnet",
        messages=messages,
        stream=True,
        stop=stop_sequences,
        stream_options={"include_usage": True},
    )

    async for chunk in response:
        pass

    print(42)

asyncio.run(async_completion())

laol777 avatar Dec 04 '24 15:12 laol777

thanks @laol777 will run this today

krrishdholakia avatar Dec 04 '24 16:12 krrishdholakia

@krrishdholakia hello, any updates on this issue?

laol777 avatar Dec 09 '24 15:12 laol777

hmm i would assume this issue is caused by the request failing while iterating through the stream.

krrishdholakia avatar Dec 17 '24 17:12 krrishdholakia

This PR was not merged into the main branch https://github.com/BerriAI/litellm/pull/5542/files

duodecanol avatar Dec 26 '24 09:12 duodecanol

I have a very similar issue where httpcore.ReadError is raised while iterating through the stream. The fallbacks are not triggered, and the PR mentioned above does not fix this. If any error happens while iterating through the stream, the request ends with an error, without any retry or fallback.

You can run this test to reproduce:

import httpcore
import pytest
from unittest.mock import patch

import litellm
from litellm import Router


@pytest.mark.asyncio
async def test_streaming_fallbacks():
    litellm.set_verbose = True

    router = Router(
        model_list=[
            {
                "model_name": "anthropic/claude-3-5-sonnet-20240620",
                "litellm_params": {
                    "model": "anthropic/claude-3-5-sonnet-20240620",
                },
            },
            {
                "model_name": "gpt-3.5-turbo",
                "litellm_params": {
                    "model": "gpt-3.5-turbo",
                    "mock_response": "This is a mock response",
                },
            },
        ],
        fallbacks=[{"anthropic/claude-3-5-sonnet-20240620": ["gpt-3.5-turbo"]}],
        num_retries=3,
    )

    # Simulate a transport error raised while iterating the Anthropic stream.
    with patch(
        "litellm.llms.anthropic.chat.handler.ModelResponseIterator.__anext__",
        side_effect=httpcore.ReadError("Simulated error"),
    ):
        response = await router.acompletion(
            model="anthropic/claude-3-5-sonnet-20240620",
            messages=[{"role": "user", "content": "Hey, how's it going?"}],
            stream=True,
        )
        async for chunk in response:
            print(chunk)

adrian-streetbeat avatar Feb 06 '25 14:02 adrian-streetbeat

hi @adrian-streetbeat would you expect a retry/fallback mid-stream?

krrishdholakia avatar Feb 06 '25 14:02 krrishdholakia

For my similar case #8632, it is set up as streaming, but the failure seems to occur before I get any data - not sure if that counts as mid-stream. In this case I would expect the fallback to work, yes.

clarity99 avatar Feb 21 '25 08:02 clarity99

Hey @clarity99

litellm-1 |   File "/usr/lib/python3.13/site-packages/litellm/proxy/proxy_server.py", line 3018, in async_data_generator
litellm-1 |     async for chunk in response:
litellm-1 |     ...<14 lines>...
litellm-1 |     yield f"data: {str(e)}\n\n"

based on your stacktrace - it looks like it happened after the stream had started

since this happens before any data - maybe this is a situation where gemini is returning the error in the first streamed response (something we should handle)

i'll try to repro this and follow up

krrishdholakia avatar Feb 21 '25 15:02 krrishdholakia

Same problem here with streaming and Bedrock. From time to time it throws:

serviceUnavailableException {"message":"Bedrock is unable to process your request."}

which does not trigger fallbacks, even though definitely no tokens were streamed.

jonas-lyrebird-health avatar Feb 23 '25 10:02 jonas-lyrebird-health

Also same here, with no output tokens streamed.

Arokha avatar Feb 26 '25 03:02 Arokha

I've picked this up -- I need help with the fallback behaviour.

When we hit an error mid-stream, the user has already iterated through a few tokens from this stream. If we retry with a fallback model, should the response continue streaming from the new response, or what is the expected behavior here?

Should we have a special case where the error is in the first chunk of the stream? Retrying with a fallback would make sense there, as the user has not consumed any tokens.

madhukar01 avatar Mar 05 '25 00:03 madhukar01

Hey @madhukar01

Why not

  • if first chunk of stream -> retry / fallback as expected (user saw nothing, so no impact)
  • if mid-stream -> require a flag (ideally one that can be passed in dynamically via litellm_params OR globally via litellm_settings); this allows the developer to opt into this behaviour

krrishdholakia avatar Mar 05 '25 01:03 krrishdholakia

Is this issue still being worked on?

My current workaround is to catch serviceUnavailableException and implement my own fallback / retry.

It is still a bit annoying since I configured my own litellm failure webhook which should only trigger if the fallback is not called.
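
For anyone else hitting this, a minimal sketch of that kind of manual fallback, assuming the error surfaces as an exception while iterating the stream (the model names are the deployments from the config earlier in this thread; this is a workaround sketch, not litellm's built-in behaviour):

async def stream_with_manual_fallback(router, messages):
    got_content = False
    try:
        response = await router.acompletion(
            model="claude-3-5-sonnet",
            messages=messages,
            stream=True,
        )
        async for chunk in response:
            got_content = True
            yield chunk
    except Exception:
        # If tokens were already streamed, re-raise to avoid sending duplicate content.
        if got_content:
            raise
        # Nothing was streamed yet (e.g. serviceUnavailableException on the first event),
        # so retry the whole request on the fallback deployment.
        response = await router.acompletion(
            model="claude-3-5-sonnet-aws",
            messages=messages,
            stream=True,
        )
        async for chunk in response:
            yield chunk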

jonas-lyrebird-health avatar Mar 13 '25 03:03 jonas-lyrebird-health

Hey @jonas-lyrebird-health a PR here is welcome, if you're open to it

krrishdholakia avatar Mar 13 '25 04:03 krrishdholakia

this one still happens, especially on gemini (ai studio) endpoints.

https://github.com/user-attachments/assets/61dde4aa-a00a-45d1-a753-4aa705c747d5

yigitkonur avatar Apr 08 '25 20:04 yigitkonur

As a lot more people are going to use AI Studio with Gemini 2.5 Pro, and as their rate limits are a serious issue, I believe many people will run into this problem. I know you guys @krrishdholakia and @ishaan-jaff are busy scaling up the company, but this really impacts our workflow, and a PR for an issue like this is not easy for someone else to write. Any chance to have a look? (I see even some Bedrock users had these mid-stream issues.) A hotfix would be really appreciated.

yigitkonur avatar Apr 08 '25 20:04 yigitkonur

@jonas-lyrebird-health @Arokha @clarity99 any workaround you've found?

yigitkonur avatar Apr 08 '25 20:04 yigitkonur

Hey @madhukar01

Why not

  • if first chunk of stream -> retry / fallback as expected (user saw nothing, so no impact)
  • if mid-stream -> require a flag (ideally one that can be passed in dynamically via litellm_params OR globally via litellm_settings); this allows the developer to opt into this behaviour

It seems that has been done by #9809 @krrishdholakia

guanbo avatar May 15 '25 06:05 guanbo

Closing as this is now fixed on main

krrishdholakia avatar Aug 02 '25 18:08 krrishdholakia

https://github.com/user-attachments/assets/1740b3c9-6f63-4c1b-82cd-6f9228d798ed

krrishdholakia avatar Aug 02 '25 18:08 krrishdholakia