
[Bug]: NO FALLBACK when streaming and [Bug]: litellm.InternalServerError: AnthropicException - Overloaded. Handle with litellm.InternalServerError.

Open laol777 opened this issue 1 year ago • 2 comments

What happened?

No fallback when streaming. This is basically the same problem as https://github.com/BerriAI/litellm/issues/6532, with a very similar config:

router = Router(
    model_list=settings.LITELLM_MODEL_DEPLOYMENTS,
    num_retries=3,
    retry_after=5,  # waits a minimum of 5s before retrying a request
    timeout=290,
    allowed_fails=3,  # cooldown the model if it fails > 3 calls in a minute
    cooldown_time=10,  # cooldown the deployment for 10 seconds if num_fails > allowed_fails
    default_fallbacks=["claude-3-5-sonnet-aws"],
)
# model = claude 3.5 (Anthropic)
# fallback model = claude 3.5 (AWS Bedrock)

Relevant log output

No response

Twitter / LinkedIn details

No response

laol777 avatar Nov 28 '24 11:11 laol777

@krrishdholakia I found a way to reproduce this problem. Add these two lines of code:

type_chunk = "error"
chunk["error"] = {"message": "Overload test error"}

here https://github.com/BerriAI/litellm/blob/main/litellm/llms/anthropic/chat/handler.py#L561
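
For context, those two lines force the Anthropic chunk parser down its error branch on every streamed event, which simulates an "Overloaded" error arriving mid-stream. A rough sketch of the idea (the surrounding structure and names are assumed for illustration, not copied from handler.py):

# Hedged sketch only - the real logic lives in the streaming chunk parser in
# litellm/llms/anthropic/chat/handler.py; this just shows what the injection does.
def chunk_parser(chunk: dict):
    type_chunk = chunk.get("type", "")

    # injected for the repro: treat every chunk as an Anthropic error event
    type_chunk = "error"
    chunk["error"] = {"message": "Overload test error"}

    if type_chunk == "error":
        # the handler surfaces this as an exception while the stream is being iterated,
        # which is the point at which the router should (but does not) fall back
        raise ValueError(f"Anthropic stream error: {chunk['error']}")
    # ... normal handling for content_block_delta, message_start, etc. ...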

then call with this configuration:

import asyncio

from litellm import Router

model_deployments = [
    {
        "model_name": "claude-3-5-sonnet",
        "litellm_params": {
            "model": "claude-3-5-sonnet-20241022",
            "api_key": "key",
        },
        "rpm": 4000,
    },
    {
        "model_name": "claude-3-5-sonnet-aws",
        "litellm_params": {
            "model": "bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0",
            "aws_access_key_id": "key",
            "aws_secret_access_key": "key",
            "aws_region_name": "us-east-1",
        },
        "rpm": 50,
    },
]

messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a helpful assistant",
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
    {"role": "user", "content": "hello"},
]

stop_sequences = None
_router = Router(
    model_list=model_deployments,
    default_fallbacks=["claude-3-5-sonnet-aws"],
)


async def async_completion():
    response = await _router.acompletion(
        model="claude-3-5-sonnet",
        messages=messages,
        stream=True,
        stop=stop_sequences,
        stream_options={"include_usage": True},
    )

    async for chunk in response:
        pass

    print(42)

asyncio.run(async_completion())

laol777 avatar Dec 04 '24 15:12 laol777

thanks @laol777 will run this today

krrishdholakia avatar Dec 04 '24 16:12 krrishdholakia

@krrishdholakia hello, any updates on this issue?

laol777 avatar Dec 09 '24 15:12 laol777

hmm i would assume this issue is caused by the request failing while iterating through the stream.

krrishdholakia avatar Dec 17 '24 17:12 krrishdholakia

This PR was not merged into the main branch https://github.com/BerriAI/litellm/pull/5542/files

duodecanol avatar Dec 26 '24 09:12 duodecanol

I have a very similar issue where httpcore.ReadError is raised while iterating through the stream. The fallbacks are not triggered, and the PR mentioned above does not fix this. If any error happens while iterating through the stream, the request ends with an error, without any retry or fallback.

You can run this test to reproduce:

import httpcore
import pytest
from unittest.mock import patch

import litellm
from litellm import Router


@pytest.mark.asyncio
async def test_streaming_fallbacks():
    litellm.set_verbose = True

    router = Router(
        model_list=[
            {
                "model_name": "anthropic/claude-3-5-sonnet-20240620",
                "litellm_params": {
                    "model": "anthropic/claude-3-5-sonnet-20240620",
                },
            },
            {
                "model_name": "gpt-3.5-turbo",
                "litellm_params": {
                    "model": "gpt-3.5-turbo",
                    "mock_response": "This is a mock response",
                },
            },
        ],
        fallbacks=[{"anthropic/claude-3-5-sonnet-20240620": ["gpt-3.5-turbo"]}],
        num_retries=3,
    )

    # Simulate a transport error raised while iterating the Anthropic stream.
    with patch(
        "litellm.llms.anthropic.chat.handler.ModelResponseIterator.__anext__",
        side_effect=httpcore.ReadError("Simulated error"),
    ):
        response = await router.acompletion(
            model="anthropic/claude-3-5-sonnet-20240620",
            messages=[{"role": "user", "content": "Hey, how's it going?"}],
            stream=True,
        )
        async for chunk in response:
            print(chunk)

adrian-streetbeat avatar Feb 06 '25 14:02 adrian-streetbeat

hi @adrian-streetbeat would you expect a retry/fallback mid-stream?

krrishdholakia avatar Feb 06 '25 14:02 krrishdholakia

For my similar case #8632, it is set up as streaming, but the failure seems to occur before I get any data - not sure if that counts as mid-stream. In this case I would expect the fallback to work, yes.

clarity99 avatar Feb 21 '25 08:02 clarity99

Hey @clarity99

litellm-1 |   File "/usr/lib/python3.13/site-packages/litellm/proxy/proxy_server.py", line 3018, in async_data_generator
litellm-1 |     async for chunk in response:
litellm-1 |     ...<14 lines>...
litellm-1 |     yield f"data: {str(e)}\n\n"

based on your stacktrace - it looks like it happened after the stream had started

since this happens before any data - maybe this is a situation where gemini is returning the error in the first streamed response (something we should handle)

i'll try to repro this and follow up

krrishdholakia avatar Feb 21 '25 15:02 krrishdholakia

Same problem here with streaming and Bedrock. From time to time it throws:

serviceUnavailableException {"message":"Bedrock is unable to process your request."}

which does not trigger fallbacks, even though definitely no tokens were streamed.

jonas-lyrebird-health avatar Feb 23 '25 10:02 jonas-lyrebird-health

Also same here, with no output tokens streamed.

Arokha avatar Feb 26 '25 03:02 Arokha

I've picked this up -- I need help with the fallback behaviour.

When we hit an error mid-stream, the user has already iterated through a few tokens from this stream. If we retry with a fallback model, should the response continue streaming from the new response, or what is the expected behavior here?

Should we have a special case where the error is in the first chunk of the stream? Retrying with a fallback would make sense there, as the user has not consumed any tokens.

madhukar01 avatar Mar 05 '25 00:03 madhukar01

Hey @madhukar01

Why not

  • if first chunk of stream -> retry / fallback as expected (user saw nothing, so no impact)
  • if mid-stream -> require a flag (ideally one that can be passed in dynamically via litellm_params OR globally via litellm_settings); this allows the developer to opt into this behaviour

krrishdholakia avatar Mar 05 '25 01:03 krrishdholakia

Is this issue still being worked on?

My current workaround is to catch serviceUnavailableException and implement my own fallback / retry.

It is still a bit annoying since I configured my own litellm failure webhook which should only trigger if the fallback is not called.
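
For anyone else hitting this, a minimal sketch of that kind of manual fallback, assuming the error surfaces as an exception while iterating the stream (the model names are the deployments from the config earlier in this thread; this is a workaround sketch, not litellm's built-in behaviour):

async def stream_with_manual_fallback(router, messages):
    got_content = False
    try:
        response = await router.acompletion(
            model="claude-3-5-sonnet",
            messages=messages,
            stream=True,
        )
        async for chunk in response:
            got_content = True
            yield chunk
    except Exception:
        # If tokens were already streamed, re-raise to avoid sending duplicate content.
        if got_content:
            raise
        # Nothing was streamed yet (e.g. serviceUnavailableException on the first event),
        # so retry the whole request on the fallback deployment.
        response = await router.acompletion(
            model="claude-3-5-sonnet-aws",
            messages=messages,
            stream=True,
        )
        async for chunk in response:
            yield chunk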

jonas-lyrebird-health avatar Mar 13 '25 03:03 jonas-lyrebird-health

Hey @jonas-lyrebird-health a PR here is welcome, if you're open to it

krrishdholakia avatar Mar 13 '25 04:03 krrishdholakia

this one still happens, especially on gemini (ai studio) endpoints.

https://github.com/user-attachments/assets/61dde4aa-a00a-45d1-a753-4aa705c747d5

yigitkonur avatar Apr 08 '25 20:04 yigitkonur

As a lot more people are going to use AI Studio with Gemini 2.5 Pro, and as their rate limits are a serious issue, I believe many people will run into this problem. I know you guys @krrishdholakia and @ishaan-jaff are busy scaling up the company, but this really impacts our workflow, and a PR for an issue like this is not easy for someone else to write. Any chance to have a look? (I see even some Bedrock users had these mid-stream issues.) A hotfix would be really appreciated.

yigitkonur avatar Apr 08 '25 20:04 yigitkonur

@jonas-lyrebird-health @Arokha @clarity99 any workaround you've found?

yigitkonur avatar Apr 08 '25 20:04 yigitkonur

Hey @madhukar01

Why not

  • if first chunk of stream -> retry / fallback as expected (user saw nothing, so no impact)
  • if mid-stream -> require a flag (ideally one that can be passed in dynamically via litellm_params OR globally via litellm_settings); this allows the developer to opt into this behaviour

It seems that has been done by #9809 @krrishdholakia

guanbo avatar May 15 '25 06:05 guanbo

Closing as this is now fixed on main

krrishdholakia avatar Aug 02 '25 18:08 krrishdholakia

https://github.com/user-attachments/assets/1740b3c9-6f63-4c1b-82cd-6f9228d798ed

krrishdholakia avatar Aug 02 '25 18:08 krrishdholakia