[Bug]: NO FALLBACK when streaming and [Bug]: litellm.InternalServerError: AnthropicException - Overloaded. Handle with litellm.InternalServerError.
What happened?
No fallback when streaming. This is basically the same problem as https://github.com/BerriAI/litellm/issues/6532, with a very similar config:
Router(
    model_list=settings.LITELLM_MODEL_DEPLOYMENTS,
    num_retries=3,
    retry_after=5,  # waits a minimum of 5s before retrying a request
    timeout=290,
    allowed_fails=3,  # cool down a deployment if it fails more than 3 calls in a minute
    cooldown_time=10,  # cool down the deployment for 10 seconds once num_fails > allowed_fails
    default_fallbacks=["claude-3-5-sonnet-aws"],
)
# model = claude3.5 anthropic
# fallback model = claude3.5 aws bedrock
@krrishdholakia I found a way to reproduce this problem. Add these two lines of code
type_chunk = "error"
chunk["error"] = {"message": "Overload test error"}
here https://github.com/BerriAI/litellm/blob/main/litellm/llms/anthropic/chat/handler.py#L561
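For context, those two lines force every streamed chunk down the parser's error path so the failure becomes deterministic. A simplified sketch of the idea (illustrative only, not the actual litellm handler code):

# Illustration only: NOT the real litellm/llms/anthropic/chat/handler.py.
# The two injected lines make every SSE chunk look like an Anthropic "error"
# event, exercising the mid-stream error path that should trigger the
# router's retry/fallback logic but currently does not while streaming.
def chunk_parser(chunk: dict) -> dict:
    type_chunk = chunk.get("type", "")

    # injected for reproduction
    type_chunk = "error"
    chunk["error"] = {"message": "Overload test error"}

    if type_chunk == "error":
        # the real handler raises here; with the injection above, every chunk hits this branch
        raise RuntimeError(chunk["error"].get("message", "unknown error"))

    return chunk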
Then call it with this configuration:
import asyncio

from litellm import Router

model_deployments = [
    {
        "model_name": "claude-3-5-sonnet",
        "litellm_params": {
            "model": "claude-3-5-sonnet-20241022",
            "api_key": "key",
        },
        "rpm": 4000,
    },
    {
        "model_name": "claude-3-5-sonnet-aws",
        "litellm_params": {
            "model": "bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0",
            "aws_access_key_id": "key",
            "aws_secret_access_key": "key",
            "aws_region_name": "us-east-1",
        },
        "rpm": 50,
    },
]

messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a helpful assistant",
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
    {"role": "user", "content": "hello"},
]

stop_sequences = None

_router = Router(
    model_list=model_deployments,
    default_fallbacks=["claude-3-5-sonnet-aws"],
)


async def async_completion():
    response = await _router.acompletion(
        model="claude-3-5-sonnet",
        messages=messages,
        stream=True,
        stop=stop_sequences,
        stream_options={"include_usage": True},
    )
    async for chunk in response:
        pass
    print(42)


asyncio.run(async_completion())
Thanks @laol777, will run this today.
@krrishdholakia hello, any updates on this issue?
Hmm, I would assume this issue is caused by the request failing while iterating through the stream.
This PR was not merged into the main branch https://github.com/BerriAI/litellm/pull/5542/files
I have a very similar issue where httpcore.ReadError is raised while iterating through the stream and the fallbacks are not triggered. The PR mentioned above does not fix this: if any error happens while iterating through the stream, the request ends with an error, without any retry or fallback.
You can run this test to reproduce:
import httpcore
import pytest
from unittest.mock import patch

import litellm
from litellm import Router


@pytest.mark.asyncio
async def test_streaming_fallbacks():
    litellm.set_verbose = True
    router = Router(
        model_list=[
            {
                "model_name": "anthropic/claude-3-5-sonnet-20240620",
                "litellm_params": {
                    "model": "anthropic/claude-3-5-sonnet-20240620",
                },
            },
            {
                "model_name": "gpt-3.5-turbo",
                "litellm_params": {
                    "model": "gpt-3.5-turbo",
                    "mock_response": "This is a mock response",
                },
            },
        ],
        fallbacks=[{"anthropic/claude-3-5-sonnet-20240620": ["gpt-3.5-turbo"]}],
        num_retries=3,
    )
    with patch(
        "litellm.llms.anthropic.chat.handler.ModelResponseIterator.__anext__",
        side_effect=httpcore.ReadError("Simulated error"),
    ):
        response = await router.acompletion(
            model="anthropic/claude-3-5-sonnet-20240620",
            messages=[{"role": "user", "content": "Hey, how's it going?"}],
            stream=True,
        )
        async for chunk in response:
            print(chunk)
Hi @adrian-streetbeat, would you expect a retry/fallback mid-stream?
For my similar case #8632, it is set up as streaming, but the failure seems to occur before I get any data; I'm not sure if that is considered mid-stream. In this case I would expect the fallback to work, yes.
Hey @clarity99
litellm-1 |   File "/usr/lib/python3.13/site-packages/litellm/proxy/proxy_server.py", line 3018, in async_data_generator
litellm-1 |     async for chunk in response:
litellm-1 |     ...<14 lines>...
litellm-1 |     yield f"data: {str(e)}\n\n"
Based on your stacktrace, it looks like it happened after the stream had started.
Since this happens before any data, maybe this is a situation where Gemini is returning the error in the first streamed response (something we should handle).
I'll try to repro this and follow up.
Same problem here with streaming and Bedrock. From time to time it throws a
serviceUnavailableException {"message":"Bedrock is unable to process your request."}
which does not trigger fallbacks, even though definitely no tokens had been streamed yet.
Also same here, with no output tokens streamed.
I've picked this up; I need input on the expected fallback behaviour.
When we hit an error mid-stream, the user has already iterated through a few tokens from the stream. If we retry with the fallback model, should the response continue streaming from the new response, or what is the expected behavior here?
Should we have a special case where the error occurs in the first chunk of the stream? There it would make sense to retry with a fallback, since the user has not consumed any tokens.
Hey @madhukar01
Why not
- if first chunk of stream -> retry / fallback as expected (user saw nothing, so no impact)
- if mid-stream -> require flag (ideally can be passed in dynamically via litellm_params or globally via litellm_settings), this allows the developer to opt into this behaviour (rough sketch below)
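Purely as an illustration of that split, here is a sketch from the caller's side. The fallback_on_mid_stream_error flag and the stream_with_fallback wrapper are hypothetical names for this example, not an existing litellm API:

from litellm import Router


async def stream_with_fallback(
    router: Router,
    primary: str,
    fallback: str,
    messages: list,
    fallback_on_mid_stream_error: bool = False,  # hypothetical opt-in flag
):
    stream = await router.acompletion(model=primary, messages=messages, stream=True)
    stream_iter = stream.__aiter__()

    # Error before the first chunk: always fall back, the caller has seen nothing yet.
    try:
        first_chunk = await stream_iter.__anext__()
    except StopAsyncIteration:
        return
    except Exception:
        async for chunk in await router.acompletion(model=fallback, messages=messages, stream=True):
            yield chunk
        return

    yield first_chunk

    # Error mid-stream: only switch over when explicitly opted in, because the
    # caller has already consumed part of the primary model's response.
    try:
        async for chunk in stream_iter:
            yield chunk
    except Exception:
        if not fallback_on_mid_stream_error:
            raise
        async for chunk in await router.acompletion(model=fallback, messages=messages, stream=True):
            yield chunk

A caller would then iterate stream_with_fallback(router, "claude-3-5-sonnet", "claude-3-5-sonnet-aws", messages) exactly like a normal stream.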
Is this issue still being worked on?
My current workaround is to catch serviceUnavailableException and implement my own fallback / retry.
It is still a bit annoying since I configured my own litellm failure webhook which should only trigger if the fallback is not called.
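For anyone who wants the same stopgap, here is a minimal sketch of that manual fallback. It assumes the model names from the config earlier in this thread and that the failure surfaces as litellm.InternalServerError (as in the title of this issue); adjust the caught exception types to whatever your logs actually show:

import litellm
from litellm import Router


async def acompletion_with_manual_fallback(router: Router, messages: list):
    try:
        stream = await router.acompletion(
            model="claude-3-5-sonnet",
            messages=messages,
            stream=True,
        )
        async for chunk in stream:
            yield chunk
    except litellm.InternalServerError:
        # The router did not fall back on its own, so call the fallback
        # deployment directly. Note this re-sends the whole prompt, so any
        # tokens already streamed from the primary model are duplicated.
        stream = await router.acompletion(
            model="claude-3-5-sonnet-aws",
            messages=messages,
            stream=True,
        )
        async for chunk in stream:
            yield chunk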
Hey @jonas-lyrebird-health a PR here is welcome, if you're open to it
This one still happens, especially on Gemini (AI Studio) endpoints.
https://github.com/user-attachments/assets/61dde4aa-a00a-45d1-a753-4aa705c747d5
As a lot more people are going to use AI Studio with Gemini 2.5 Pro, and as its rate limits are a serious issue, I believe many people will run into this problem. I know you guys @krrishdholakia and @ishaan-jaff are busy scaling up the company, but this really impacts our workflow, and a PR for an issue like this is not easy for someone else to write. Any chance you could have a look? (I see even some Bedrock users had these mid-stream issues.) A hotfix would be really appreciated.
@jonas-lyrebird-health @Arokha @clarity99 have you found any workaround?
Hey @madhukar01
Why not
- if first chunk of stream -> retry / fallback as expected (user saw nothing, so no impact)
- if mid-stream -> require flag (ideally can be passed in dynamically via litellm_params or globally via litellm_settings), this allows the developer to opt into this behaviour
It seems that this has been done by #9809 @krrishdholakia
Closing as this is now fixed on main
https://github.com/user-attachments/assets/1740b3c9-6f63-4c1b-82cd-6f9228d798ed