[Bug]: Fallbacks not working with Ollama models when streaming is on
What happened?
When using litellm to interact with Ollama models and fallbacks are configured, the fallback mechanism does not function correctly when the stream=True option is used.
Steps to Reproduce
- Configure litellm with one Ollama model (or more in a load-balanced group) as the primary model and a fallback model (e.g., another Ollama model or an OpenAI model). Relevant config.yaml:
model_list:
  - model_name: "llama3.2:latest"
    litellm_params:
      model: "ollama/llama3.2:latest"
      api_base: "http://localhost:1234"
      api_type: "open_ai"
  - model_name: "gpt-4o-mini"
    litellm_params:
      model: "openai/gpt-4o-mini"
      api_key: "os.environ/GITHUB_API_KEY"

router_settings:
  num_retries: 0
  retry_after: 0
  allowed_fails: 1
  cooldown_time: 300
  fallbacks:
    - llama3.2:latest:
        - gpt-4o-mini

litellm_settings:
  json_logs: true
- Make a request to the litellm proxy for the Ollama model with stream=True and a fallback specified.
curl --location 'http://localhost:4000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer sk-1234' \
--data-raw '{
  "stream": true,
  "model": "llama3.2:latest",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful coding assistant"
    },
    {
      "role": "user",
      "content": "Who are you?"
    }
  ],
  "fallbacks": [
    "gpt-4o-mini"
  ],
  "num_retries": 0,
  "request_timeout": 3
}'
- Observe that the fallback model is not invoked, and the request fails, returning this response:
data: {"error": {"message": "", "type": "None", "param": "None", "code": "502"}}
It also triggers the TypeError exception shown in PR #6281.
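For reference, a minimal programmatic reproduction sketch using litellm's Router SDK, under the assumption that it applies the same fallback settings as the proxy config above (model names, api_base, and the GITHUB_API_KEY env var are taken from that config):

import asyncio
import os

from litellm import Router

model_list = [
    {
        "model_name": "llama3.2:latest",
        "litellm_params": {
            "model": "ollama/llama3.2:latest",
            "api_base": "http://localhost:1234",
        },
    },
    {
        "model_name": "gpt-4o-mini",
        "litellm_params": {
            "model": "openai/gpt-4o-mini",
            "api_key": os.environ.get("GITHUB_API_KEY"),
        },
    },
]

router = Router(
    model_list=model_list,
    fallbacks=[{"llama3.2:latest": ["gpt-4o-mini"]}],
    num_retries=0,
)

async def main():
    # Expectation: if the Ollama deployment fails, the stream should come from
    # gpt-4o-mini; observed: the stream errors out instead.
    response = await router.acompletion(
        model="llama3.2:latest",
        messages=[{"role": "user", "content": "Who are you?"}],
        stream=True,
    )
    async for chunk in response:
        print(chunk)

asyncio.run(main())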
Expected behavior
When a request triggers the fallback logic, even with stream=True, the fallback model should be seamlessly invoked, and the response should be streamed from the fallback model.
Environment:
- litellm version: 1.49.6 (from 2024-10-17)
- Python version: 3.11.8
- Operating System/install method: Debian 12 / Docker compose via main-latest branch
Notes:
- fallback works with stream=False
- the failed Ollama model is never put in cooldown
- subsequent requests also fail
Relevant log output
{"message": "litellm.proxy.proxy_server.async_data_generator(): Exception occured - b''", "level": "ERROR", "timestamp": "2024-10-17T19:29:21.683280"}
Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/starlette/responses.py", line 265, in __call__
await wrap(partial(self.listen_for_disconnect, receive))
File "/usr/local/lib/python3.11/site-packages/starlette/responses.py", line 261, in wrap
await func()
File "/usr/local/lib/python3.11/site-packages/starlette/responses.py", line 238, in listen_for_disconnect
message = await receive()
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 568, in receive
await self.message_event.wait()
File "/usr/local/lib/python3.11/asyncio/locks.py", line 213, in wait
await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7fbce820db90
During handling of the above exception, another exception occurred:
+ Exception Group Traceback (most recent call last):
| File "/usr/local/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
| result = await app( # type: ignore[func-returns-value]
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/usr/local/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
| return await self.app(scope, receive, send)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/usr/local/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
| await super().__call__(scope, receive, send)
| File "/usr/local/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
| await self.middleware_stack(scope, receive, send)
| File "/usr/local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
| raise exc
| File "/usr/local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
| await self.app(scope, receive, _send)
| File "/usr/local/lib/python3.11/site-packages/starlette/middleware/cors.py", line 85, in __call__
| await self.app(scope, receive, send)
| File "/usr/local/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
| await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
| File "/usr/local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
| raise exc
| File "/usr/local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
| await app(scope, receive, sender)
| File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 756, in __call__
| await self.middleware_stack(scope, receive, send)
| File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 776, in app
| await route.handle(scope, receive, send)
| File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 297, in handle
| await self.app(scope, receive, send)
| File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 77, in app
| await wrap_app_handling_exceptions(app, request)(scope, receive, send)
| File "/usr/local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
| raise exc
| File "/usr/local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
| await app(scope, receive, sender)
| File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 75, in app
| await response(scope, receive, send)
| File "/usr/local/lib/python3.11/site-packages/starlette/responses.py", line 258, in __call__
| async with anyio.create_task_group() as task_group:
| File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 680, in __aexit__
| raise BaseExceptionGroup(
| ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
+-+---------------- 1 ----------------
| Traceback (most recent call last):
| File "/usr/local/lib/python3.11/site-packages/litellm/proxy/proxy_server.py", line 2579, in async_data_generator
| async for chunk in response:
| File "/usr/local/lib/python3.11/site-packages/litellm/llms/ollama.py", line 443, in ollama_async_streaming
| raise e # don't use verbose_logger.exception, if exception is raised
| ^^^^^^^
| File "/usr/local/lib/python3.11/site-packages/litellm/llms/ollama.py", line 386, in ollama_async_streaming
| raise OllamaError(
| litellm.llms.ollama.OllamaError: b''
|
| During handling of the above exception, another exception occurred:
|
| Traceback (most recent call last):
| File "/usr/local/lib/python3.11/site-packages/starlette/responses.py", line 261, in wrap
| await func()
| File "/usr/local/lib/python3.11/site-packages/starlette/responses.py", line 250, in stream_response
| async for chunk in self.body_iterator:
| File "/usr/local/lib/python3.11/site-packages/litellm/proxy/proxy_server.py", line 2620, in async_data_generator
| proxy_exception = ProxyException(
| ^^^^^^^^^^^^^^^
| File "/usr/local/lib/python3.11/site-packages/litellm/proxy/_types.py", line 1839, in __init__
| "No healthy deployment available" in self.message
| TypeError: a bytes-like object is required, not 'str'
+------------------------------------
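For context on the final TypeError: ollama_async_streaming raises OllamaError with a bytes payload (b''), and the substring check in ProxyException.__init__ then compares a str against bytes. A minimal sketch of the failure mode, plus a hypothetical defensive normalization (illustration only, not the actual litellm code):

# What ollama_async_streaming raised in this case: an error whose message is bytes.
message = b""

# ProxyException.__init__ effectively does a str-in-bytes check, which raises:
try:
    "No healthy deployment available" in message
except TypeError as exc:
    print(exc)  # a bytes-like object is required, not 'str'

# Hypothetical defensive normalization (not the actual litellm fix):
def normalize_message(msg) -> str:
    return msg.decode("utf-8", errors="replace") if isinstance(msg, bytes) else str(msg)

print("No healthy deployment available" in normalize_message(message))  # False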
Has anyone confirmed this? It's a core function, at least for self-hosted Ollama models, where failures tend to be more frequent!
I’m facing a similar issue while using the API. Here’s the call I’m making:
curl --location 'http://0.0.0.0:4000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: XXXXX' \
--data '{
  "stream": true,
  "model": "gemma2:9b",
  "messages": [
    {
      "role": "user",
      "content": "How can I get goto folder option while upload box in mac os?"
    }
  ],
  "fallbacks": ["gpt-4o-mini"]
}'
Response:
curl: (18) transfer closed with outstanding read data remaining
Expected Behavior: The request should gracefully fall back to the gpt-4o-mini model when the primary model fails.
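For completeness, an equivalent call through the OpenAI Python client pointed at the litellm proxy (a sketch; base_url and api_key are placeholders, and "fallbacks" is forwarded as a litellm-specific body field via extra_body):

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:4000/v1", api_key="sk-xxxx")  # placeholder key

stream = client.chat.completions.create(
    model="gemma2:9b",
    messages=[
        {"role": "user", "content": "How can I get goto folder option while upload box in mac os?"}
    ],
    stream=True,
    # "fallbacks" is not a standard OpenAI parameter, so it goes through extra_body.
    extra_body={"fallbacks": ["gpt-4o-mini"]},
)

# Expected: chunks arrive from gpt-4o-mini once gemma2:9b fails;
# observed: the connection is closed mid-stream (curl error 18).
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")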
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.