[Bug]: Fallbacks not working with Ollama models when streaming is on
What happened?
When using litellm to interact with Ollama models and fallbacks are configured, the fallback mechanism does not function correctly when the stream=True option is used.
Steps to Reproduce
- Configure litellm with one Ollama model (or more in a load-balanced group) as the primary model and a fallback model (e.g., another Ollama model or an OpenAI model). Relevant config.yaml:
model_list:
  - model_name: "llama3.2:latest"
    litellm_params:
      model: "ollama/llama3.2:latest"
      api_base: "http://localhost:1234"
      api_type: "open_ai"
  - model_name: "gpt-4o-mini"
    litellm_params:
      model: "openai/gpt-4o-mini"
      api_key: "os.environ/GITHUB_API_KEY"

router_settings:
  num_retries: 0
  retry_after: 0
  allowed_fails: 1
  cooldown_time: 300
  fallbacks:
    - llama3.2:latest:
        - gpt-4o-mini

litellm_settings:
  json_logs: true
- Make a request to the litellm proxy for the Ollama model with stream=True and a fallback specified.
curl --location 'http://localhost:4000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer sk-1234' \
--data-raw '{
  "stream": true,
  "model": "llama3.2:latest",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful coding assistant"
    },
    {
      "role": "user",
      "content": "Who are you?"
    }
  ],
  "fallbacks": [
    "gpt-4o-mini"
  ],
  "num_retries": 0,
  "request_timeout": 3
}'
- Observe that the fallback model is not invoked, and the request fails, returning this response:
data: {"error": {"message": "", "type": "None", "param": "None", "code": "502"}}
It also triggers the TypeError exception shown in PR #6281.
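For reference, a minimal programmatic reproduction sketch using litellm's Router SDK, under the assumption that it applies the same fallback settings as the proxy config above (model names, api_base, and the GITHUB_API_KEY env var are taken from that config):

import asyncio
import os

from litellm import Router

model_list = [
    {
        "model_name": "llama3.2:latest",
        "litellm_params": {
            "model": "ollama/llama3.2:latest",
            "api_base": "http://localhost:1234",
        },
    },
    {
        "model_name": "gpt-4o-mini",
        "litellm_params": {
            "model": "openai/gpt-4o-mini",
            "api_key": os.environ.get("GITHUB_API_KEY"),
        },
    },
]

router = Router(
    model_list=model_list,
    fallbacks=[{"llama3.2:latest": ["gpt-4o-mini"]}],
    num_retries=0,
)

async def main():
    # Expectation: if the Ollama deployment fails, the stream should come from
    # gpt-4o-mini; observed: the stream errors out instead.
    response = await router.acompletion(
        model="llama3.2:latest",
        messages=[{"role": "user", "content": "Who are you?"}],
        stream=True,
    )
    async for chunk in response:
        print(chunk)

asyncio.run(main())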
Expected behavior
When a request triggers the fallback logic, even with stream=True, the fallback model should be seamlessly invoked, and the response should be streamed from the fallback model.
Environment:
- litellm version: 1.49.6 (from 2024-10-17)
- Python version: 3.11.8
- Operating System/install method: Debian 12 / Docker compose via main-latest branch
Notes:
- fallback works with stream=False
- the failed Ollama model is never put in cooldown
- subsequent requests also fail
Relevant log output
{"message": "litellm.proxy.proxy_server.async_data_generator(): Exception occured - b''", "level": "ERROR", "timestamp": "2024-10-17T19:29:21.683280"}
Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/starlette/responses.py", line 265, in __call__
await wrap(partial(self.listen_for_disconnect, receive))
File "/usr/local/lib/python3.11/site-packages/starlette/responses.py", line 261, in wrap
await func()
File "/usr/local/lib/python3.11/site-packages/starlette/responses.py", line 238, in listen_for_disconnect
message = await receive()
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 568, in receive
await self.message_event.wait()
File "/usr/local/lib/python3.11/asyncio/locks.py", line 213, in wait
await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7fbce820db90
During handling of the above exception, another exception occurred:
+ Exception Group Traceback (most recent call last):
| File "/usr/local/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
| result = await app( # type: ignore[func-returns-value]
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/usr/local/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
| return await self.app(scope, receive, send)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/usr/local/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
| await super().__call__(scope, receive, send)
| File "/usr/local/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
| await self.middleware_stack(scope, receive, send)
| File "/usr/local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
| raise exc
| File "/usr/local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
| await self.app(scope, receive, _send)
| File "/usr/local/lib/python3.11/site-packages/starlette/middleware/cors.py", line 85, in __call__
| await self.app(scope, receive, send)
| File "/usr/local/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
| await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
| File "/usr/local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
| raise exc
| File "/usr/local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
| await app(scope, receive, sender)
| File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 756, in __call__
| await self.middleware_stack(scope, receive, send)
| File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 776, in app
| await route.handle(scope, receive, send)
| File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 297, in handle
| await self.app(scope, receive, send)
| File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 77, in app
| await wrap_app_handling_exceptions(app, request)(scope, receive, send)
| File "/usr/local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
| raise exc
| File "/usr/local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
| await app(scope, receive, sender)
| File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 75, in app
| await response(scope, receive, send)
| File "/usr/local/lib/python3.11/site-packages/starlette/responses.py", line 258, in __call__
| async with anyio.create_task_group() as task_group:
| File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 680, in __aexit__
| raise BaseExceptionGroup(
| ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
+-+---------------- 1 ----------------
| Traceback (most recent call last):
| File "/usr/local/lib/python3.11/site-packages/litellm/proxy/proxy_server.py", line 2579, in async_data_generator
| async for chunk in response:
| File "/usr/local/lib/python3.11/site-packages/litellm/llms/ollama.py", line 443, in ollama_async_streaming
| raise e # don't use verbose_logger.exception, if exception is raised
| ^^^^^^^
| File "/usr/local/lib/python3.11/site-packages/litellm/llms/ollama.py", line 386, in ollama_async_streaming
| raise OllamaError(
| litellm.llms.ollama.OllamaError: b''
|
| During handling of the above exception, another exception occurred:
|
| Traceback (most recent call last):
| File "/usr/local/lib/python3.11/site-packages/starlette/responses.py", line 261, in wrap
| await func()
| File "/usr/local/lib/python3.11/site-packages/starlette/responses.py", line 250, in stream_response
| async for chunk in self.body_iterator:
| File "/usr/local/lib/python3.11/site-packages/litellm/proxy/proxy_server.py", line 2620, in async_data_generator
| proxy_exception = ProxyException(
| ^^^^^^^^^^^^^^^
| File "/usr/local/lib/python3.11/site-packages/litellm/proxy/_types.py", line 1839, in __init__
| "No healthy deployment available" in self.message
| TypeError: a bytes-like object is required, not 'str'
+------------------------------------
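For context on the final TypeError: ollama_async_streaming raises OllamaError with a bytes payload (b''), and the substring check in ProxyException.__init__ then compares a str against bytes. A minimal sketch of the failure mode, plus a hypothetical defensive normalization (illustration only, not the actual litellm code):

# What ollama_async_streaming raised in this case: an error whose message is bytes.
message = b""

# ProxyException.__init__ effectively does a str-in-bytes check, which raises:
try:
    "No healthy deployment available" in message
except TypeError as exc:
    print(exc)  # a bytes-like object is required, not 'str'

# Hypothetical defensive normalization (not the actual litellm fix):
def normalize_message(msg) -> str:
    return msg.decode("utf-8", errors="replace") if isinstance(msg, bytes) else str(msg)

print("No healthy deployment available" in normalize_message(message))  # False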
Has anyone confirmed this? It's a core function, at least for self-hosted Ollama models, where failures tend to be more frequent!
I’m facing a similar issue while using the API. Here’s the call I’m making:
curl --location 'http://0.0.0.0:4000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: XXXXX' \
--data '{
  "stream": true,
  "model": "gemma2:9b",
  "messages": [
    {
      "role": "user",
      "content": "How can I get goto folder option while upload box in mac os?"
    }
  ],
  "fallbacks": ["gpt-4o-mini"]
}'
Response:
curl: (18) transfer closed with outstanding read data remaining
Expected Behavior: The request should gracefully fall back to the gpt-4o-mini model when the primary model fails.
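For completeness, an equivalent call through the OpenAI Python client pointed at the litellm proxy (a sketch; base_url and api_key are placeholders, and "fallbacks" is forwarded as a litellm-specific body field via extra_body):

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:4000/v1", api_key="sk-xxxx")  # placeholder key

stream = client.chat.completions.create(
    model="gemma2:9b",
    messages=[
        {"role": "user", "content": "How can I get goto folder option while upload box in mac os?"}
    ],
    stream=True,
    # "fallbacks" is not a standard OpenAI parameter, so it goes through extra_body.
    extra_body={"fallbacks": ["gpt-4o-mini"]},
)

# Expected: chunks arrive from gpt-4o-mini once gemma2:9b fails;
# observed: the connection is closed mid-stream (curl error 18).
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")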
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.