[Bug]: Fallback models don't work for CustomLLM on streaming endpoints
What happened?
I am running a CustomLLM behind the LiteLLM proxy. When the CustomLLM is called with a streaming request and the request fails or times out, it should fall back to another model. The exception is raised, but the router never switches to the fallback. With the exact same setup, fallbacks do work for non-streaming calls. I would expect the behavior to be the same in both cases.
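For context, here is a minimal sketch of the setup (reconstructed for illustration, not my exact files): the provider name `serval-llm` and the file `litellm_model.py` come from the log below, while the class name, config keys, and fallback entry are assumptions based on the LiteLLM custom-provider docs. The handler's `astreaming` raises, and the streaming request should then fall back to another model, but never does.

```python
# litellm_model.py -- minimal sketch of the custom provider described above.
#
# Proxy config (sketch, following the custom-provider docs; keys are assumptions):
#   model_list:
#     - model_name: meta_llama_llama_3_1_8b_instruct_fake
#       litellm_params:
#         model: serval-llm/meta_llama_llama_3_1_8b_instruct_fake
#   litellm_settings:
#     custom_provider_map:
#       - provider: serval-llm
#         custom_handler: litellm_model.serval_llm
#   router_settings:
#     fallbacks: [{"meta_llama_llama_3_1_8b_instruct_fake": ["gpt4o"]}]

import litellm
from litellm import CustomLLM


class CustomLLMError(Exception):
    """Mirrors LiteLLM's own CustomLLMError (the traceback below puts it in this module)."""

    def __init__(self, status_code: int, message: str):
        self.status_code = status_code
        self.message = message
        super().__init__(self.message)


class ServalLLM(CustomLLM):
    async def acompletion(self, *args, **kwargs) -> litellm.ModelResponse:
        # Non-streaming path: raising here does trigger the configured fallback.
        raise CustomLLMError(status_code=500, message="Not implemented yet!")

    async def astreaming(self, *args, **kwargs):
        # Streaming path: this exception is raised and logged (see the traceback
        # in the log output), but the router never switches to the fallback.
        raise CustomLLMError(status_code=500, message="Not implemented yet!")
        yield  # unreachable; only here so this is an async generator


serval_llm = ServalLLM()
```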
Relevant log output
litellm-1 | INFO: Started server process [1]
litellm-1 | INFO: Waiting for application startup.
litellm-1 |
litellm-1 | #------------------------------------------------------------#
litellm-1 | # #
litellm-1 | # 'This product would be better if...' #
litellm-1 | # https://github.com/BerriAI/litellm/issues/new #
litellm-1 | # #
litellm-1 | #------------------------------------------------------------#
litellm-1 |
litellm-1 | Thank you for using LiteLLM! - Krrish & Ishaan
litellm-1 |
litellm-1 |
litellm-1 |
litellm-1 | Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
litellm-1 |
litellm-1 |
litellm-1 | LiteLLM: Proxy initialized with Config, Set models:
litellm-1 | qwen_qwen3_235b_a22b
litellm-1 | meta_llama_llama_3_1_8b_instruct
litellm-1 | meta_llama_llama_3_1_8b_instruct_fake
litellm-1 | gpt4o
litellm-1 | gpt4o_test
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:4244 - checking potential_model_names in litellm.model_cost: {'split_model': 'b50a91c389bb76883145f5df3dcdf0ab5103c0f16da0a1be09eb3562b451b8d9', 'combined_model_name': 'b50a91c389bb76883145f5df3dcdf0ab5103c0f16da0a1be09eb3562b451b8d9', 'stripped_model_name': 'b50a91c389bb76883145f5df3dcdf0ab5103c0f16da0a1be09eb3562b451b8d9', 'combined_stripped_model_name': 'b50a91c389bb76883145f5df3dcdf0ab5103c0f16da0a1be09eb3562b451b8d9', 'custom_llm_provider': None}
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:4453 - Error getting model info: This model isn't mapped yet. Add it here - https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:2052 - added/updated model=b50a91c389bb76883145f5df3dcdf0ab5103c0f16da0a1be09eb3562b451b8d9 in litellm.model_cost: b50a91c389bb76883145f5df3dcdf0ab5103c0f16da0a1be09eb3562b451b8d9
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:4244 - checking potential_model_names in litellm.model_cost: {'split_model': 'qwen_qwen3_235b_a22b', 'combined_model_name': 'serval-llm/qwen_qwen3_235b_a22b', 'stripped_model_name': 'serval-llm/qwen_qwen3_235b_a22b', 'combined_stripped_model_name': 'serval-llm/qwen_qwen3_235b_a22b', 'custom_llm_provider': 'serval-llm'}
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:4453 - Error getting model info: This model isn't mapped yet. Add it here - https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:2052 - added/updated model=serval-llm/qwen_qwen3_235b_a22b in litellm.model_cost: serval-llm/qwen_qwen3_235b_a22b
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:4244 - checking potential_model_names in litellm.model_cost: {'split_model': 'd65327edefb8f1ff6691feef7a1122b0d616fd0167c19b62a81f619dfe4f2009', 'combined_model_name': 'd65327edefb8f1ff6691feef7a1122b0d616fd0167c19b62a81f619dfe4f2009', 'stripped_model_name': 'd65327edefb8f1ff6691feef7a1122b0d616fd0167c19b62a81f619dfe4f2009', 'combined_stripped_model_name': 'd65327edefb8f1ff6691feef7a1122b0d616fd0167c19b62a81f619dfe4f2009', 'custom_llm_provider': None}
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:4453 - Error getting model info: This model isn't mapped yet. Add it here - https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:2052 - added/updated model=d65327edefb8f1ff6691feef7a1122b0d616fd0167c19b62a81f619dfe4f2009 in litellm.model_cost: d65327edefb8f1ff6691feef7a1122b0d616fd0167c19b62a81f619dfe4f2009
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:4244 - checking potential_model_names in litellm.model_cost: {'split_model': 'meta_llama_llama_3_1_8b_instruct', 'combined_model_name': 'serval-llm/meta_llama_llama_3_1_8b_instruct', 'stripped_model_name': 'serval-llm/meta_llama_llama_3_1_8b_instruct', 'combined_stripped_model_name': 'serval-llm/meta_llama_llama_3_1_8b_instruct', 'custom_llm_provider': 'serval-llm'}
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:4453 - Error getting model info: This model isn't mapped yet. Add it here - https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:2052 - added/updated model=serval-llm/meta_llama_llama_3_1_8b_instruct in litellm.model_cost: serval-llm/meta_llama_llama_3_1_8b_instruct
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:4244 - checking potential_model_names in litellm.model_cost: {'split_model': '433db61fdae9146da69d1933b30aabe56a2db9448e90927fdcc2a12e6e470fc4', 'combined_model_name': '433db61fdae9146da69d1933b30aabe56a2db9448e90927fdcc2a12e6e470fc4', 'stripped_model_name': '433db61fdae9146da69d1933b30aabe56a2db9448e90927fdcc2a12e6e470fc4', 'combined_stripped_model_name': '433db61fdae9146da69d1933b30aabe56a2db9448e90927fdcc2a12e6e470fc4', 'custom_llm_provider': None}
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:4453 - Error getting model info: This model isn't mapped yet. Add it here - https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:2052 - added/updated model=433db61fdae9146da69d1933b30aabe56a2db9448e90927fdcc2a12e6e470fc4 in litellm.model_cost: 433db61fdae9146da69d1933b30aabe56a2db9448e90927fdcc2a12e6e470fc4
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:4244 - checking potential_model_names in litellm.model_cost: {'split_model': 'meta_llama_llama_3_1_8b_instruct_fake', 'combined_model_name': 'serval-llm/meta_llama_llama_3_1_8b_instruct_fake', 'stripped_model_name': 'serval-llm/meta_llama_llama_3_1_8b_instruct_fake', 'combined_stripped_model_name': 'serval-llm/meta_llama_llama_3_1_8b_instruct_fake', 'custom_llm_provider': 'serval-llm'}
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:4453 - Error getting model info: This model isn't mapped yet. Add it here - https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:2052 - added/updated model=serval-llm/meta_llama_llama_3_1_8b_instruct_fake in litellm.model_cost: serval-llm/meta_llama_llama_3_1_8b_instruct_fake
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:4244 - checking potential_model_names in litellm.model_cost: {'split_model': 'a3d830531240aebd7863a7a3f3dfee1468a2f6992029911f0c3168b94586a16c', 'combined_model_name': 'a3d830531240aebd7863a7a3f3dfee1468a2f6992029911f0c3168b94586a16c', 'stripped_model_name': 'a3d830531240aebd7863a7a3f3dfee1468a2f6992029911f0c3168b94586a16c', 'combined_stripped_model_name': 'a3d830531240aebd7863a7a3f3dfee1468a2f6992029911f0c3168b94586a16c', 'custom_llm_provider': None}
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:4453 - Error getting model info: This model isn't mapped yet. Add it here - https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:2052 - added/updated model=a3d830531240aebd7863a7a3f3dfee1468a2f6992029911f0c3168b94586a16c in litellm.model_cost: a3d830531240aebd7863a7a3f3dfee1468a2f6992029911f0c3168b94586a16c
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:4244 - checking potential_model_names in litellm.model_cost: {'split_model': 'gpt-4o', 'combined_model_name': 'azure/gpt-4o', 'stripped_model_name': 'azure/gpt-4o', 'combined_stripped_model_name': 'azure/gpt-4o', 'custom_llm_provider': 'azure'}
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:4542 - model_info: {'key': 'azure/gpt-4o', 'max_tokens': 16384, 'max_input_tokens': 128000, 'max_output_tokens': 16384, 'input_cost_per_token': 2.5e-06, 'cache_creation_input_token_cost': None, 'cache_read_input_token_cost': 1.25e-06, 'input_cost_per_character': None, 'input_cost_per_token_above_128k_tokens': None, 'input_cost_per_token_above_200k_tokens': None, 'input_cost_per_query': None, 'input_cost_per_second': None, 'input_cost_per_audio_token': None, 'input_cost_per_token_batches': None, 'output_cost_per_token_batches': None, 'output_cost_per_token': 1e-05, 'output_cost_per_audio_token': None, 'output_cost_per_character': None, 'output_cost_per_reasoning_token': None, 'output_cost_per_token_above_128k_tokens': None, 'output_cost_per_character_above_128k_tokens': None, 'output_cost_per_token_above_200k_tokens': None, 'output_cost_per_second': None, 'output_cost_per_image': None, 'output_vector_size': None, 'litellm_provider': 'azure', 'mode': 'chat', 'supports_system_messages': None, 'supports_response_schema': True, 'supports_vision': True, 'supports_function_calling': True, 'supports_tool_choice': True, 'supports_assistant_prefill': False, 'supports_prompt_caching': True, 'supports_audio_input': False, 'supports_audio_output': False, 'supports_pdf_input': False, 'supports_embedding_image_input': False, 'supports_native_streaming': None, 'supports_web_search': False, 'supports_reasoning': False, 'search_context_cost_per_query': None, 'tpm': None, 'rpm': None}
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:2052 - added/updated model=azure/gpt-4o in litellm.model_cost: azure/gpt-4o
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:4244 - checking potential_model_names in litellm.model_cost: {'split_model': '5ed14060b96e244a34cd27b5d8cf708415fda0c5e9fe66656a2a68e564e82a2f', 'combined_model_name': '5ed14060b96e244a34cd27b5d8cf708415fda0c5e9fe66656a2a68e564e82a2f', 'stripped_model_name': '5ed14060b96e244a34cd27b5d8cf708415fda0c5e9fe66656a2a68e564e82a2f', 'combined_stripped_model_name': '5ed14060b96e244a34cd27b5d8cf708415fda0c5e9fe66656a2a68e564e82a2f', 'custom_llm_provider': None}
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:4453 - Error getting model info: This model isn't mapped yet. Add it here - https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:2052 - added/updated model=5ed14060b96e244a34cd27b5d8cf708415fda0c5e9fe66656a2a68e564e82a2f in litellm.model_cost: 5ed14060b96e244a34cd27b5d8cf708415fda0c5e9fe66656a2a68e564e82a2f
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:4244 - checking potential_model_names in litellm.model_cost: {'split_model': 'gpt4o_test', 'combined_model_name': 'azure/gpt4o_test', 'stripped_model_name': 'azure/gpt4o_test', 'combined_stripped_model_name': 'azure/gpt4o_test', 'custom_llm_provider': 'azure'}
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:4453 - Error getting model info: This model isn't mapped yet. Add it here - https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json
litellm-1 | 08:56:47 - LiteLLM:DEBUG: utils.py:2052 - added/updated model=azure/gpt4o_test in litellm.model_cost: azure/gpt4o_test
litellm-1 | 08:56:47 - LiteLLM Router:DEBUG: router.py:4562 -
litellm-1 | Initialized Model List ['qwen_qwen3_235b_a22b', 'meta_llama_llama_3_1_8b_instruct', 'meta_llama_llama_3_1_8b_instruct_fake', 'gpt4o', 'gpt4o_test']
litellm-1 | 08:56:47 - LiteLLM Router:INFO: router.py:656 - Routing strategy: simple-shuffle
litellm-1 | 08:56:47 - LiteLLM Router:DEBUG: router.py:541 - Intialized router with Routing strategy: simple-shuffle
litellm-1 |
litellm-1 | Routing enable_pre_call_checks: False
litellm-1 |
litellm-1 | Routing fallbacks: [{'meta_llama_llama_3_1_8b_instruct': ['qwen_qwen3_235b_a22b']}, {'meta_llama_llama_3_1_8b_instruct_fake': ['gpt4o']}, {'gpt4o': ['qwen_qwen3_235b_a22b']}, {'gpt4o_test': ['qwen_qwen3_235b_a22b']}]
litellm-1 |
litellm-1 | Routing content fallbacks: None
litellm-1 |
litellm-1 | Routing context window fallbacks: None
litellm-1 |
litellm-1 | Router Redis Caching=None
litellm-1 |
litellm-1 | 08:56:47 - LiteLLM Proxy:DEBUG: proxy_server.py:588 - prisma_client: None
litellm-1 | INFO: Application startup complete.
litellm-1 | INFO: Uvicorn running on http://*******:4000 (Press CTRL+C to quit)
litellm-1 | 08:56:49 - LiteLLM Proxy:DEBUG: common_request_processing.py:213 - Request received by LiteLLM:
litellm-1 | {
litellm-1 | "model": "meta_llama_llama_3_1_8b_instruct_fake",
litellm-1 | "messages": [
litellm-1 | {
litellm-1 | "role": "user",
litellm-1 | "content": "what llm are you"
litellm-1 | }
litellm-1 | ],
litellm-1 | "stream": true
litellm-1 | }
litellm-1 | 08:56:49 - LiteLLM Proxy:DEBUG: litellm_pre_call_utils.py:565 - Request Headers: Headers({'host': 'localhost:4000', 'user-agent': 'curl/8.7.1', 'accept': '*/*', 'content-type': 'application/json', 'content-length': '195'})
litellm-1 | 08:56:49 - LiteLLM Proxy:DEBUG: litellm_pre_call_utils.py:571 - receiving data: {'model': 'meta_llama_llama_3_1_8b_instruct_fake', 'messages': [{'role': 'user', 'content': 'what llm are you'}], 'stream': True, 'proxy_server_request': {'url': 'http://localhost:4000/chat/completions', 'method': 'POST', 'headers': {'host': 'localhost:4000', 'user-agent': 'curl/8.7.1', 'accept': '*/*', 'content-type': 'application/json', 'content-length': '195'}, 'body': {'model': 'meta_llama_llama_3_1_8b_instruct_fake', 'messages': [{'role': 'user', 'content': 'what llm are you'}], 'stream': True}}}
litellm-1 | 08:56:49 - LiteLLM Proxy:DEBUG: litellm_pre_call_utils.py:737 - [PROXY] returned data from litellm_pre_call_utils: {'model': 'meta_llama_llama_3_1_8b_instruct_fake', 'messages': [{'role': 'user', 'content': 'what llm are you'}], 'stream': True, 'proxy_server_request': {'url': 'http://localhost:4000/chat/completions', 'method': 'POST', 'headers': {'host': 'localhost:4000', 'user-agent': 'curl/8.7.1', 'accept': '*/*', 'content-type': 'application/json', 'content-length': '195'}, 'body': {'model': 'meta_llama_llama_3_1_8b_instruct_fake', 'messages': [{'role': 'user', 'content': 'what llm are you'}], 'stream': True}}, 'metadata': {'requester_metadata': {}, 'user_api_key_hash': None, 'user_api_key_alias': None, 'user_api_key_team_id': None, 'user_api_key_user_id': None, 'user_api_key_org_id': None, 'user_api_key_team_alias': None, 'user_api_key_end_user_id': None, 'user_api_key_user_email': None, 'user_api_key': None, 'user_api_end_user_max_budget': None, 'litellm_api_version': '1.70.0', 'global_max_parallel_requests': None, 'user_api_key_team_max_budget': None, 'user_api_key_team_spend': None, 'user_api_key_spend': 0.0, 'user_api_key_max_budget': None, 'user_api_key_model_max_budget': {}, 'user_api_key_metadata': {}, 'headers': {'host': 'localhost:4000', 'user-agent': 'curl/8.7.1', 'accept': '*/*', 'content-type': 'application/json', 'content-length': '195'}, 'endpoint': 'http://localhost:4000/chat/completions', 'litellm_parent_otel_span': None, 'requester_ip_address': ''}}
litellm-1 | 08:56:49 - LiteLLM Proxy:DEBUG: utils.py:524 - Inside Proxy Logging Pre-call hook!
litellm-1 | 08:56:49 - LiteLLM Proxy:DEBUG: max_budget_limiter.py:23 - Inside Max Budget Limiter Pre-Call Hook
litellm-1 | 08:56:49 - LiteLLM Proxy:DEBUG: parallel_request_limiter.py:48 - Inside Max Parallel Request Pre-Call Hook
litellm-1 | 08:56:49 - LiteLLM Proxy:DEBUG: cache_control_check.py:27 - Inside Cache Control Check Pre-Call Hook
litellm-1 | 08:56:49 - LiteLLM:DEBUG: utils.py:336 - Initialized litellm callbacks, Async Success Callbacks: [<bound method Router.deployment_callback_on_success of <litellm.router.Router object at 0x7ffff5aa0980>>, <litellm.proxy.hooks.model_max_budget_limiter._PROXY_VirtualKeyModelMaxBudgetLimiter object at 0x7ffff66a8590>, <litellm.proxy.hooks.max_budget_limiter._PROXY_MaxBudgetLimiter object at 0x7ffff5b634d0>, <litellm.proxy.hooks.parallel_request_limiter._PROXY_MaxParallelRequestsHandler object at 0x7ffff5b63610>, <litellm.proxy.hooks.cache_control_check._PROXY_CacheControlCheck object at 0x7ffff5b63390>, <litellm._service_logger.ServiceLogging object at 0x7ffff6716060>]
litellm-1 | 08:56:49 - LiteLLM:DEBUG: litellm_logging.py:458 - self.optional_params: {}
litellm-1 | 08:56:49 - LiteLLM Router:DEBUG: router.py:3547 - Inside async function with retries.
litellm-1 | 08:56:49 - LiteLLM Router:DEBUG: router.py:3567 - async function w/ retries: original_function - <bound method Router._acompletion of <litellm.router.Router object at 0x7ffff5aa0980>>, num_retries - 2
litellm-1 | 08:56:49 - LiteLLM Router:DEBUG: router.py:984 - Inside _acompletion()- model: meta_llama_llama_3_1_8b_instruct_fake; kwargs: {'proxy_server_request': {'url': 'http://localhost:4000/chat/completions', 'method': 'POST', 'headers': {'host': 'localhost:4000', 'user-agent': 'curl/8.7.1', 'accept': '*/*', 'content-type': 'application/json', 'content-length': '195'}, 'body': {'model': 'meta_llama_llama_3_1_8b_instruct_fake', 'messages': [{'role': 'user', 'content': 'what llm are you'}], 'stream': True}}, 'metadata': {'requester_metadata': {}, 'user_api_key_hash': None, 'user_api_key_alias': None, 'user_api_key_team_id': None, 'user_api_key_user_id': None, 'user_api_key_org_id': None, 'user_api_key_team_alias': None, 'user_api_key_end_user_id': None, 'user_api_key_user_email': None, 'user_api_key': None, 'user_api_end_user_max_budget': None, 'litellm_api_version': '1.70.0', 'global_max_parallel_requests': None, 'user_api_key_team_max_budget': None, 'user_api_key_team_spend': None, 'user_api_key_spend': 0.0, 'user_api_key_max_budget': None, 'user_api_key_model_max_budget': {}, 'user_api_key_metadata': {}, 'headers': {'host': 'localhost:4000', 'user-agent': 'curl/8.7.1', 'accept': '*/*', 'content-type': 'application/json', 'content-length': '195'}, 'endpoint': 'http://localhost:4000/chat/completions', 'litellm_parent_otel_span': None, 'requester_ip_address': '', 'model_group': 'meta_llama_llama_3_1_8b_instruct_fake', 'model_group_size': 1}, 'litellm_call_id': '************************************', 'litellm_logging_obj': <litellm.litellm_core_utils.litellm_logging.Logging object at 0x7ffff5aa34d0>, 'stream': True, 'litellm_trace_id': '************************************'}
litellm-1 | 08:56:49 - LiteLLM Router:DEBUG: router.py:6012 - initial list of deployments: [{'model_name': 'meta_llama_llama_3_1_8b_instruct_fake', 'litellm_params': {'use_in_pass_through': False, 'use_litellm_proxy': False, 'merge_reasoning_content_in_choices': False, 'model': 'serval-llm/meta_llama_llama_3_1_8b_instruct_fake', 'api_url': '**************'}, 'model_info': {'id': '433db61fdae9146da69d1933b30aabe56a2db9448e90927fdcc2a12e6e470fc4', 'db_model': False}}]
litellm-1 | 08:56:49 - LiteLLM Router:DEBUG: cooldown_handlers.py:326 - retrieve cooldown models: []
litellm-1 | 08:56:49 - LiteLLM Router:DEBUG: router.py:6062 - async cooldown deployments: []
litellm-1 | 08:56:49 - LiteLLM Router:DEBUG: router.py:6065 - cooldown_deployments: []
litellm-1 | 08:56:49 - LiteLLM Router:DEBUG: router.py:6385 - cooldown deployments: []
litellm-1 | 08:56:49 - LiteLLM:DEBUG: utils.py:336 -
litellm-1 |
litellm-1 | 08:56:49 - LiteLLM:DEBUG: utils.py:336 - Request to litellm:
litellm-1 | 08:56:49 - LiteLLM:DEBUG: utils.py:336 - litellm.acompletion(use_in_pass_through=False, use_litellm_proxy=False, merge_reasoning_content_in_choices=False, model='serval-llm/meta_llama_llama_3_1_8b_instruct_fake', api_url='****************', messages=[{'role': 'user', 'content': 'what llm are you'}], caching=False, client=None, proxy_server_request={'url': 'http://localhost:4000/chat/completions', 'method': 'POST', 'headers': {'host': 'localhost:4000', 'user-agent': 'curl/8.7.1', 'accept': '*/*', 'content-type': 'application/json', 'content-length': '195'}, 'body': {'model': 'meta_llama_llama_3_1_8b_instruct_fake', 'messages': [{'role': 'user', 'content': 'what llm are you'}], 'stream': True}}, metadata={'requester_metadata': {}, 'user_api_key_hash': None, 'user_api_key_alias': None, 'user_api_key_team_id': None, 'user_api_key_user_id': None, 'user_api_key_org_id': None, 'user_api_key_team_alias': None, 'user_api_key_end_user_id': None, 'user_api_key_user_email': None, 'user_api_key': None, 'user_api_end_user_max_budget': None, 'litellm_api_version': '1.70.0', 'global_max_parallel_requests': None, 'user_api_key_team_max_budget': None, 'user_api_key_team_spend': None, 'user_api_key_spend': 0.0, 'user_api_key_max_budget': None, 'user_api_key_model_max_budget': {}, 'user_api_key_metadata': {}, 'headers': {'host': 'localhost:4000', 'user-agent': 'curl/8.7.1', 'accept': '*/*', 'content-type': 'application/json', 'content-length': '195'}, 'endpoint': 'http://localhost:4000/chat/completions', 'litellm_parent_otel_span': None, 'requester_ip_address': '', 'model_group': 'meta_llama_llama_3_1_8b_instruct_fake', 'model_group_size': 1, 'deployment': 'serval-llm/meta_llama_llama_3_1_8b_instruct_fake', 'model_info': {'id': '433db61fdae9146da69d1933b30aabe56a2db9448e90927fdcc2a12e6e470fc4', 'db_model': False}, 'api_base': None, 'caching_groups': None}, litellm_call_id='************************************', litellm_logging_obj=<litellm.litellm_core_utils.litellm_logging.Logging object at 0x7ffff5aa34d0>, stream=True, litellm_trace_id='************************************', model_info={'id': '433db61fdae9146da69d1933b30aabe56a2db9448e90927fdcc2a12e6e470fc4', 'db_model': False}, timeout=10, max_retries=0)
litellm-1 | 08:56:49 - LiteLLM:DEBUG: utils.py:336 -
litellm-1 |
litellm-1 | 08:56:49 - LiteLLM:DEBUG: utils.py:336 - ASYNC kwargs[caching]: False; litellm.cache: None; kwargs.get('cache'): None
litellm-1 | 08:56:49 - LiteLLM:DEBUG: caching_handler.py:211 - CACHE RESULT: None
litellm-1 | 08:56:49 - LiteLLM:INFO: utils.py:2904 -
litellm-1 | LiteLLM completion() model= meta_llama_llama_3_1_8b_instruct_fake; provider = serval-llm
litellm-1 | 08:56:49 - LiteLLM:DEBUG: utils.py:2907 -
litellm-1 | LiteLLM: Params passed to completion() {'model': 'meta_llama_llama_3_1_8b_instruct_fake', 'functions': None, 'function_call': None, 'temperature': None, 'top_p': None, 'n': None, 'stream': True, 'stream_options': None, 'stop': None, 'max_tokens': None, 'max_completion_tokens': None, 'modalities': None, 'prediction': None, 'audio': None, 'presence_penalty': None, 'frequency_penalty': None, 'logit_bias': None, 'user': None, 'custom_llm_provider': 'serval-llm', 'response_format': None, 'seed': None, 'tools': None, 'tool_choice': None, 'max_retries': 0, 'logprobs': None, 'top_logprobs': None, 'extra_headers': None, 'api_version': None, 'parallel_tool_calls': None, 'drop_params': None, 'allowed_openai_params': None, 'reasoning_effort': None, 'additional_drop_params': None, 'messages': [{'role': 'user', 'content': 'what llm are you'}], 'thinking': None, 'web_search_options': None, 'api_url': '*************'}
litellm-1 | 08:56:49 - LiteLLM:DEBUG: utils.py:2910 -
litellm-1 | LiteLLM: Non-Default params passed to completion() {'stream': True, 'max_retries': 0}
litellm-1 | 08:56:49 - LiteLLM:DEBUG: utils.py:336 - Final returned optional params: {'stream': True, 'max_retries': 0, 'api_url': '*************'}
litellm-1 | 08:56:49 - LiteLLM:DEBUG: litellm_logging.py:458 - self.optional_params: {'stream': True, 'max_retries': 0, 'api_url': '*************'}
litellm-1 | 08:56:49 - LiteLLM:INFO: cost_calculator.py:655 - selected model name for cost calculation: serval-llm/serval-llm/meta_llama_llama_3_1_8b_instruct_fake
litellm-1 | 08:56:49 - LiteLLM:DEBUG: token_counter.py:365 - messages in token_counter: None, text in token_counter:
litellm-1 | 08:56:49 - LiteLLM:DEBUG: utils.py:4244 - checking potential_model_names in litellm.model_cost: {'split_model': 'meta_llama_llama_3_1_8b_instruct_fake', 'combined_model_name': 'serval-llm/meta_llama_llama_3_1_8b_instruct_fake', 'stripped_model_name': 'meta_llama_llama_3_1_8b_instruct_fake', 'combined_stripped_model_name': 'serval-llm/meta_llama_llama_3_1_8b_instruct_fake', 'custom_llm_provider': 'serval-llm'}
litellm-1 | 08:56:49 - LiteLLM:DEBUG: utils.py:4341 - model=serval-llm/meta_llama_llama_3_1_8b_instruct_fake, custom_llm_provider=serval-llm has no input_cost_per_token in model_cost_map. Defaulting to 0.
litellm-1 | 08:56:49 - LiteLLM:DEBUG: utils.py:4353 - model=serval-llm/meta_llama_llama_3_1_8b_instruct_fake, custom_llm_provider=serval-llm has no output_cost_per_token in model_cost_map. Defaulting to 0.
litellm-1 | 08:56:49 - LiteLLM:DEBUG: cost_calculator.py:376 - Returned custom cost for model=serval-llm/meta_llama_llama_3_1_8b_instruct_fake - prompt_tokens_cost_usd_dollar: 0, completion_tokens_cost_usd_dollar: 0
litellm-1 | 08:56:49 - LiteLLM:DEBUG: litellm_logging.py:1124 - response_cost: 0.0
litellm-1 | 08:56:49 - LiteLLM Router:INFO: router.py:1081 - litellm.acompletion(model=serval-llm/meta_llama_llama_3_1_8b_instruct_fake) 200 OK
litellm-1 | 08:56:49 - LiteLLM Router:DEBUG: router.py:3307 - Async Response: <litellm.litellm_core_utils.streaming_handler.CustomStreamWrapper object at 0x7ffff5aa3b60>
litellm-1 | 08:56:49 - LiteLLM Proxy:DEBUG: proxy_server.py:3011 - inside generator
litellm-1 | 08:56:49 - LiteLLM:DEBUG: litellm_logging.py:2159 - Logging Details LiteLLM-Failure Call: [<bound method Router.deployment_callback_on_failure of <litellm.router.Router object at 0x7ffff5aa0980>>, <litellm.proxy.hooks.model_max_budget_limiter._PROXY_VirtualKeyModelMaxBudgetLimiter object at 0x7ffff66a8590>, <litellm.proxy.hooks.max_budget_limiter._PROXY_MaxBudgetLimiter object at 0x7ffff5b634d0>, <litellm.proxy.hooks.parallel_request_limiter._PROXY_MaxParallelRequestsHandler object at 0x7ffff5b63610>, <litellm.proxy.hooks.cache_control_check._PROXY_CacheControlCheck object at 0x7ffff5b63390>, <litellm._service_logger.ServiceLogging object at 0x7ffff6716060>]
litellm-1 | 08:56:49 - LiteLLM:DEBUG: get_api_base.py:62 - Error occurred in getting api base - litellm.BadRequestError: LLM Provider NOT provided. Pass in the LLM provider you are trying to call. You passed model=meta_llama_llama_3_1_8b_instruct_fake
litellm-1 | Pass model as E.g. For 'Huggingface' inference endpoints pass in `completion(model='huggingface/starcoder',..)` Learn more: https://docs.litellm.ai/docs/providers
litellm-1 | 08:56:49 - LiteLLM:DEBUG: exception_mapping_utils.py:2261 - Logging Details: logger_fn - None | callable(logger_fn) - False
litellm-1 | 08:56:49 - LiteLLM Proxy:ERROR: proxy_server.py:3038 - litellm.proxy.proxy_server.async_data_generator(): Exception occured - litellm.APIConnectionError: Not implemented yet!
litellm-1 | Traceback (most recent call last):
litellm-1 | File "/home/xxx/app/lib/python3.13/site-packages/litellm/litellm_core_utils/streaming_handler.py", line 1670, in __anext__
litellm-1 | async for chunk in self.completion_stream:
litellm-1 | ...<48 lines>...
litellm-1 | return processed_chunk
litellm-1 | File "/home/xxx/app/litellm_model.py", line 185, in astreaming
litellm-1 | raise CustomLLMError(status_code=500, message="Not implemented yet!")
litellm-1 | litellm_model.CustomLLMError: Not implemented yet!
litellm-1 | Traceback (most recent call last):
litellm-1 | File "/home/xxx/app/lib/python3.13/site-packages/litellm/litellm_core_utils/streaming_handler.py", line 1670, in __anext__
litellm-1 | async for chunk in self.completion_stream:
litellm-1 | ...<48 lines>...
litellm-1 | return processed_chunk
litellm-1 | File "/home/xxx/app/litellm_model.py", line 185, in astreaming
litellm-1 | raise CustomLLMError(status_code=500, message="Not implemented yet!")
litellm-1 | litellm_model.CustomLLMError: Not implemented yet!
litellm-1 |
litellm-1 | During handling of the above exception, another exception occurred:
litellm-1 |
litellm-1 | Traceback (most recent call last):
litellm-1 | File "/home/xxx/app/lib/python3.13/site-packages/litellm/proxy/proxy_server.py", line 3013, in async_data_generator
litellm-1 | async for chunk in proxy_logging_obj.async_post_call_streaming_iterator_hook(
litellm-1 | ...<18 lines>...
litellm-1 | yield f"data: {str(e)}\n\n"
litellm-1 | File "/home/xxx/app/lib/python3.13/site-packages/litellm/integrations/custom_logger.py", line 288, in async_post_call_streaming_iterator_hook
litellm-1 | async for item in response:
litellm-1 | yield item
litellm-1 | File "/home/xxx/app/lib/python3.13/site-packages/litellm/integrations/custom_logger.py", line 288, in async_post_call_streaming_iterator_hook
litellm-1 | async for item in response:
litellm-1 | yield item
litellm-1 | File "/home/xxx/app/lib/python3.13/site-packages/litellm/integrations/custom_logger.py", line 288, in async_post_call_streaming_iterator_hook
litellm-1 | async for item in response:
litellm-1 | yield item
litellm-1 | [Previous line repeated 2 more times]
litellm-1 | File "/home/xxx/app/lib/python3.13/site-packages/litellm/litellm_core_utils/streaming_handler.py", line 1831, in __anext__
litellm-1 | raise exception_type(
litellm-1 | ~~~~~~~~~~~~~~^
litellm-1 | model=self.model,
litellm-1 | ^^^^^^^^^^^^^^^^^
litellm-1 | ...<3 lines>...
litellm-1 | extra_kwargs={},
litellm-1 | ^^^^^^^^^^^^^^^^
litellm-1 | )
litellm-1 | ^
litellm-1 | File "/home/xxx/app/lib/python3.13/site-packages/litellm/litellm_core_utils/exception_mapping_utils.py", line 2232, in exception_type
litellm-1 | raise e
litellm-1 | File "/home/xxx/app/lib/python3.13/site-packages/litellm/litellm_core_utils/exception_mapping_utils.py", line 2208, in exception_type
litellm-1 | raise APIConnectionError(
litellm-1 | ...<8 lines>...
litellm-1 | )
litellm-1 | litellm.exceptions.APIConnectionError: litellm.APIConnectionError: Not implemented yet!
litellm-1 | Traceback (most recent call last):
litellm-1 | File "/home/xxx/app/lib/python3.13/site-packages/litellm/litellm_core_utils/streaming_handler.py", line 1670, in __anext__
litellm-1 | async for chunk in self.completion_stream:
litellm-1 | ...<48 lines>...
litellm-1 | return processed_chunk
litellm-1 | File "/home/xxx/app/litellm_model.py", line 185, in astreaming
litellm-1 | raise CustomLLMError(status_code=500, message="Not implemented yet!")
litellm-1 | litellm_model.CustomLLMError: Not implemented yet!
litellm-1 |
litellm-1 | 08:56:49 - LiteLLM Router:DEBUG: router.py:4019 - Router: Entering 'deployment_callback_on_failure'
litellm-1 | 08:56:49 - LiteLLM Proxy:DEBUG: proxy_server.py:3048 - An error occurred: litellm.APIConnectionError: Not implemented yet!
litellm-1 | Traceback (most recent call last):
litellm-1 | File "/home/xxx/app/lib/python3.13/site-packages/litellm/litellm_core_utils/streaming_handler.py", line 1670, in __anext__
litellm-1 | async for chunk in self.completion_stream:
litellm-1 | ...<48 lines>...
litellm-1 | return processed_chunk
litellm-1 | File "/home/xxx/app/litellm_model.py", line 185, in astreaming
litellm-1 | raise CustomLLMError(status_code=500, message="Not implemented yet!")
litellm-1 | litellm_model.CustomLLMError: Not implemented yet!
litellm-1 |
litellm-1 |
litellm-1 | Debug this by setting `--debug`, e.g. `litellm --model gpt-3.5-turbo --debug`
litellm-1 | 08:56:49 - LiteLLM Router:DEBUG: cooldown_handlers.py:260 - checks 'should_run_cooldown_logic'
litellm-1 | 08:56:49 - LiteLLM Router:DEBUG: cooldown_handlers.py:274 - Attempting to add 433db61fdae9146da69d1933b30aabe56a2db9448e90927fdcc2a12e6e470fc4 to cooldown list
litellm-1 | 08:56:49 - LiteLLM Router:DEBUG: cooldown_handlers.py:199 - percent fails for deployment = 433db61fdae9146da69d1933b30aabe56a2db9448e90927fdcc2a12e6e470fc4, percent fails = 1.0, num successes = 0, num fails = 1
litellm-1 | INFO: ************:51493 - "POST /chat/completions HTTP/1.1" 200 OK
litellm-1 | 08:56:49 - LiteLLM Proxy:DEBUG: parallel_request_limiter.py:48 - Inside Max Parallel Request Failure Hook
litellm-1 | 08:56:49 - LiteLLM Proxy:DEBUG: parallel_request_limiter.py:48 - user_api_key: None
Are you a ML Ops Team?
Yes
What LiteLLM version are you on ?
v1.69.0.patch1
Twitter / LinkedIn details
No response
Same error here, on version 1.74.9. I think the reason is that, in non-streaming mode, `await init_response` raises the exception inside `async_function_with_fallbacks`, where it is caught and triggers the fallback. https://github.com/BerriAI/litellm/blob/c99277c51736a331d1deeb49339adba997fa1b42/litellm/main.py#L544 In streaming mode, however, `init_response` is an instance of `CustomStreamWrapper`, and the exception is only raised later, while the stream is iterated in the following function, which runs outside of `async_function_with_fallbacks` (see the toy sketch below). https://github.com/BerriAI/litellm/blob/c99277c51736a331d1deeb49339adba997fa1b42/litellm/proxy/proxy_server.py#L3462 @krrishdholakia Could you please take a look and advise on how this can be addressed?
Related code: https://github.com/BerriAI/litellm/blob/c99277c51736a331d1deeb49339adba997fa1b42/litellm/main.py#L3485-L3516
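To illustrate the mechanics, a toy sketch of my own (not LiteLLM's actual code; `with_fallbacks` only stands in for `Router.async_function_with_fallbacks` and `ProviderError` for the custom handler's error): with a plain coroutine the exception surfaces at the `await` inside the wrapper, so the fallback runs; with an async generator the same exception only surfaces once the consumer iterates the stream, after the wrapper has already returned.

```python
import asyncio


class ProviderError(Exception):
    """Stand-in for the CustomLLMError raised by the custom handler."""


async def non_streaming_call():
    # Non-streaming path: the failure happens while this coroutine is awaited.
    raise ProviderError("Not implemented yet!")


async def streaming_call():
    # Streaming path: creating the generator succeeds; the failure only
    # happens once something iterates it.
    raise ProviderError("Not implemented yet!")
    yield  # unreachable; only here so this is an async generator


async def acompletion_streaming():
    # Mimics acompletion(stream=True): returns the stream object without
    # reading from it, so nothing raises at this point.
    return streaming_call()


async def fallback_call():
    return "response from fallback model"


async def with_fallbacks(primary, fallback):
    """Toy version of a fallback wrapper: catch the error, call the fallback."""
    try:
        return await primary()
    except ProviderError:
        print("primary failed inside the wrapper -> switching to fallback")
        return await fallback()


async def main():
    # Non-streaming: the error is raised at `await primary()`, the wrapper
    # catches it, and the fallback runs.
    print(await with_fallbacks(non_streaming_call, fallback_call))

    # Streaming: the wrapper just gets the stream object back, sees no error,
    # and never considers the fallback. The error escapes later, while the
    # consumer iterates the stream (in LiteLLM that consumer is the proxy's
    # async_data_generator).
    stream = await with_fallbacks(acompletion_streaming, fallback_call)
    try:
        async for chunk in stream:
            print(chunk)
    except ProviderError as err:
        print("error surfaced only during iteration, after the wrapper returned:", err)


asyncio.run(main())
```

Running this prints the fallback response for the non-streaming case and only an iteration-time error for the streaming case, which matches what the proxy log above shows.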
Hi @ishaan-jaff, could you please take a look and advise on how this can be addressed?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
bump