
[Bug]: Fallback models don't work for CustomLLM on streaming endpoints

Open hahamark1 opened this issue 7 months ago • 5 comments

What happened?

I am working on a CustomLLM in a proxy server, where the CustomLLM is called with a streaming request and should fall back to another model if the request fails or times out. The exceptions are raised, but the router does not switch to the fallback. For the same setup, the fallback does work for non-streaming calls. I feel the behavior should be the same.
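
For reference, the custom handler's async streaming path looks roughly like this (a simplified sketch, not my exact litellm_model.py; the class names here are illustrative, but it raises the same CustomLLMError seen in the logs below):

from litellm import CustomLLM


class CustomLLMError(Exception):
    """Error type raised by the custom handler (as in litellm_model.py)."""

    def __init__(self, status_code: int, message: str):
        self.status_code = status_code
        self.message = message
        super().__init__(message)


class ServalLLM(CustomLLM):
    async def astreaming(self, *args, **kwargs):
        # Simulates a failing / timed-out upstream call: the exception is
        # raised, but the router never switches to the configured fallback.
        raise CustomLLMError(status_code=500, message="Not implemented yet!")
        yield  # unreachable; only here so this is an async generator

The handler is registered as the serval-llm custom provider (via litellm's custom_provider_map in the proxy config), and fallbacks are configured per model group (e.g. meta_llama_llama_3_1_8b_instruct_fake -> gpt4o), as visible in the router startup logs below.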

Relevant log output

litellm-1  | INFO:     Started server process [1]
litellm-1  | INFO:     Waiting for application startup.
litellm-1  | 
litellm-1  | #------------------------------------------------------------#
litellm-1  | #                                                            #
litellm-1  | #            'This product would be better if...'             #
litellm-1  | #        https://github.com/BerriAI/litellm/issues/new        #
litellm-1  | #                                                            #
litellm-1  | #------------------------------------------------------------#
litellm-1  | 
litellm-1  |  Thank you for using LiteLLM! - Krrish & Ishaan
litellm-1  | 
litellm-1  | 
litellm-1  | 
litellm-1  | Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
litellm-1  | 
litellm-1  | 
litellm-1  | LiteLLM: Proxy initialized with Config, Set models:
litellm-1  |     qwen_qwen3_235b_a22b
litellm-1  |     meta_llama_llama_3_1_8b_instruct
litellm-1  |     meta_llama_llama_3_1_8b_instruct_fake
litellm-1  |     gpt4o
litellm-1  |     gpt4o_test
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:4244 - checking potential_model_names in litellm.model_cost: {'split_model': 'b50a91c389bb76883145f5df3dcdf0ab5103c0f16da0a1be09eb3562b451b8d9', 'combined_model_name': 'b50a91c389bb76883145f5df3dcdf0ab5103c0f16da0a1be09eb3562b451b8d9', 'stripped_model_name': 'b50a91c389bb76883145f5df3dcdf0ab5103c0f16da0a1be09eb3562b451b8d9', 'combined_stripped_model_name': 'b50a91c389bb76883145f5df3dcdf0ab5103c0f16da0a1be09eb3562b451b8d9', 'custom_llm_provider': None}
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:4453 - Error getting model info: This model isn't mapped yet. Add it here - https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:2052 - added/updated model=b50a91c389bb76883145f5df3dcdf0ab5103c0f16da0a1be09eb3562b451b8d9 in litellm.model_cost: b50a91c389bb76883145f5df3dcdf0ab5103c0f16da0a1be09eb3562b451b8d9
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:4244 - checking potential_model_names in litellm.model_cost: {'split_model': 'qwen_qwen3_235b_a22b', 'combined_model_name': 'serval-llm/qwen_qwen3_235b_a22b', 'stripped_model_name': 'serval-llm/qwen_qwen3_235b_a22b', 'combined_stripped_model_name': 'serval-llm/qwen_qwen3_235b_a22b', 'custom_llm_provider': 'serval-llm'}
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:4453 - Error getting model info: This model isn't mapped yet. Add it here - https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:2052 - added/updated model=serval-llm/qwen_qwen3_235b_a22b in litellm.model_cost: serval-llm/qwen_qwen3_235b_a22b
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:4244 - checking potential_model_names in litellm.model_cost: {'split_model': 'd65327edefb8f1ff6691feef7a1122b0d616fd0167c19b62a81f619dfe4f2009', 'combined_model_name': 'd65327edefb8f1ff6691feef7a1122b0d616fd0167c19b62a81f619dfe4f2009', 'stripped_model_name': 'd65327edefb8f1ff6691feef7a1122b0d616fd0167c19b62a81f619dfe4f2009', 'combined_stripped_model_name': 'd65327edefb8f1ff6691feef7a1122b0d616fd0167c19b62a81f619dfe4f2009', 'custom_llm_provider': None}
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:4453 - Error getting model info: This model isn't mapped yet. Add it here - https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:2052 - added/updated model=d65327edefb8f1ff6691feef7a1122b0d616fd0167c19b62a81f619dfe4f2009 in litellm.model_cost: d65327edefb8f1ff6691feef7a1122b0d616fd0167c19b62a81f619dfe4f2009
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:4244 - checking potential_model_names in litellm.model_cost: {'split_model': 'meta_llama_llama_3_1_8b_instruct', 'combined_model_name': 'serval-llm/meta_llama_llama_3_1_8b_instruct', 'stripped_model_name': 'serval-llm/meta_llama_llama_3_1_8b_instruct', 'combined_stripped_model_name': 'serval-llm/meta_llama_llama_3_1_8b_instruct', 'custom_llm_provider': 'serval-llm'}
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:4453 - Error getting model info: This model isn't mapped yet. Add it here - https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:2052 - added/updated model=serval-llm/meta_llama_llama_3_1_8b_instruct in litellm.model_cost: serval-llm/meta_llama_llama_3_1_8b_instruct
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:4244 - checking potential_model_names in litellm.model_cost: {'split_model': '433db61fdae9146da69d1933b30aabe56a2db9448e90927fdcc2a12e6e470fc4', 'combined_model_name': '433db61fdae9146da69d1933b30aabe56a2db9448e90927fdcc2a12e6e470fc4', 'stripped_model_name': '433db61fdae9146da69d1933b30aabe56a2db9448e90927fdcc2a12e6e470fc4', 'combined_stripped_model_name': '433db61fdae9146da69d1933b30aabe56a2db9448e90927fdcc2a12e6e470fc4', 'custom_llm_provider': None}
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:4453 - Error getting model info: This model isn't mapped yet. Add it here - https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:2052 - added/updated model=433db61fdae9146da69d1933b30aabe56a2db9448e90927fdcc2a12e6e470fc4 in litellm.model_cost: 433db61fdae9146da69d1933b30aabe56a2db9448e90927fdcc2a12e6e470fc4
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:4244 - checking potential_model_names in litellm.model_cost: {'split_model': 'meta_llama_llama_3_1_8b_instruct_fake', 'combined_model_name': 'serval-llm/meta_llama_llama_3_1_8b_instruct_fake', 'stripped_model_name': 'serval-llm/meta_llama_llama_3_1_8b_instruct_fake', 'combined_stripped_model_name': 'serval-llm/meta_llama_llama_3_1_8b_instruct_fake', 'custom_llm_provider': 'serval-llm'}
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:4453 - Error getting model info: This model isn't mapped yet. Add it here - https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:2052 - added/updated model=serval-llm/meta_llama_llama_3_1_8b_instruct_fake in litellm.model_cost: serval-llm/meta_llama_llama_3_1_8b_instruct_fake
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:4244 - checking potential_model_names in litellm.model_cost: {'split_model': 'a3d830531240aebd7863a7a3f3dfee1468a2f6992029911f0c3168b94586a16c', 'combined_model_name': 'a3d830531240aebd7863a7a3f3dfee1468a2f6992029911f0c3168b94586a16c', 'stripped_model_name': 'a3d830531240aebd7863a7a3f3dfee1468a2f6992029911f0c3168b94586a16c', 'combined_stripped_model_name': 'a3d830531240aebd7863a7a3f3dfee1468a2f6992029911f0c3168b94586a16c', 'custom_llm_provider': None}
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:4453 - Error getting model info: This model isn't mapped yet. Add it here - https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:2052 - added/updated model=a3d830531240aebd7863a7a3f3dfee1468a2f6992029911f0c3168b94586a16c in litellm.model_cost: a3d830531240aebd7863a7a3f3dfee1468a2f6992029911f0c3168b94586a16c
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:4244 - checking potential_model_names in litellm.model_cost: {'split_model': 'gpt-4o', 'combined_model_name': 'azure/gpt-4o', 'stripped_model_name': 'azure/gpt-4o', 'combined_stripped_model_name': 'azure/gpt-4o', 'custom_llm_provider': 'azure'}
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:4542 - model_info: {'key': 'azure/gpt-4o', 'max_tokens': 16384, 'max_input_tokens': 128000, 'max_output_tokens': 16384, 'input_cost_per_token': 2.5e-06, 'cache_creation_input_token_cost': None, 'cache_read_input_token_cost': 1.25e-06, 'input_cost_per_character': None, 'input_cost_per_token_above_128k_tokens': None, 'input_cost_per_token_above_200k_tokens': None, 'input_cost_per_query': None, 'input_cost_per_second': None, 'input_cost_per_audio_token': None, 'input_cost_per_token_batches': None, 'output_cost_per_token_batches': None, 'output_cost_per_token': 1e-05, 'output_cost_per_audio_token': None, 'output_cost_per_character': None, 'output_cost_per_reasoning_token': None, 'output_cost_per_token_above_128k_tokens': None, 'output_cost_per_character_above_128k_tokens': None, 'output_cost_per_token_above_200k_tokens': None, 'output_cost_per_second': None, 'output_cost_per_image': None, 'output_vector_size': None, 'litellm_provider': 'azure', 'mode': 'chat', 'supports_system_messages': None, 'supports_response_schema': True, 'supports_vision': True, 'supports_function_calling': True, 'supports_tool_choice': True, 'supports_assistant_prefill': False, 'supports_prompt_caching': True, 'supports_audio_input': False, 'supports_audio_output': False, 'supports_pdf_input': False, 'supports_embedding_image_input': False, 'supports_native_streaming': None, 'supports_web_search': False, 'supports_reasoning': False, 'search_context_cost_per_query': None, 'tpm': None, 'rpm': None}
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:2052 - added/updated model=azure/gpt-4o in litellm.model_cost: azure/gpt-4o
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:4244 - checking potential_model_names in litellm.model_cost: {'split_model': '5ed14060b96e244a34cd27b5d8cf708415fda0c5e9fe66656a2a68e564e82a2f', 'combined_model_name': '5ed14060b96e244a34cd27b5d8cf708415fda0c5e9fe66656a2a68e564e82a2f', 'stripped_model_name': '5ed14060b96e244a34cd27b5d8cf708415fda0c5e9fe66656a2a68e564e82a2f', 'combined_stripped_model_name': '5ed14060b96e244a34cd27b5d8cf708415fda0c5e9fe66656a2a68e564e82a2f', 'custom_llm_provider': None}
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:4453 - Error getting model info: This model isn't mapped yet. Add it here - https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:2052 - added/updated model=5ed14060b96e244a34cd27b5d8cf708415fda0c5e9fe66656a2a68e564e82a2f in litellm.model_cost: 5ed14060b96e244a34cd27b5d8cf708415fda0c5e9fe66656a2a68e564e82a2f
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:4244 - checking potential_model_names in litellm.model_cost: {'split_model': 'gpt4o_test', 'combined_model_name': 'azure/gpt4o_test', 'stripped_model_name': 'azure/gpt4o_test', 'combined_stripped_model_name': 'azure/gpt4o_test', 'custom_llm_provider': 'azure'}
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:4453 - Error getting model info: This model isn't mapped yet. Add it here - https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json
litellm-1  | 08:56:47 - LiteLLM:DEBUG: utils.py:2052 - added/updated model=azure/gpt4o_test in litellm.model_cost: azure/gpt4o_test
litellm-1  | 08:56:47 - LiteLLM Router:DEBUG: router.py:4562 - 
litellm-1  | Initialized Model List ['qwen_qwen3_235b_a22b', 'meta_llama_llama_3_1_8b_instruct', 'meta_llama_llama_3_1_8b_instruct_fake', 'gpt4o', 'gpt4o_test']
litellm-1  | 08:56:47 - LiteLLM Router:INFO: router.py:656 - Routing strategy: simple-shuffle
litellm-1  | 08:56:47 - LiteLLM Router:DEBUG: router.py:541 - Intialized router with Routing strategy: simple-shuffle
litellm-1  | 
litellm-1  | Routing enable_pre_call_checks: False
litellm-1  | 
litellm-1  | Routing fallbacks: [{'meta_llama_llama_3_1_8b_instruct': ['qwen_qwen3_235b_a22b']}, {'meta_llama_llama_3_1_8b_instruct_fake': ['gpt4o']}, {'gpt4o': ['qwen_qwen3_235b_a22b']}, {'gpt4o_test': ['qwen_qwen3_235b_a22b']}]
litellm-1  | 
litellm-1  | Routing content fallbacks: None
litellm-1  | 
litellm-1  | Routing context window fallbacks: None
litellm-1  | 
litellm-1  | Router Redis Caching=None
litellm-1  | 
litellm-1  | 08:56:47 - LiteLLM Proxy:DEBUG: proxy_server.py:588 - prisma_client: None
litellm-1  | INFO:     Application startup complete.
litellm-1  | INFO:     Uvicorn running on http://*******:4000 (Press CTRL+C to quit)
litellm-1  | 08:56:49 - LiteLLM Proxy:DEBUG: common_request_processing.py:213 - Request received by LiteLLM:
litellm-1  | {
litellm-1  |     "model": "meta_llama_llama_3_1_8b_instruct_fake",
litellm-1  |     "messages": [
litellm-1  |         {
litellm-1  |             "role": "user",
litellm-1  |             "content": "what llm are you"
litellm-1  |         }
litellm-1  |     ],
litellm-1  |     "stream": true
litellm-1  | }
litellm-1  | 08:56:49 - LiteLLM Proxy:DEBUG: litellm_pre_call_utils.py:565 - Request Headers: Headers({'host': 'localhost:4000', 'user-agent': 'curl/8.7.1', 'accept': '*/*', 'content-type': 'application/json', 'content-length': '195'})
litellm-1  | 08:56:49 - LiteLLM Proxy:DEBUG: litellm_pre_call_utils.py:571 - receiving data: {'model': 'meta_llama_llama_3_1_8b_instruct_fake', 'messages': [{'role': 'user', 'content': 'what llm are you'}], 'stream': True, 'proxy_server_request': {'url': 'http://localhost:4000/chat/completions', 'method': 'POST', 'headers': {'host': 'localhost:4000', 'user-agent': 'curl/8.7.1', 'accept': '*/*', 'content-type': 'application/json', 'content-length': '195'}, 'body': {'model': 'meta_llama_llama_3_1_8b_instruct_fake', 'messages': [{'role': 'user', 'content': 'what llm are you'}], 'stream': True}}}
litellm-1  | 08:56:49 - LiteLLM Proxy:DEBUG: litellm_pre_call_utils.py:737 - [PROXY] returned data from litellm_pre_call_utils: {'model': 'meta_llama_llama_3_1_8b_instruct_fake', 'messages': [{'role': 'user', 'content': 'what llm are you'}], 'stream': True, 'proxy_server_request': {'url': 'http://localhost:4000/chat/completions', 'method': 'POST', 'headers': {'host': 'localhost:4000', 'user-agent': 'curl/8.7.1', 'accept': '*/*', 'content-type': 'application/json', 'content-length': '195'}, 'body': {'model': 'meta_llama_llama_3_1_8b_instruct_fake', 'messages': [{'role': 'user', 'content': 'what llm are you'}], 'stream': True}}, 'metadata': {'requester_metadata': {}, 'user_api_key_hash': None, 'user_api_key_alias': None, 'user_api_key_team_id': None, 'user_api_key_user_id': None, 'user_api_key_org_id': None, 'user_api_key_team_alias': None, 'user_api_key_end_user_id': None, 'user_api_key_user_email': None, 'user_api_key': None, 'user_api_end_user_max_budget': None, 'litellm_api_version': '1.70.0', 'global_max_parallel_requests': None, 'user_api_key_team_max_budget': None, 'user_api_key_team_spend': None, 'user_api_key_spend': 0.0, 'user_api_key_max_budget': None, 'user_api_key_model_max_budget': {}, 'user_api_key_metadata': {}, 'headers': {'host': 'localhost:4000', 'user-agent': 'curl/8.7.1', 'accept': '*/*', 'content-type': 'application/json', 'content-length': '195'}, 'endpoint': 'http://localhost:4000/chat/completions', 'litellm_parent_otel_span': None, 'requester_ip_address': ''}}
litellm-1  | 08:56:49 - LiteLLM Proxy:DEBUG: utils.py:524 - Inside Proxy Logging Pre-call hook!
litellm-1  | 08:56:49 - LiteLLM Proxy:DEBUG: max_budget_limiter.py:23 - Inside Max Budget Limiter Pre-Call Hook
litellm-1  | 08:56:49 - LiteLLM Proxy:DEBUG: parallel_request_limiter.py:48 - Inside Max Parallel Request Pre-Call Hook
litellm-1  | 08:56:49 - LiteLLM Proxy:DEBUG: cache_control_check.py:27 - Inside Cache Control Check Pre-Call Hook
litellm-1  | 08:56:49 - LiteLLM:DEBUG: utils.py:336 - Initialized litellm callbacks, Async Success Callbacks: [<bound method Router.deployment_callback_on_success of <litellm.router.Router object at 0x7ffff5aa0980>>, <litellm.proxy.hooks.model_max_budget_limiter._PROXY_VirtualKeyModelMaxBudgetLimiter object at 0x7ffff66a8590>, <litellm.proxy.hooks.max_budget_limiter._PROXY_MaxBudgetLimiter object at 0x7ffff5b634d0>, <litellm.proxy.hooks.parallel_request_limiter._PROXY_MaxParallelRequestsHandler object at 0x7ffff5b63610>, <litellm.proxy.hooks.cache_control_check._PROXY_CacheControlCheck object at 0x7ffff5b63390>, <litellm._service_logger.ServiceLogging object at 0x7ffff6716060>]
litellm-1  | 08:56:49 - LiteLLM:DEBUG: litellm_logging.py:458 - self.optional_params: {}
litellm-1  | 08:56:49 - LiteLLM Router:DEBUG: router.py:3547 - Inside async function with retries.
litellm-1  | 08:56:49 - LiteLLM Router:DEBUG: router.py:3567 - async function w/ retries: original_function - <bound method Router._acompletion of <litellm.router.Router object at 0x7ffff5aa0980>>, num_retries - 2
litellm-1  | 08:56:49 - LiteLLM Router:DEBUG: router.py:984 - Inside _acompletion()- model: meta_llama_llama_3_1_8b_instruct_fake; kwargs: {'proxy_server_request': {'url': 'http://localhost:4000/chat/completions', 'method': 'POST', 'headers': {'host': 'localhost:4000', 'user-agent': 'curl/8.7.1', 'accept': '*/*', 'content-type': 'application/json', 'content-length': '195'}, 'body': {'model': 'meta_llama_llama_3_1_8b_instruct_fake', 'messages': [{'role': 'user', 'content': 'what llm are you'}], 'stream': True}}, 'metadata': {'requester_metadata': {}, 'user_api_key_hash': None, 'user_api_key_alias': None, 'user_api_key_team_id': None, 'user_api_key_user_id': None, 'user_api_key_org_id': None, 'user_api_key_team_alias': None, 'user_api_key_end_user_id': None, 'user_api_key_user_email': None, 'user_api_key': None, 'user_api_end_user_max_budget': None, 'litellm_api_version': '1.70.0', 'global_max_parallel_requests': None, 'user_api_key_team_max_budget': None, 'user_api_key_team_spend': None, 'user_api_key_spend': 0.0, 'user_api_key_max_budget': None, 'user_api_key_model_max_budget': {}, 'user_api_key_metadata': {}, 'headers': {'host': 'localhost:4000', 'user-agent': 'curl/8.7.1', 'accept': '*/*', 'content-type': 'application/json', 'content-length': '195'}, 'endpoint': 'http://localhost:4000/chat/completions', 'litellm_parent_otel_span': None, 'requester_ip_address': '', 'model_group': 'meta_llama_llama_3_1_8b_instruct_fake', 'model_group_size': 1}, 'litellm_call_id': '************************************', 'litellm_logging_obj': <litellm.litellm_core_utils.litellm_logging.Logging object at 0x7ffff5aa34d0>, 'stream': True, 'litellm_trace_id': '************************************'}
litellm-1  | 08:56:49 - LiteLLM Router:DEBUG: router.py:6012 - initial list of deployments: [{'model_name': 'meta_llama_llama_3_1_8b_instruct_fake', 'litellm_params': {'use_in_pass_through': False, 'use_litellm_proxy': False, 'merge_reasoning_content_in_choices': False, 'model': 'serval-llm/meta_llama_llama_3_1_8b_instruct_fake', 'api_url': '**************'}, 'model_info': {'id': '433db61fdae9146da69d1933b30aabe56a2db9448e90927fdcc2a12e6e470fc4', 'db_model': False}}]
litellm-1  | 08:56:49 - LiteLLM Router:DEBUG: cooldown_handlers.py:326 - retrieve cooldown models: []
litellm-1  | 08:56:49 - LiteLLM Router:DEBUG: router.py:6062 - async cooldown deployments: []
litellm-1  | 08:56:49 - LiteLLM Router:DEBUG: router.py:6065 - cooldown_deployments: []
litellm-1  | 08:56:49 - LiteLLM Router:DEBUG: router.py:6385 - cooldown deployments: []
litellm-1  | 08:56:49 - LiteLLM:DEBUG: utils.py:336 - 
litellm-1  | 
litellm-1  | 08:56:49 - LiteLLM:DEBUG: utils.py:336 - Request to litellm:
litellm-1  | 08:56:49 - LiteLLM:DEBUG: utils.py:336 - litellm.acompletion(use_in_pass_through=False, use_litellm_proxy=False, merge_reasoning_content_in_choices=False, model='serval-llm/meta_llama_llama_3_1_8b_instruct_fake', api_url='****************', messages=[{'role': 'user', 'content': 'what llm are you'}], caching=False, client=None, proxy_server_request={'url': 'http://localhost:4000/chat/completions', 'method': 'POST', 'headers': {'host': 'localhost:4000', 'user-agent': 'curl/8.7.1', 'accept': '*/*', 'content-type': 'application/json', 'content-length': '195'}, 'body': {'model': 'meta_llama_llama_3_1_8b_instruct_fake', 'messages': [{'role': 'user', 'content': 'what llm are you'}], 'stream': True}}, metadata={'requester_metadata': {}, 'user_api_key_hash': None, 'user_api_key_alias': None, 'user_api_key_team_id': None, 'user_api_key_user_id': None, 'user_api_key_org_id': None, 'user_api_key_team_alias': None, 'user_api_key_end_user_id': None, 'user_api_key_user_email': None, 'user_api_key': None, 'user_api_end_user_max_budget': None, 'litellm_api_version': '1.70.0', 'global_max_parallel_requests': None, 'user_api_key_team_max_budget': None, 'user_api_key_team_spend': None, 'user_api_key_spend': 0.0, 'user_api_key_max_budget': None, 'user_api_key_model_max_budget': {}, 'user_api_key_metadata': {}, 'headers': {'host': 'localhost:4000', 'user-agent': 'curl/8.7.1', 'accept': '*/*', 'content-type': 'application/json', 'content-length': '195'}, 'endpoint': 'http://localhost:4000/chat/completions', 'litellm_parent_otel_span': None, 'requester_ip_address': '', 'model_group': 'meta_llama_llama_3_1_8b_instruct_fake', 'model_group_size': 1, 'deployment': 'serval-llm/meta_llama_llama_3_1_8b_instruct_fake', 'model_info': {'id': '433db61fdae9146da69d1933b30aabe56a2db9448e90927fdcc2a12e6e470fc4', 'db_model': False}, 'api_base': None, 'caching_groups': None}, litellm_call_id='************************************', litellm_logging_obj=<litellm.litellm_core_utils.litellm_logging.Logging object at 0x7ffff5aa34d0>, stream=True, litellm_trace_id='************************************', model_info={'id': '433db61fdae9146da69d1933b30aabe56a2db9448e90927fdcc2a12e6e470fc4', 'db_model': False}, timeout=10, max_retries=0)
litellm-1  | 08:56:49 - LiteLLM:DEBUG: utils.py:336 - 
litellm-1  | 
litellm-1  | 08:56:49 - LiteLLM:DEBUG: utils.py:336 - ASYNC kwargs[caching]: False; litellm.cache: None; kwargs.get('cache'): None
litellm-1  | 08:56:49 - LiteLLM:DEBUG: caching_handler.py:211 - CACHE RESULT: None
litellm-1  | 08:56:49 - LiteLLM:INFO: utils.py:2904 - 
litellm-1  | LiteLLM completion() model= meta_llama_llama_3_1_8b_instruct_fake; provider = serval-llm
litellm-1  | 08:56:49 - LiteLLM:DEBUG: utils.py:2907 - 
litellm-1  | LiteLLM: Params passed to completion() {'model': 'meta_llama_llama_3_1_8b_instruct_fake', 'functions': None, 'function_call': None, 'temperature': None, 'top_p': None, 'n': None, 'stream': True, 'stream_options': None, 'stop': None, 'max_tokens': None, 'max_completion_tokens': None, 'modalities': None, 'prediction': None, 'audio': None, 'presence_penalty': None, 'frequency_penalty': None, 'logit_bias': None, 'user': None, 'custom_llm_provider': 'serval-llm', 'response_format': None, 'seed': None, 'tools': None, 'tool_choice': None, 'max_retries': 0, 'logprobs': None, 'top_logprobs': None, 'extra_headers': None, 'api_version': None, 'parallel_tool_calls': None, 'drop_params': None, 'allowed_openai_params': None, 'reasoning_effort': None, 'additional_drop_params': None, 'messages': [{'role': 'user', 'content': 'what llm are you'}], 'thinking': None, 'web_search_options': None, 'api_url':  '*************'}
litellm-1  | 08:56:49 - LiteLLM:DEBUG: utils.py:2910 - 
litellm-1  | LiteLLM: Non-Default params passed to completion() {'stream': True, 'max_retries': 0}
litellm-1  | 08:56:49 - LiteLLM:DEBUG: utils.py:336 - Final returned optional params: {'stream': True, 'max_retries': 0, 'api_url':  '*************'}
litellm-1  | 08:56:49 - LiteLLM:DEBUG: litellm_logging.py:458 - self.optional_params: {'stream': True, 'max_retries': 0, 'api_url': '*************'}
litellm-1  | 08:56:49 - LiteLLM:INFO: cost_calculator.py:655 - selected model name for cost calculation: serval-llm/serval-llm/meta_llama_llama_3_1_8b_instruct_fake
litellm-1  | 08:56:49 - LiteLLM:DEBUG: token_counter.py:365 - messages in token_counter: None, text in token_counter: 
litellm-1  | 08:56:49 - LiteLLM:DEBUG: utils.py:4244 - checking potential_model_names in litellm.model_cost: {'split_model': 'meta_llama_llama_3_1_8b_instruct_fake', 'combined_model_name': 'serval-llm/meta_llama_llama_3_1_8b_instruct_fake', 'stripped_model_name': 'meta_llama_llama_3_1_8b_instruct_fake', 'combined_stripped_model_name': 'serval-llm/meta_llama_llama_3_1_8b_instruct_fake', 'custom_llm_provider': 'serval-llm'}
litellm-1  | 08:56:49 - LiteLLM:DEBUG: utils.py:4341 - model=serval-llm/meta_llama_llama_3_1_8b_instruct_fake, custom_llm_provider=serval-llm has no input_cost_per_token in model_cost_map. Defaulting to 0.
litellm-1  | 08:56:49 - LiteLLM:DEBUG: utils.py:4353 - model=serval-llm/meta_llama_llama_3_1_8b_instruct_fake, custom_llm_provider=serval-llm has no output_cost_per_token in model_cost_map. Defaulting to 0.
litellm-1  | 08:56:49 - LiteLLM:DEBUG: cost_calculator.py:376 - Returned custom cost for model=serval-llm/meta_llama_llama_3_1_8b_instruct_fake - prompt_tokens_cost_usd_dollar: 0, completion_tokens_cost_usd_dollar: 0
litellm-1  | 08:56:49 - LiteLLM:DEBUG: litellm_logging.py:1124 - response_cost: 0.0
litellm-1  | 08:56:49 - LiteLLM Router:INFO: router.py:1081 - litellm.acompletion(model=serval-llm/meta_llama_llama_3_1_8b_instruct_fake) 200 OK
litellm-1  | 08:56:49 - LiteLLM Router:DEBUG: router.py:3307 - Async Response: <litellm.litellm_core_utils.streaming_handler.CustomStreamWrapper object at 0x7ffff5aa3b60>
litellm-1  | 08:56:49 - LiteLLM Proxy:DEBUG: proxy_server.py:3011 - inside generator
litellm-1  | 08:56:49 - LiteLLM:DEBUG: litellm_logging.py:2159 - Logging Details LiteLLM-Failure Call: [<bound method Router.deployment_callback_on_failure of <litellm.router.Router object at 0x7ffff5aa0980>>, <litellm.proxy.hooks.model_max_budget_limiter._PROXY_VirtualKeyModelMaxBudgetLimiter object at 0x7ffff66a8590>, <litellm.proxy.hooks.max_budget_limiter._PROXY_MaxBudgetLimiter object at 0x7ffff5b634d0>, <litellm.proxy.hooks.parallel_request_limiter._PROXY_MaxParallelRequestsHandler object at 0x7ffff5b63610>, <litellm.proxy.hooks.cache_control_check._PROXY_CacheControlCheck object at 0x7ffff5b63390>, <litellm._service_logger.ServiceLogging object at 0x7ffff6716060>]
litellm-1  | 08:56:49 - LiteLLM:DEBUG: get_api_base.py:62 - Error occurred in getting api base - litellm.BadRequestError: LLM Provider NOT provided. Pass in the LLM provider you are trying to call. You passed model=meta_llama_llama_3_1_8b_instruct_fake
litellm-1  |  Pass model as E.g. For 'Huggingface' inference endpoints pass in `completion(model='huggingface/starcoder',..)` Learn more: https://docs.litellm.ai/docs/providers
litellm-1  | 08:56:49 - LiteLLM:DEBUG: exception_mapping_utils.py:2261 - Logging Details: logger_fn - None | callable(logger_fn) - False
litellm-1  | 08:56:49 - LiteLLM Proxy:ERROR: proxy_server.py:3038 - litellm.proxy.proxy_server.async_data_generator(): Exception occured - litellm.APIConnectionError: Not implemented yet!
litellm-1  | Traceback (most recent call last):
litellm-1  |   File "/home/xxx/app/lib/python3.13/site-packages/litellm/litellm_core_utils/streaming_handler.py", line 1670, in __anext__
litellm-1  |     async for chunk in self.completion_stream:
litellm-1  |     ...<48 lines>...
litellm-1  |         return processed_chunk
litellm-1  |   File "/home/xxx/app/litellm_model.py", line 185, in astreaming
litellm-1  |     raise CustomLLMError(status_code=500, message="Not implemented yet!")
litellm-1  | litellm_model.CustomLLMError: Not implemented yet!
litellm-1  | Traceback (most recent call last):
litellm-1  |   File "/home/xxx/app/lib/python3.13/site-packages/litellm/litellm_core_utils/streaming_handler.py", line 1670, in __anext__
litellm-1  |     async for chunk in self.completion_stream:
litellm-1  |     ...<48 lines>...
litellm-1  |         return processed_chunk
litellm-1  |   File "/home/xxx/app/litellm_model.py", line 185, in astreaming
litellm-1  |     raise CustomLLMError(status_code=500, message="Not implemented yet!")
litellm-1  | litellm_model.CustomLLMError: Not implemented yet!
litellm-1  | 
litellm-1  | During handling of the above exception, another exception occurred:
litellm-1  | 
litellm-1  | Traceback (most recent call last):
litellm-1  |   File "/home/xxx/app/lib/python3.13/site-packages/litellm/proxy/proxy_server.py", line 3013, in async_data_generator
litellm-1  |     async for chunk in proxy_logging_obj.async_post_call_streaming_iterator_hook(
litellm-1  |     ...<18 lines>...
litellm-1  |             yield f"data: {str(e)}\n\n"
litellm-1  |   File "/home/xxx/app/lib/python3.13/site-packages/litellm/integrations/custom_logger.py", line 288, in async_post_call_streaming_iterator_hook
litellm-1  |     async for item in response:
litellm-1  |         yield item
litellm-1  |   File "/home/xxx/app/lib/python3.13/site-packages/litellm/integrations/custom_logger.py", line 288, in async_post_call_streaming_iterator_hook
litellm-1  |     async for item in response:
litellm-1  |         yield item
litellm-1  |   File "/home/xxx/app/lib/python3.13/site-packages/litellm/integrations/custom_logger.py", line 288, in async_post_call_streaming_iterator_hook
litellm-1  |     async for item in response:
litellm-1  |         yield item
litellm-1  |   [Previous line repeated 2 more times]
litellm-1  |   File "/home/xxx/app/lib/python3.13/site-packages/litellm/litellm_core_utils/streaming_handler.py", line 1831, in __anext__
litellm-1  |     raise exception_type(
litellm-1  |           ~~~~~~~~~~~~~~^
litellm-1  |         model=self.model,
litellm-1  |         ^^^^^^^^^^^^^^^^^
litellm-1  |     ...<3 lines>...
litellm-1  |         extra_kwargs={},
litellm-1  |         ^^^^^^^^^^^^^^^^
litellm-1  |     )
litellm-1  |     ^
litellm-1  |   File "/home/xxx/app/lib/python3.13/site-packages/litellm/litellm_core_utils/exception_mapping_utils.py", line 2232, in exception_type
litellm-1  |     raise e
litellm-1  |   File "/home/xxx/app/lib/python3.13/site-packages/litellm/litellm_core_utils/exception_mapping_utils.py", line 2208, in exception_type
litellm-1  |     raise APIConnectionError(
litellm-1  |     ...<8 lines>...
litellm-1  |     )
litellm-1  | litellm.exceptions.APIConnectionError: litellm.APIConnectionError: Not implemented yet!
litellm-1  | Traceback (most recent call last):
litellm-1  |   File "/home/xxx/app/lib/python3.13/site-packages/litellm/litellm_core_utils/streaming_handler.py", line 1670, in __anext__
litellm-1  |     async for chunk in self.completion_stream:
litellm-1  |     ...<48 lines>...
litellm-1  |         return processed_chunk
litellm-1  |   File "/home/xxx/app/litellm_model.py", line 185, in astreaming
litellm-1  |     raise CustomLLMError(status_code=500, message="Not implemented yet!")
litellm-1  | litellm_model.CustomLLMError: Not implemented yet!
litellm-1  | 
litellm-1  | 08:56:49 - LiteLLM Router:DEBUG: router.py:4019 - Router: Entering 'deployment_callback_on_failure'
litellm-1  | 08:56:49 - LiteLLM Proxy:DEBUG: proxy_server.py:3048 - An error occurred: litellm.APIConnectionError: Not implemented yet!
litellm-1  | Traceback (most recent call last):
litellm-1  |   File "/home/xxx/app/lib/python3.13/site-packages/litellm/litellm_core_utils/streaming_handler.py", line 1670, in __anext__
litellm-1  |     async for chunk in self.completion_stream:
litellm-1  |     ...<48 lines>...
litellm-1  |         return processed_chunk
litellm-1  |   File "/home/xxx/app/litellm_model.py", line 185, in astreaming
litellm-1  |     raise CustomLLMError(status_code=500, message="Not implemented yet!")
litellm-1  | litellm_model.CustomLLMError: Not implemented yet!
litellm-1  | 
litellm-1  | 
litellm-1  |  Debug this by setting `--debug`, e.g. `litellm --model gpt-3.5-turbo --debug`
litellm-1  | 08:56:49 - LiteLLM Router:DEBUG: cooldown_handlers.py:260 - checks 'should_run_cooldown_logic'
litellm-1  | 08:56:49 - LiteLLM Router:DEBUG: cooldown_handlers.py:274 - Attempting to add 433db61fdae9146da69d1933b30aabe56a2db9448e90927fdcc2a12e6e470fc4 to cooldown list
litellm-1  | 08:56:49 - LiteLLM Router:DEBUG: cooldown_handlers.py:199 - percent fails for deployment = 433db61fdae9146da69d1933b30aabe56a2db9448e90927fdcc2a12e6e470fc4, percent fails = 1.0, num successes = 0, num fails = 1
litellm-1  | INFO:     ************:51493 - "POST /chat/completions HTTP/1.1" 200 OK
litellm-1  | 08:56:49 - LiteLLM Proxy:DEBUG: parallel_request_limiter.py:48 - Inside Max Parallel Request Failure Hook
litellm-1  | 08:56:49 - LiteLLM Proxy:DEBUG: parallel_request_limiter.py:48 - user_api_key: None

Are you a ML Ops Team?

Yes

What LiteLLM version are you on ?

v1.69.0.patch1

Twitter / LinkedIn details

No response

hahamark1, May 20 '25 08:05

Same error here; my version is 1.74.9.

I think the reason is that, in non-streaming mode, await init_response raises an exception that is then caught:
https://github.com/BerriAI/litellm/blob/c99277c51736a331d1deeb49339adba997fa1b42/litellm/main.py#L544

In streaming mode, however, init_response is an instance of CustomStreamWrapper, and any exceptions are only raised later, inside the function linked below while the response is being iterated, which is outside of async_function_with_fallbacks:
https://github.com/BerriAI/litellm/blob/c99277c51736a331d1deeb49339adba997fa1b42/litellm/proxy/proxy_server.py#L3462

@krrishdholakia Could you please take a look and advise on how this can be addressed?
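
To make the difference concrete, here is a self-contained sketch of that control flow (simplified, not litellm's actual code): the fallback wrapper only catches exceptions raised while it awaits the call, whereas a streaming call returns its wrapper object immediately and the failure only surfaces later, while the proxy's response generator iterates the stream.

import asyncio


async def function_with_fallbacks(call, fallback):
    # Stand-in for async_function_with_fallbacks: it can only react to
    # exceptions raised while awaiting the primary call.
    try:
        return await call()
    except Exception:
        return await fallback()


async def failing_stream():
    # Stand-in for a CustomLLM.astreaming that fails: nothing runs until
    # the generator is iterated.
    raise RuntimeError("Not implemented yet!")
    yield  # unreachable; makes this an async generator


async def streaming_call():
    # Stand-in for acompletion(stream=True): it "succeeds" by returning a
    # CustomStreamWrapper-like object without consuming any chunks.
    return failing_stream()


async def main():
    stream = await function_with_fallbacks(streaming_call, streaming_call)
    try:
        async for chunk in stream:
            print(chunk)
    except RuntimeError as e:
        # The error surfaces here, in the proxy's response generator,
        # outside function_with_fallbacks, so no fallback is attempted.
        print(f"error surfaced outside the fallback wrapper: {e}")


asyncio.run(main())

One possible direction (unverified) would be to consume the first chunk of the stream inside the fallback wrapper, or to re-enter the fallback logic from the stream iterator, so that early failures still trigger the configured fallbacks.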

KyleZhang0536, Aug 18 '25 09:08

Related code: https://github.com/BerriAI/litellm/blob/c99277c51736a331d1deeb49339adba997fa1b42/litellm/main.py#L3485-L3516

KyleZhang0536, Aug 18 '25 12:08

Hi @ishaan-jaff, could you please take a look and advise on how this can be addressed?

KyleZhang0536, Aug 19 '25 03:08

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

github-actions[bot], Nov 18 '25 00:11

bump

hewliyang, Nov 21 '25 14:11