[Bug]: max_completion_tokens is not supported for some gpt-4o deployments on Azure
What happened?
import litellm

response = litellm.completion(
    model="azure/gpt-4o",
    api_base="https://XXX.openai.azure.com/",
    api_version="2024-08-01-preview",
    api_key="XXX",
    messages=[{"role": "user", "content": "hi"}],
    max_completion_tokens=100,
)
print(response)
The above call returns:
BadRequestError: litellm.BadRequestError: AzureException BadRequestError - Error code: 400 - {'error': {'message': 'Unrecognized request argument supplied: max_completion_tokens', 'type': 'invalid_request_error', 'param': None, 'code': None}}
But when I comment out max_completion_tokens, I get the correct result.
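A possible workaround in the meantime, sketched below under the assumption that the deployment still accepts the legacy max_tokens parameter, is to cap the output with max_tokens instead:

import litellm

# Workaround sketch (assumption: the deployment still accepts the legacy
# max_tokens parameter, which max_completion_tokens was meant to replace).
response = litellm.completion(
    model="azure/gpt-4o",
    api_base="https://XXX.openai.azure.com/",
    api_version="2024-08-01-preview",
    api_key="XXX",
    messages=[{"role": "user", "content": "hi"}],
    max_tokens=100,  # instead of max_completion_tokens, which this deployment rejects
)
print(response)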
I verified that my deployment does not support this parameter, despite the Azure documentation saying it should be supported.
❯ curl -X POST 'https://XXX.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-08-01-preview' \
-H "api-key: $API_KEY" \
-H "content-type: application/json" \
--data '{"messages": [{"role": "user", "content": "Ping!"}], "temperature": 0.0, "max_completion_tokens": 100}'
{
"error": {
"message": "Unrecognized request argument supplied: max_completion_tokens",
"type": "invalid_request_error",
"param": null,
"code": null
}
}
I tried using multiple other api-version values, but none helped.
A similar issue was reported in this Azure OpenAI forum post:
I'd like to add here that we have an openai deployment which intermittently errors out when using max_completion_tokens. This is when using a gpt-4o model on API version 2024-10-21. ... This is pretty strange because repeatedly calling the same endpoint results in successes about 90% of the time. The other 10% of the time we get this strange error. Not sure if this is a known issue. Turns out there was a scaling issue on the MS side, and this has since been fixed.
I believe the root cause is on Azure's side, since their documentation shows that this parameter should be supported and does not say it is limited to specific models (the o-series).
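Because the failure is intermittent and deployment-specific, a defensive client-side fallback could also help until this is resolved. The sketch below is only illustrative; it assumes the error surfaces as litellm.BadRequestError (as in the traceback below), and the helper name is hypothetical:

import litellm

def completion_with_fallback(**kwargs):
    # Hypothetical helper (not part of litellm): retry with the legacy
    # max_tokens parameter if the deployment rejects max_completion_tokens.
    try:
        return litellm.completion(**kwargs)
    except litellm.BadRequestError as e:
        if "max_completion_tokens" in str(e) and "max_completion_tokens" in kwargs:
            kwargs["max_tokens"] = kwargs.pop("max_completion_tokens")
            return litellm.completion(**kwargs)
        raise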
Related PR:
Add support for max_completion_tokens in Azure OpenAI #6376
Relevant log output
Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.
---------------------------------------------------------------------------
BadRequestError Traceback (most recent call last)
File ~/code/testing/.conda/lib/python3.11/site-packages/litellm/llms/azure/azure.py:516, in AzureChatCompletion.completion(self, model, messages, model_response, api_key, api_base, api_version, api_type, azure_ad_token, azure_ad_token_provider, dynamic_params, print_verbose, timeout, logging_obj, optional_params, litellm_params, logger_fn, acompletion, headers, client)
511 raise AzureOpenAIError(
512 status_code=500,
513 message="azure_client is not an instance of AzureOpenAI",
514 )
--> 516 headers, response = self.make_sync_azure_openai_chat_completion_request(
517 azure_client=azure_client, data=data, timeout=timeout
518 )
519 stringified_response = response.model_dump()
File ~/code/testing/.conda/lib/python3.11/site-packages/litellm/llms/azure/azure.py:295, in AzureChatCompletion.make_sync_azure_openai_chat_completion_request(self, azure_client, data, timeout)
294 except Exception as e:
--> 295 raise e
File ~/code/testing/.conda/lib/python3.11/site-packages/litellm/llms/azure/azure.py:287, in AzureChatCompletion.make_sync_azure_openai_chat_completion_request(self, azure_client, data, timeout)
286 try:
--> 287 raw_response = azure_client.chat.completions.with_raw_response.create(
288 **data, timeout=timeout
289 )
291 headers = dict(raw_response.headers)
File ~/code/testing/.conda/lib/python3.11/site-packages/openai/_legacy_response.py:364, in to_raw_response_wrapper.<locals>.wrapped(*args, **kwargs)
362 kwargs["extra_headers"] = extra_headers
--> 364 return cast(LegacyAPIResponse[R], func(*args, **kwargs))
File ~/code/testing/.conda/lib/python3.11/site-packages/openai/_utils/_utils.py:279, in required_args.<locals>.inner.<locals>.wrapper(*args, **kwargs)
278 raise TypeError(msg)
--> 279 return func(*args, **kwargs)
File ~/code/testing/.conda/lib/python3.11/site-packages/openai/resources/chat/completions/completions.py:879, in Completions.create(self, messages, model, audio, frequency_penalty, function_call, functions, logit_bias, logprobs, max_completion_tokens, max_tokens, metadata, modalities, n, parallel_tool_calls, prediction, presence_penalty, reasoning_effort, response_format, seed, service_tier, stop, store, stream, stream_options, temperature, tool_choice, tools, top_logprobs, top_p, user, extra_headers, extra_query, extra_body, timeout)
878 validate_response_format(response_format)
--> 879 return self._post(
880 "/chat/completions",
881 body=maybe_transform(
882 {
883 "messages": messages,
884 "model": model,
885 "audio": audio,
886 "frequency_penalty": frequency_penalty,
887 "function_call": function_call,
888 "functions": functions,
889 "logit_bias": logit_bias,
890 "logprobs": logprobs,
891 "max_completion_tokens": max_completion_tokens,
892 "max_tokens": max_tokens,
893 "metadata": metadata,
894 "modalities": modalities,
895 "n": n,
896 "parallel_tool_calls": parallel_tool_calls,
897 "prediction": prediction,
898 "presence_penalty": presence_penalty,
899 "reasoning_effort": reasoning_effort,
900 "response_format": response_format,
901 "seed": seed,
902 "service_tier": service_tier,
903 "stop": stop,
904 "store": store,
905 "stream": stream,
906 "stream_options": stream_options,
907 "temperature": temperature,
908 "tool_choice": tool_choice,
909 "tools": tools,
910 "top_logprobs": top_logprobs,
911 "top_p": top_p,
912 "user": user,
913 },
914 completion_create_params.CompletionCreateParams,
915 ),
916 options=make_request_options(
917 extra_headers=extra_headers, extra_query=extra_query, extra_body=extra_body, timeout=timeout
918 ),
919 cast_to=ChatCompletion,
920 stream=stream or False,
921 stream_cls=Stream[ChatCompletionChunk],
922 )
File ~/code/testing/.conda/lib/python3.11/site-packages/openai/_base_client.py:1290, in SyncAPIClient.post(self, path, cast_to, body, options, files, stream, stream_cls)
1287 opts = FinalRequestOptions.construct(
1288 method="post", url=path, json_data=body, files=to_httpx_files(files), **options
1289 )
-> 1290 return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
File ~/code/testing/.conda/lib/python3.11/site-packages/openai/_base_client.py:967, in SyncAPIClient.request(self, cast_to, options, remaining_retries, stream, stream_cls)
965 retries_taken = 0
--> 967 return self._request(
968 cast_to=cast_to,
969 options=options,
970 stream=stream,
971 stream_cls=stream_cls,
972 retries_taken=retries_taken,
973 )
File ~/code/testing/.conda/lib/python3.11/site-packages/openai/_base_client.py:1071, in SyncAPIClient._request(self, cast_to, options, retries_taken, stream, stream_cls)
1070 log.debug("Re-raising status error")
-> 1071 raise self._make_status_error_from_response(err.response) from None
1073 return self._process_response(
1074 cast_to=cast_to,
1075 options=options,
(...)
1079 retries_taken=retries_taken,
1080 )
BadRequestError: Error code: 400 - {'error': {'message': 'Unrecognized request argument supplied: max_completion_tokens', 'type': 'invalid_request_error', 'param': None, 'code': None}}
During handling of the above exception, another exception occurred:
AzureOpenAIError Traceback (most recent call last)
File ~/code/testing/.conda/lib/python3.11/site-packages/litellm/main.py:1269, in completion(model, messages, timeout, temperature, top_p, n, stream, stream_options, stop, max_completion_tokens, max_tokens, modalities, prediction, audio, presence_penalty, frequency_penalty, logit_bias, user, reasoning_effort, response_format, seed, tools, tool_choice, logprobs, top_logprobs, parallel_tool_calls, deployment_id, extra_headers, functions, function_call, base_url, api_version, api_key, model_list, **kwargs)
1268 ## COMPLETION CALL
-> 1269 response = azure_chat_completions.completion(
1270 model=model,
1271 messages=messages,
1272 headers=headers,
1273 api_key=api_key,
1274 api_base=api_base,
1275 api_version=api_version,
1276 api_type=api_type,
1277 dynamic_params=dynamic_params,
1278 azure_ad_token=azure_ad_token,
1279 azure_ad_token_provider=azure_ad_token_provider,
1280 model_response=model_response,
1281 print_verbose=print_verbose,
1282 optional_params=optional_params,
1283 litellm_params=litellm_params,
1284 logger_fn=logger_fn,
1285 logging_obj=logging,
1286 acompletion=acompletion,
1287 timeout=timeout, # type: ignore
1288 client=client, # pass AsyncAzureOpenAI, AzureOpenAI client
1289 )
1291 if optional_params.get("stream", False):
1292 ## LOGGING
File ~/code/testing/.conda/lib/python3.11/site-packages/litellm/llms/azure/azure.py:545, in AzureChatCompletion.completion(self, model, messages, model_response, api_key, api_base, api_version, api_type, azure_ad_token, azure_ad_token_provider, dynamic_params, print_verbose, timeout, logging_obj, optional_params, litellm_params, logger_fn, acompletion, headers, client)
544 error_headers = getattr(error_response, "headers", None)
--> 545 raise AzureOpenAIError(
546 status_code=status_code, message=str(e), headers=error_headers
547 )
AzureOpenAIError: Error code: 400 - {'error': {'message': 'Unrecognized request argument supplied: max_completion_tokens', 'type': 'invalid_request_error', 'param': None, 'code': None}}
During handling of the above exception, another exception occurred:
BadRequestError Traceback (most recent call last)
Cell In[6], line 4
1 import litellm
3 # azure call
----> 4 response = litellm.completion(
5 model = "azure/gpt-4o",
6 api_base = "https://XXX.openai.azure.com/",
7 api_version = "2024-08-01-preview",
8 api_key = "XXX",
9 messages = [{"role": "user", "content": "good morning"}],
10 max_completion_tokens=100,
11 )
13 print(response)
File ~/code/testing/.conda/lib/python3.11/site-packages/litellm/utils.py:1190, in client.<locals>.wrapper(*args, **kwargs)
1186 if logging_obj:
1187 logging_obj.failure_handler(
1188 e, traceback_exception, start_time, end_time
1189 ) # DO NOT MAKE THREADED - router retry fallback relies on this!
-> 1190 raise e
File ~/code/testing/.conda/lib/python3.11/site-packages/litellm/utils.py:1068, in client.<locals>.wrapper(*args, **kwargs)
1066 print_verbose(f"Error while checking max token limit: {str(e)}")
1067 # MODEL CALL
-> 1068 result = original_function(*args, **kwargs)
1069 end_time = datetime.datetime.now()
1070 if "stream" in kwargs and kwargs["stream"] is True:
File ~/code/testing/.conda/lib/python3.11/site-packages/litellm/main.py:3085, in completion(model, messages, timeout, temperature, top_p, n, stream, stream_options, stop, max_completion_tokens, max_tokens, modalities, prediction, audio, presence_penalty, frequency_penalty, logit_bias, user, reasoning_effort, response_format, seed, tools, tool_choice, logprobs, top_logprobs, parallel_tool_calls, deployment_id, extra_headers, functions, function_call, base_url, api_version, api_key, model_list, **kwargs)
3082 return response
3083 except Exception as e:
3084 ## Map to OpenAI Exception
-> 3085 raise exception_type(
3086 model=model,
3087 custom_llm_provider=custom_llm_provider,
3088 original_exception=e,
3089 completion_kwargs=args,
3090 extra_kwargs=kwargs,
3091 )
File ~/code/testing/.conda/lib/python3.11/site-packages/litellm/litellm_core_utils/exception_mapping_utils.py:2202, in exception_type(model, original_exception, custom_llm_provider, completion_kwargs, extra_kwargs)
2200 if exception_mapping_worked:
2201 setattr(e, "litellm_response_headers", litellm_response_headers)
-> 2202 raise e
2203 else:
2204 for error_type in litellm.LITELLM_EXCEPTION_TYPES:
File ~/code/testing/.conda/lib/python3.11/site-packages/litellm/litellm_core_utils/exception_mapping_utils.py:1949, in exception_type(model, original_exception, custom_llm_provider, completion_kwargs, extra_kwargs)
1947 elif "invalid_request_error" in error_str:
1948 exception_mapping_worked = True
-> 1949 raise BadRequestError(
1950 message=f"AzureException BadRequestError - {message}",
1951 llm_provider="azure",
1952 model=model,
1953 litellm_debug_info=extra_information,
1954 response=getattr(original_exception, "response", None),
1955 )
1956 elif (
1957 "The api_key client option must be set either by passing api_key to the client or by setting"
1958 in error_str
1959 ):
1960 exception_mapping_worked = True
Are you a ML Ops Team?
No
What LiteLLM version are you on ?
v1.61.17
Twitter / LinkedIn details
https://www.linkedin.com/in/panahi
Is the ask to map max_completion_tokens to max_tokens, like we do for the older OpenAI models? @c3-ali
@krrishdholakia yes.
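For illustration, the requested mapping could look roughly like the sketch below (not litellm's actual Azure transformation code; the function name is hypothetical):

def map_azure_optional_params(optional_params: dict) -> dict:
    # Illustrative sketch: rename max_completion_tokens to max_tokens for
    # Azure gpt-4o deployments that reject the newer parameter.
    if "max_completion_tokens" in optional_params:
        optional_params["max_tokens"] = optional_params.pop("max_completion_tokens")
    return optional_params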
added to roadmap
I couldn't replicate this, and the PR https://github.com/BerriAI/litellm/pull/6376 is merged. I take it this is solved, @c3-ali?
Hmm, I don't see code that fixes this on main - https://github.com/BerriAI/litellm/blob/3875df666b8a11819eda86fdc35d582be4bd8db6/litellm/llms/azure/chat/gpt_transformation.py#L4
@krrishdholakia
I'm not sure exactly how, but I suspect Azure added support for max_completion_tokens.
I was testing with this https://github.com/CakeCrusher/litellm/commit/66167011f35b88cf407fae4169d00ac567b8b1a8#diff-4b41e1a3e65a7327300b012d7e6c49acb8daf8ae0412f76a64a05d5b999b2783 and it is working now.
At the end of the day it was an inconsistency on their end.
@CakeCrusher yes, this is an inconsistency on Azure's side, and I randomly (like others) get the above issue for our deployment. I created a new Azure deployment and no longer experience this issue. I'm ok with closing this issue.