
[Bug]: max_completion_tokens is not supported for some gpt-4o deployments on Azure

Open c3-ali opened this issue 10 months ago • 6 comments

What happened?

import litellm

response = litellm.completion(
    model = "azure/gpt-4o",
    api_base = "https://XXX.openai.azure.com/",
    api_version = "2024-08-01-preview",
    api_key = "XXX",
    messages = [{"role": "user", "content": "hi"}],
    max_completion_tokens=100,
)

print(response)

The above returns:

BadRequestError: litellm.BadRequestError: AzureException BadRequestError - Error code: 400 - {'error': {'message': 'Unrecognized request argument supplied: max_completion_tokens', 'type': 'invalid_request_error', 'param': None, 'code': None}}

But when I comment out max_completion_tokens, I get the correct result.

I verified that my deployment does not support this parameter, even though the Azure documentation says it should:

❯ curl -X POST 'https://XXX.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-08-01-preview' \
     -H "api-key: $API_KEY" \
     -H "content-type: application/json" \
     --data '{"messages": [{"role": "user", "content": "Ping!"}], "temperature": 0.0, "max_completion_tokens": 100}'
{
  "error": {
    "message": "Unrecognized request argument supplied: max_completion_tokens",
    "type": "invalid_request_error",
    "param": null,
    "code": null
  }
}

I tried several other api-version values, but none helped.
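For now, a client-side fallback like the sketch below would work around it (my own rough sketch, not something litellm does automatically; the endpoint and key are the placeholders from the repro above):

import litellm

request = dict(
    model="azure/gpt-4o",
    api_base="https://XXX.openai.azure.com/",
    api_version="2024-08-01-preview",
    api_key="XXX",
    messages=[{"role": "user", "content": "hi"}],
)

try:
    # Newer parameter; some gpt-4o deployments reject it.
    response = litellm.completion(**request, max_completion_tokens=100)
except litellm.BadRequestError:
    # Retry with the older parameter the deployment still accepts.
    response = litellm.completion(**request, max_tokens=100)

print(response)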

A similar issue was reported in this Azure OpenAI forum post:

I'd like to add here that we have an openai deployment which intermittently errors out when using max_completion_tokens. This is when using a gpt-4o model on API version 2024-10-21. ... This is pretty strange because repeatedly calling the same endpoint results in successes about 90% of the time. The other 10% of the time we get this strange error. Not sure if this is a known issue. Turns out there was a scaling issue on the MS side, and this has since been fixed.

I believe the root cause is on Azure's side, since their documentation shows that this parameter should be supported and does not say it is limited to specific models (the o-series).

Related PR:

Add support for max_completion_tokens in Azure OpenAI #6376

Relevant log output

Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.

---------------------------------------------------------------------------
BadRequestError                           Traceback (most recent call last)
File ~/code/testing/.conda/lib/python3.11/site-packages/litellm/llms/azure/azure.py:516, in AzureChatCompletion.completion(self, model, messages, model_response, api_key, api_base, api_version, api_type, azure_ad_token, azure_ad_token_provider, dynamic_params, print_verbose, timeout, logging_obj, optional_params, litellm_params, logger_fn, acompletion, headers, client)
    511     raise AzureOpenAIError(
    512         status_code=500,
    513         message="azure_client is not an instance of AzureOpenAI",
    514     )
--> 516 headers, response = self.make_sync_azure_openai_chat_completion_request(
    517     azure_client=azure_client, data=data, timeout=timeout
    518 )
    519 stringified_response = response.model_dump()

File ~/code/testing/.conda/lib/python3.11/site-packages/litellm/llms/azure/azure.py:295, in AzureChatCompletion.make_sync_azure_openai_chat_completion_request(self, azure_client, data, timeout)
    294 except Exception as e:
--> 295     raise e

File ~/code/testing/.conda/lib/python3.11/site-packages/litellm/llms/azure/azure.py:287, in AzureChatCompletion.make_sync_azure_openai_chat_completion_request(self, azure_client, data, timeout)
    286 try:
--> 287     raw_response = azure_client.chat.completions.with_raw_response.create(
    288         **data, timeout=timeout
    289     )
    291     headers = dict(raw_response.headers)

File ~/code/testing/.conda/lib/python3.11/site-packages/openai/_legacy_response.py:364, in to_raw_response_wrapper.<locals>.wrapped(*args, **kwargs)
    362 kwargs["extra_headers"] = extra_headers
--> 364 return cast(LegacyAPIResponse[R], func(*args, **kwargs))

File ~/code/testing/.conda/lib/python3.11/site-packages/openai/_utils/_utils.py:279, in required_args.<locals>.inner.<locals>.wrapper(*args, **kwargs)
    278     raise TypeError(msg)
--> 279 return func(*args, **kwargs)

File ~/code/testing/.conda/lib/python3.11/site-packages/openai/resources/chat/completions/completions.py:879, in Completions.create(self, messages, model, audio, frequency_penalty, function_call, functions, logit_bias, logprobs, max_completion_tokens, max_tokens, metadata, modalities, n, parallel_tool_calls, prediction, presence_penalty, reasoning_effort, response_format, seed, service_tier, stop, store, stream, stream_options, temperature, tool_choice, tools, top_logprobs, top_p, user, extra_headers, extra_query, extra_body, timeout)
    878 validate_response_format(response_format)
--> 879 return self._post(
    880     "/chat/completions",
    881     body=maybe_transform(
    882         {
    883             "messages": messages,
    884             "model": model,
    885             "audio": audio,
    886             "frequency_penalty": frequency_penalty,
    887             "function_call": function_call,
    888             "functions": functions,
    889             "logit_bias": logit_bias,
    890             "logprobs": logprobs,
    891             "max_completion_tokens": max_completion_tokens,
    892             "max_tokens": max_tokens,
    893             "metadata": metadata,
    894             "modalities": modalities,
    895             "n": n,
    896             "parallel_tool_calls": parallel_tool_calls,
    897             "prediction": prediction,
    898             "presence_penalty": presence_penalty,
    899             "reasoning_effort": reasoning_effort,
    900             "response_format": response_format,
    901             "seed": seed,
    902             "service_tier": service_tier,
    903             "stop": stop,
    904             "store": store,
    905             "stream": stream,
    906             "stream_options": stream_options,
    907             "temperature": temperature,
    908             "tool_choice": tool_choice,
    909             "tools": tools,
    910             "top_logprobs": top_logprobs,
    911             "top_p": top_p,
    912             "user": user,
    913         },
    914         completion_create_params.CompletionCreateParams,
    915     ),
    916     options=make_request_options(
    917         extra_headers=extra_headers, extra_query=extra_query, extra_body=extra_body, timeout=timeout
    918     ),
    919     cast_to=ChatCompletion,
    920     stream=stream or False,
    921     stream_cls=Stream[ChatCompletionChunk],
    922 )

File ~/code/testing/.conda/lib/python3.11/site-packages/openai/_base_client.py:1290, in SyncAPIClient.post(self, path, cast_to, body, options, files, stream, stream_cls)
   1287 opts = FinalRequestOptions.construct(
   1288     method="post", url=path, json_data=body, files=to_httpx_files(files), **options
   1289 )
-> 1290 return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))

File ~/code/testing/.conda/lib/python3.11/site-packages/openai/_base_client.py:967, in SyncAPIClient.request(self, cast_to, options, remaining_retries, stream, stream_cls)
    965     retries_taken = 0
--> 967 return self._request(
    968     cast_to=cast_to,
    969     options=options,
    970     stream=stream,
    971     stream_cls=stream_cls,
    972     retries_taken=retries_taken,
    973 )

File ~/code/testing/.conda/lib/python3.11/site-packages/openai/_base_client.py:1071, in SyncAPIClient._request(self, cast_to, options, retries_taken, stream, stream_cls)
   1070     log.debug("Re-raising status error")
-> 1071     raise self._make_status_error_from_response(err.response) from None
   1073 return self._process_response(
   1074     cast_to=cast_to,
   1075     options=options,
   (...)
   1079     retries_taken=retries_taken,
   1080 )

BadRequestError: Error code: 400 - {'error': {'message': 'Unrecognized request argument supplied: max_completion_tokens', 'type': 'invalid_request_error', 'param': None, 'code': None}}

During handling of the above exception, another exception occurred:

AzureOpenAIError                          Traceback (most recent call last)
File ~/code/testing/.conda/lib/python3.11/site-packages/litellm/main.py:1269, in completion(model, messages, timeout, temperature, top_p, n, stream, stream_options, stop, max_completion_tokens, max_tokens, modalities, prediction, audio, presence_penalty, frequency_penalty, logit_bias, user, reasoning_effort, response_format, seed, tools, tool_choice, logprobs, top_logprobs, parallel_tool_calls, deployment_id, extra_headers, functions, function_call, base_url, api_version, api_key, model_list, **kwargs)
   1268     ## COMPLETION CALL
-> 1269     response = azure_chat_completions.completion(
   1270         model=model,
   1271         messages=messages,
   1272         headers=headers,
   1273         api_key=api_key,
   1274         api_base=api_base,
   1275         api_version=api_version,
   1276         api_type=api_type,
   1277         dynamic_params=dynamic_params,
   1278         azure_ad_token=azure_ad_token,
   1279         azure_ad_token_provider=azure_ad_token_provider,
   1280         model_response=model_response,
   1281         print_verbose=print_verbose,
   1282         optional_params=optional_params,
   1283         litellm_params=litellm_params,
   1284         logger_fn=logger_fn,
   1285         logging_obj=logging,
   1286         acompletion=acompletion,
   1287         timeout=timeout,  # type: ignore
   1288         client=client,  # pass AsyncAzureOpenAI, AzureOpenAI client
   1289     )
   1291 if optional_params.get("stream", False):
   1292     ## LOGGING

File ~/code/testing/.conda/lib/python3.11/site-packages/litellm/llms/azure/azure.py:545, in AzureChatCompletion.completion(self, model, messages, model_response, api_key, api_base, api_version, api_type, azure_ad_token, azure_ad_token_provider, dynamic_params, print_verbose, timeout, logging_obj, optional_params, litellm_params, logger_fn, acompletion, headers, client)
    544     error_headers = getattr(error_response, "headers", None)
--> 545 raise AzureOpenAIError(
    546     status_code=status_code, message=str(e), headers=error_headers
    547 )

AzureOpenAIError: Error code: 400 - {'error': {'message': 'Unrecognized request argument supplied: max_completion_tokens', 'type': 'invalid_request_error', 'param': None, 'code': None}}

During handling of the above exception, another exception occurred:

BadRequestError                           Traceback (most recent call last)
Cell In[6], line 4
      1 import litellm
      3 # azure call
----> 4 response = litellm.completion(
      5     model = "azure/gpt-4o",
      6     api_base = "https://XXX.openai.azure.com/",
      7     api_version = "2024-08-01-preview",
      8     api_key = "XXX",
      9     messages = [{"role": "user", "content": "good morning"}],
     10     max_completion_tokens=100,
     11 )
     13 print(response)

File ~/code/testing/.conda/lib/python3.11/site-packages/litellm/utils.py:1190, in client.<locals>.wrapper(*args, **kwargs)
   1186 if logging_obj:
   1187     logging_obj.failure_handler(
   1188         e, traceback_exception, start_time, end_time
   1189     )  # DO NOT MAKE THREADED - router retry fallback relies on this!
-> 1190 raise e

File ~/code/testing/.conda/lib/python3.11/site-packages/litellm/utils.py:1068, in client.<locals>.wrapper(*args, **kwargs)
   1066         print_verbose(f"Error while checking max token limit: {str(e)}")
   1067 # MODEL CALL
-> 1068 result = original_function(*args, **kwargs)
   1069 end_time = datetime.datetime.now()
   1070 if "stream" in kwargs and kwargs["stream"] is True:

File ~/code/testing/.conda/lib/python3.11/site-packages/litellm/main.py:3085, in completion(model, messages, timeout, temperature, top_p, n, stream, stream_options, stop, max_completion_tokens, max_tokens, modalities, prediction, audio, presence_penalty, frequency_penalty, logit_bias, user, reasoning_effort, response_format, seed, tools, tool_choice, logprobs, top_logprobs, parallel_tool_calls, deployment_id, extra_headers, functions, function_call, base_url, api_version, api_key, model_list, **kwargs)
   3082     return response
   3083 except Exception as e:
   3084     ## Map to OpenAI Exception
-> 3085     raise exception_type(
   3086         model=model,
   3087         custom_llm_provider=custom_llm_provider,
   3088         original_exception=e,
   3089         completion_kwargs=args,
   3090         extra_kwargs=kwargs,
   3091     )

File ~/code/testing/.conda/lib/python3.11/site-packages/litellm/litellm_core_utils/exception_mapping_utils.py:2202, in exception_type(model, original_exception, custom_llm_provider, completion_kwargs, extra_kwargs)
   2200 if exception_mapping_worked:
   2201     setattr(e, "litellm_response_headers", litellm_response_headers)
-> 2202     raise e
   2203 else:
   2204     for error_type in litellm.LITELLM_EXCEPTION_TYPES:

File ~/code/testing/.conda/lib/python3.11/site-packages/litellm/litellm_core_utils/exception_mapping_utils.py:1949, in exception_type(model, original_exception, custom_llm_provider, completion_kwargs, extra_kwargs)
   1947 elif "invalid_request_error" in error_str:
   1948     exception_mapping_worked = True
-> 1949     raise BadRequestError(
   1950         message=f"AzureException BadRequestError - {message}",
   1951         llm_provider="azure",
   1952         model=model,
   1953         litellm_debug_info=extra_information,
   1954         response=getattr(original_exception, "response", None),
   1955     )
   1956 elif (
   1957     "The api_key client option must be set either by passing api_key to the client or by setting"
   1958     in error_str
   1959 ):
   1960     exception_mapping_worked = True

Are you a ML Ops Team?

No

What LiteLLM version are you on?

v1.61.17

Twitter / LinkedIn details

https://www.linkedin.com/in/panahi

c3-ali avatar Feb 26 '25 22:02 c3-ali

Is the ask to map max_completion_tokens to max_tokens, like we do for the older OpenAI models? @c3-ali

krrishdholakia avatar Feb 27 '25 07:02 krrishdholakia

@krrishdholakia yes.
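Roughly this behavior (an illustrative sketch of the requested mapping, not litellm's actual transformation code):

# Illustration only: if a deployment rejects max_completion_tokens,
# translate it to the older max_tokens parameter before sending.
def map_params_for_old_azure_deployment(params: dict) -> dict:
    mapped = dict(params)
    if "max_completion_tokens" in mapped:
        # Prefer an explicitly supplied max_tokens if both are present.
        mapped.setdefault("max_tokens", mapped.pop("max_completion_tokens"))
    return mapped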

c3-ali avatar Feb 27 '25 17:02 c3-ali

added to roadmap

ishaan-jaff avatar Mar 10 '25 15:03 ishaan-jaff

I couldn't replicate this, and PR https://github.com/BerriAI/litellm/pull/6376 is merged. I take it this is solved, @c3-ali?

CakeCrusher avatar Mar 14 '25 09:03 CakeCrusher

Hmm, I don't see code that fixes this on main - https://github.com/BerriAI/litellm/blob/3875df666b8a11819eda86fdc35d582be4bd8db6/litellm/llms/azure/chat/gpt_transformation.py#L4
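For anyone checking locally, get_supported_openai_params should show what litellm currently maps for Azure (as I understand the helper; worth double-checking the output):

from litellm import get_supported_openai_params

# Lists the OpenAI params litellm maps for azure/gpt-4o,
# which should include max_completion_tokens if the mapping exists.
print(get_supported_openai_params(model="gpt-4o", custom_llm_provider="azure"))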

krrishdholakia avatar Mar 14 '25 14:03 krrishdholakia

@krrishdholakia I'm not sure exactly how, but I suspect Azure added support for max_completion_tokens. I was testing with this https://github.com/CakeCrusher/litellm/commit/66167011f35b88cf407fae4169d00ac567b8b1a8#diff-4b41e1a3e65a7327300b012d7e6c49acb8daf8ae0412f76a64a05d5b999b2783 and it is working now.

At the end of the day it was an inconsistency on their end.

CakeCrusher avatar Mar 15 '25 04:03 CakeCrusher

@CakeCrusher yes, this is an inconsistency on Azure's side, and I randomly (like others) hit the above issue on our deployment. I created a new Azure deployment and no longer experience this issue. I'm OK with closing this issue.

c3-ali avatar Apr 08 '25 04:04 c3-ali