openai-python Apply more fixes for Pydantic schema incompatibilities with OpenAI structured outputs

Confirm this is a feature request for the Python library and not the underlying OpenAI API.

[X] This is a feature request for the Python library

Describe the feature or improvement you're requesting

I noticed that you guys are doing some manipulation of Pydantic's generated schema to ensure compatibility with the API's schema validation. I found a few more instances that can be addressed:

Issues:

optional fields with pydantic defaults generate an unsupported 'default' field in the schema
date fields generate a format='date-time' field in the schema which is not supported

The test case below builds on your to_strict_json_schema function and removes these problematic fields with the remove_property_from_schema function:

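import json
import logging
from datetime import datetime
from typing import List, Optional

from openai import OpenAI
# Internal helper of openai-python; this import path is private and may change between releases.
from openai.lib._pydantic import to_strict_json_schema
from pydantic import BaseModel, Field

logger = logging.getLogger(__name__)

class Publisher(BaseModel):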
    name: str = Field(description="The name of the publisher")
    url: Optional[str] = Field(None, description="The URL of the publisher's website")
    class Config:
        json_schema_extra = {
            "additionalProperties": False
        }

class Article(BaseModel):
    title: str = Field(description="The title of the news article")
    published: Optional[datetime] = Field(None, description="The date the article was published. Use ISO 8601 to format this value.")
    publisher: Optional[Publisher] = Field(None, description="The publisher of the article")
    class Config:
        json_schema_extra = {
            "additionalProperties": False
        }
        
class NewsArticles(BaseModel):
    query: str = Field(description="The query used to search for news articles")
    articles: List[Article] = Field(description="The list of news articles returned by the query")
    class Config:
        json_schema_extra = {
            "additionalProperties": False
        }
    

def test_schema_compatible():
    client = OpenAI()
    
    # build on the internals that the openai client uses to clean up the pydantic schema for the openai API
    schema = to_strict_json_schema(NewsArticles)
    
    # optional fields with pydantic defaults generate an unsupported 'default' field in the schema
    remove_property_from_schema(schema, "default")
    # date fields generate a format='date-time' field in the schema which is not supported
    remove_property_from_schema(schema, "format")
        
    logger.info("Generated Schema: %s", json.dumps(schema, indent=2))
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": "What were the top headlines in the US for January 6th, 2021?",
            }
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "schema": schema,
                "name": "NewsArticles",
                "strict": True,
            }
        }
    )
    result = NewsArticles.model_validate_json(completion.choices[0].message.content)
    assert result is not None



def remove_property_from_schema(schema: dict, property_name: str):
    if 'properties' in schema:
        for field_name, field in schema['properties'].items():
            if 'properties' in field:
                remove_property_from_schema(field, property_name)
            if 'anyOf' in field: 
                for any_of in field['anyOf']:
                    any_of.pop(property_name, None)
            field.pop(property_name, None)
    if '$defs' in schema:                    
        for definition_name, definition in schema['$defs'].items():
            remove_property_from_schema(definition, property_name)
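
As far as I can tell, remove_property_from_schema above only looks at direct properties, one level of anyOf, and $defs, so a keyword nested under array items would be missed. Here is a rough sketch of a more general walker. The name strip_keyword and the sample schema are made up for illustration and are not part of openai-python; the set of keywords it descends into is my assumption about what Pydantic typically emits.

```python
from typing import Any


def strip_keyword(node: Any, keyword: str) -> None:
    """Recursively remove `keyword` from every subschema of `node`, in place."""
    if not isinstance(node, dict):
        return  # e.g. "additionalProperties": False, or a missing key
    node.pop(keyword, None)
    # Maps of name -> subschema: recurse into the values only, so a model
    # field that happens to be called "format" or "default" is left alone.
    for container in ("properties", "$defs"):
        for sub in node.get(container, {}).values():
            strip_keyword(sub, keyword)
    # Lists of subschemas.
    for combinator in ("anyOf", "allOf", "oneOf"):
        for sub in node.get(combinator, []):
            strip_keyword(sub, keyword)
    # Single nested subschemas.
    for single in ("items", "additionalProperties"):
        strip_keyword(node.get(single), keyword)


# Hand-written sample schema (hypothetical, shaped like Pydantic output).
schema = {
    "type": "object",
    "properties": {
        "format": {"type": "string"},  # a field literally named "format"
        "history": {
            "type": "array",
            "items": {
                "anyOf": [
                    {"type": "string", "format": "date-time"},
                    {"type": "null"},
                ],
                "default": None,
            },
        },
    },
}

strip_keyword(schema, "format")
strip_keyword(schema, "default")
```

After the two calls, the nested "format": "date-time" and "default" entries are gone, while the model field named "format" is still present under properties.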

Additional context

No response

Aug 17 '24 17:08 mcantrell

@RobertCraigie Thanks for fixing one of the issues! Do you have an ETA on the fix for the "format" issue?

Aug 26 '24 19:08 micahstairs

There are currently no plans to automatically remove "format": "date-time" as it breaks .parse()'s promise that it will either generate valid data or refuse to generate any data.

We're considering opt-in flags to remove certain features that the API doesn't support yet but I don't have an ETA to share unfortunately.

Sep 24 '24 09:09 RobertCraigie

Currently, typical users of openai-python's "structured output" feature often have to maintain parallel sets of Pydantic classes: one for their own internal use (such as a proprietary API they provide) and one for interfacing with OpenAI that avoids the problematic Pydantic features. Other examples of Pydantic features that cause problems with OpenAI are: myfield: int = Field(ge=…, le=…), myfield: bool = False.

I understand that fields with some of these features (such as min/max values) can't easily be degraded to become OpenAI compatible, but for those that can, it would be fantastic to have the opt-in flag described by @RobertCraigie to simply remove them from the schema automatically. Even the "format": "date-time" case could be addressed by using a library such as dateutil.parser to parse most date formats.
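
To illustrate the lenient-parsing idea for the "format": "date-time" case, here is a minimal stdlib-only sketch. It stands in for dateutil.parser rather than using it, and the function name parse_lenient_datetime and the list of fallback formats are my own assumptions, not anything openai-python provides.

```python
from datetime import datetime

# Assumed fallback formats for illustration only; extend as needed.
FALLBACK_FORMATS = ("%Y-%m-%d %H:%M", "%m/%d/%Y", "%B %d, %Y", "%b %d, %Y")


def parse_lenient_datetime(value: str) -> datetime:
    """Parse ISO 8601 first, then try a few common human-readable formats."""
    text = value.strip()
    # Older Python versions reject a trailing "Z" in fromisoformat.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    try:
        return datetime.fromisoformat(text)
    except ValueError:
        pass
    for fmt in FALLBACK_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")


print(parse_lenient_datetime("2021-01-06T12:30:00Z"))
print(parse_lenient_datetime("January 6, 2021"))
```

In a Pydantic model, something like this could presumably be hooked in through a before-mode field validator, so the model would still either produce a valid datetime or fail validation.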

Anything that helps avoid having to maintain redundant model classes would be a huge win.

Nov 15 '24 20:11 jmehnle

Somewhat related, how feasible would it be to convert int fields with min/max restrictions and a sufficiently short range of values into enum validations, which OpenAI apparently supports? I have a bunch of fields that are "1–5" kind of ranges, and they could easily be expressed as "enum": [1, 2, 3, 4, 5].
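
Here is a rough sketch of what I have in mind, operating on the generated schema dict. The function name ranges_to_enums and the cutoff of 16 values are assumptions of mine; it only handles inclusive bounds ("minimum"/"maximum", which is what ge/le produce in Pydantic's schema) and leaves everything else untouched.

```python
from typing import Any

MAX_ENUM_SIZE = 16  # assumed cutoff for a "sufficiently short" range


def ranges_to_enums(node: Any, max_size: int = MAX_ENUM_SIZE) -> None:
    """Rewrite small bounded integer subschemas into enums, in place."""
    if not isinstance(node, dict):
        return
    lo, hi = node.get("minimum"), node.get("maximum")
    if (
        node.get("type") == "integer"
        and "enum" not in node
        and isinstance(lo, int)
        and isinstance(hi, int)
        and 0 <= hi - lo < max_size
    ):
        # Replace the unsupported bounds with an explicit list of values.
        node["enum"] = list(range(lo, hi + 1))
        del node["minimum"], node["maximum"]
    # Descend into named subschemas, combinators, and array items.
    for container in ("properties", "$defs"):
        for sub in node.get(container, {}).values():
            ranges_to_enums(sub, max_size)
    for combinator in ("anyOf", "allOf", "oneOf"):
        for sub in node.get(combinator, []):
            ranges_to_enums(sub, max_size)
    ranges_to_enums(node.get("items"), max_size)


# Hand-written sample schema (hypothetical).
schema = {
    "type": "object",
    "properties": {
        "rating": {"type": "integer", "minimum": 1, "maximum": 5},
        "views": {"type": "integer", "minimum": 0},
    },
}

ranges_to_enums(schema)
```

Here "rating" ends up as an integer enum of 1 through 5, while "views" is left as-is because it has no upper bound. Since the Pydantic model itself would keep its ge/le constraints, the parsed response would still be validated against the original range.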

Nov 15 '24 20:11 jmehnle