outlines Adds option for JSON schema optimization

Pydantic's .model_json_schema() and outlines' get_schema_from_signature don't actually make optional fields/arguments optional in the JSON schema. This forces the model to output the keys even when the values are null, which slows down inference more the larger the schema is and the more optional fields there are.

For example, for this Pydantic class:

from typing import Optional

from pydantic import BaseModel

class Test(BaseModel):
    field_a: int
    field_b: Optional[int]
    field_c: None

.model_json_schema() builds this schema:

{
    "properties": {
        "field_a": {"title": "Field A", "type": "integer"},
        "field_b": {
            "anyOf": [{"type": "integer"}, {"type": "null"}],
            "title": "Field B",
        },
        "field_c": {"title": "Field C", "type": "null"},
    },
    "required": ["field_a", "field_b", "field_c"],
    "title": "Test",
    "type": "object",
}

optimize_schema in this PR reduces this to:

{
    "properties": {
        "field_a": {"title": "Field A", "type": "integer"},
        "field_b": {"title": "Field B", "type": "integer"},
    },
    "required": ["field_a"],
    "title": "Test",
    "type": "object",
}

Likewise, get_schema_from_signature converts this function:

def test_add(a: int, b: int | None = None):
    if b is None:
        return a
    return a + b

to this schema:

{
    "properties": {
        "a": {"title": "A", "type": "integer"},
        "b": {
            "anyOf": [{"type": "integer"}, {"type": "null"}],
            "title": "B",
        },
    },
    "required": ["a", "b"],
    "title": "Arguments",
    "type": "object",
}

optimize_schema reduces this to:

{
    "properties": {
        "a": {"title": "A", "type": "integer"},
        "b": {"title": "B", "type": "integer"},
    },
    "required": ["a"],
    "title": "Arguments",
    "type": "object",
}
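To make the transformation concrete, here is a minimal sketch of the kind of rewrite described above. It is not the implementation in this PR: the name optimize_schema_sketch is hypothetical, and it only handles top-level properties (no nested objects, $defs, or $ref). It drops fields that can only be null, strips the null variant from anyOf, and removes the affected keys from required.

```python
from copy import deepcopy

NULL = {"type": "null"}


def optimize_schema_sketch(schema: dict) -> dict:
    """Hypothetical helper: make nullable top-level properties optional."""
    result = deepcopy(schema)
    properties = result.get("properties", {})
    required = list(result.get("required", []))

    for name in list(properties):
        prop = properties[name]
        variants = prop.get("anyOf")
        non_null = [v for v in variants if v != NULL] if variants else None

        if prop.get("type") == "null" or non_null == []:
            # The field can only ever be null, so it carries no information.
            del properties[name]
        elif variants and NULL in variants:
            rest = {k: v for k, v in prop.items() if k != "anyOf"}
            if len(non_null) == 1:
                # One variant left: merge it directly into the property.
                properties[name] = {**rest, **non_null[0]}
            else:
                properties[name] = {**rest, "anyOf": non_null}
        else:
            continue  # Not nullable: leave the property and `required` alone.

        if name in required:
            required.remove(name)

    if required:
        result["required"] = required
    else:
        result.pop("required", None)
    return result


schema = {
    "properties": {
        "field_a": {"title": "Field A", "type": "integer"},
        "field_b": {
            "anyOf": [{"type": "integer"}, {"type": "null"}],
            "title": "Field B",
        },
        "field_c": {"title": "Field C", "type": "null"},
    },
    "required": ["field_a", "field_b", "field_c"],
    "title": "Test",
    "type": "object",
}
optimized = optimize_schema_sketch(schema)
```

Applied to the Test schema above, the sketch keeps field_a as the only required key, reduces field_b to a plain integer, and drops field_c. The input schema is left unmodified.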

I decided to add a flag, enable_schema_optimization, and set it to False by default because it further restricts the set of outputs the model can produce (the support of the output distribution) and thus might break models fine-tuned without this setting.

May 04 '24 06:05 leloykun

There seems to be another potential bug here. Given the function

def test_add(a: int, b: int | None = None):
    if b is None:
        return a
    return a + b

the function get_schema_from_signature outputs "title": "Arguments" both when optimize_schema is used and when it is not used. It seems like the output should have "title": "test_add".

Perhaps I should raise this in a separate issue.

May 06 '24 21:05 eitanturok

@eitanturok I don't think this is a bug cuz we don't use the title field when building the FSM (& when generating outputs)

Can you provide an example where this breaks something?

May 06 '24 22:05 leloykun

@leloykun

I'm using outlines to make my models better at function calling and this current setup causes me some issues.

At a high level, I take the generated schema and use it 1) for the system prompt and 2) to create a regex. I put this schema into the system prompt so the model knows which functions it has access to. But if the JSON schema does NOT contain the function's name, the model won't know how to call it.

Here is an example:


import json

from outlines import generate
from outlines.fsm.json_schema import build_regex_from_schema, get_schema_from_signature

def test_add(a: int, b: int | None = None):
    if b is None:
        return a
    return a + b

# `model` is assumed to be an outlines model that has already been loaded.
schema_json = get_schema_from_signature(test_add)
schema_str = json.dumps(schema_json).strip()
whitespace_pattern = None  # fall back to the default whitespace handling
schema_regex = build_regex_from_schema(schema_str, whitespace_pattern)

system_prompt = (
    "You are an expert at function calling and have access to the following tools: "
    f"{schema_str}. "
    "Please call one of these functions."
)

generator = generate.regex(model, schema_regex)

If the function name is not included in the schema generated by get_schema_from_signature, the model has no way of knowing which function the schema describes.

May 07 '24 17:05 eitanturok

@eitanturok, we should raise this as a separate issue

I'm thinking of replacing this line in get_schema_from_signature

model = create_model("Arguments", **arguments)

with

model = create_model(fn.__name__, **arguments)

or

try:
    fn_name = fn.__name__
except AttributeError:
    fn_name = "Arguments"
model = create_model(fn_name, **arguments)

just to be safer

what do you think?

May 07 '24 17:05 leloykun

I was thinking the same thing. I'll raise this as a separate issue.

May 08 '24 01:05 eitanturok

Raised the issue in #878. Future discussions should take place there.

May 08 '24 03:05 eitanturok