Adds option for JSON schema optimization
Pydantic's .model_json_schema() and get_schema_from_signature don't actually make optional fields/arguments optional in the JSON schema. This forces the model to output those keys even when their values are null, which slows down inference more the larger the schema is and the more optional fields there are.
For example, for this Pydantic class:
class Test(BaseModel):
    field_a: int
    field_b: Optional[int]
    field_c: None
.model_json_schema() builds this schema:
{
    "properties": {
        "field_a": {"title": "Field A", "type": "integer"},
        "field_b": {
            "anyOf": [{"type": "integer"}, {"type": "null"}],
            "title": "Field B",
        },
        "field_c": {"title": "Field C", "type": "null"},
    },
    "required": ["field_a", "field_b", "field_c"],
    "title": "Test",
    "type": "object",
}
optimize_schema in this PR reduces this to:
{
    "properties": {
        "field_a": {"title": "Field A", "type": "integer"},
        "field_b": {"title": "Field B", "type": "integer"},
    },
    "required": ["field_a"],
    "title": "Test",
    "type": "object",
}
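For readers who want a feel for the transformation, here is a minimal sketch of the kind of rewrite described above. It is not the implementation in this PR: the name optimize_schema_sketch is hypothetical, it only handles top-level properties, and it ignores details such as "default" keys and nested objects.

```python
from copy import deepcopy


def optimize_schema_sketch(schema: dict) -> dict:
    """Hypothetical sketch: drop null-only properties, make nullable ones optional."""
    result = deepcopy(schema)  # leave the caller's schema untouched
    required = list(result.get("required", []))
    properties = {}

    for name, spec in result.get("properties", {}).items():
        variants = spec.get("anyOf")
        if variants is not None:
            non_null = [v for v in variants if v.get("type") != "null"]
            nullable = len(non_null) < len(variants)
        else:
            non_null = []
            nullable = spec.get("type") == "null"

        if nullable and name in required:
            # A key that may be null no longer has to be emitted at all.
            required.remove(name)

        if nullable and not non_null:
            # The property can only ever be null, so drop it entirely.
            continue

        if nullable:
            spec = {k: v for k, v in spec.items() if k != "anyOf"}
            if len(non_null) == 1:
                # Collapse the single remaining variant into the property.
                spec.update(non_null[0])
            else:
                spec["anyOf"] = non_null

        properties[name] = spec

    result["properties"] = properties
    result["required"] = required
    return result
```

Applied to the schema generated for Test above, this sketch yields the reduced schema shown: field_c disappears, field_b becomes a plain integer, and only field_a stays required.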
Likewise, get_schema_from_signature converts this function:
def test_add(a: int, b: int | None = None):
    if b is None:
        return a
    return a + b
to this schema:
{
    "properties": {
        "a": {"title": "A", "type": "integer"},
        "b": {
            "anyOf": [{"type": "integer"}, {"type": "null"}],
            "title": "B",
        },
    },
    "required": ["a", "b"],
    "title": "Arguments",
    "type": "object",
}
optimize_schema reduces this to:
{
    "properties": {
        "a": {"title": "A", "type": "integer"},
        "b": {"title": "B", "type": "integer"},
    },
    "required": ["a"],
    "title": "Arguments",
    "type": "object",
}
I decided to add a flag, enable_schema_optimization, and set it to False by default, because the optimization changes the set of outputs the model is allowed to generate (null-valued keys are no longer permitted) and thus might break models fine-tuned without this setting.
There seems to be another potential bug here. Given the function
def test_add(a: int, b: int | None = None):
    if b is None:
        return a
    return a + b
get_schema_from_signature outputs "title": "Arguments" whether or not optimize_schema is used. It seems like the output should have "title": "test_add".
Perhaps I should raise this in a separate issue.
@eitanturok I don't think this is a bug cuz we don't use the title field when building the FSM (& when generating outputs)
Can you provide an example where this breaks something?
@leloykun
I'm using outlines to make my models better at function calling and this current setup causes me some issues.
At a high level, I take the generated schema and use it 1) in the system prompt and 2) to create a regex. I put the schema into the system prompt so the model knows which functions it has access to. But if the JSON schema does NOT contain the function's name, the model won't know how to call it.
Here is an example:
import json

# Import paths assume the outlines module layout at the time of this PR.
from outlines import generate
from outlines.fsm.json_schema import build_regex_from_schema, get_schema_from_signature

def test_add(a: int, b: int | None = None):
    if b is None:
        return a
    return a + b

# `model` and `whitespace_pattern` are assumed to be defined elsewhere.
schema_json = get_schema_from_signature(test_add)
schema_str = json.dumps(schema_json).strip()
schema_regex = build_regex_from_schema(schema_str, whitespace_pattern)
system_prompt = (
    "You are an expert at function calling and have access to the following tools: "
    f"{schema_str}. "
    "Please call one of these functions."
)
generator = generate.regex(model, schema_regex)
If the function name is not included in the schema generated by get_schema_from_signature, the prompt never tells the model which function the arguments belong to, so it cannot name the function it is calling.
@eitanturok, we should raise this as a separate issue
I'm thinking of replacing this line in get_schema_from_signature
model = create_model("Arguments", **arguments)
with
model = create_model(fn.__name__, **arguments)
or
try:
    fn_name = fn.__name__
except AttributeError:
    # Some callables (e.g. functools.partial objects) have no __name__.
    fn_name = "Arguments"
model = create_model(fn_name, **arguments)
just to be safer
what do you think?
I was thinking the same thing. I'll raise this as a separate issue.
Raised the issue in #878. Future discussions should take place there.