outlines-core icon indicating copy to clipboard operation
outlines-core copied to clipboard

Make sure we correctly support `additionalProperties` set to `True`

Open rlouf opened this issue 11 months ago • 5 comments

Setting the additionalProperties keyword to True should allow generation of properties whose name is not listed in properties, but it is currently not the case:

import json
import re
from outlines_core.fsm.json_schema import build_regex_from_schema

schema = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "properties": {
      "a": {
        "type": "string",
      }
    },
    "required": [
      "marathonJson"
    ],
    "additionalProperties": True
}

data = json.dumps({
    "a": "aaa",
})


regex_str = build_regex_from_schema(json.dumps(schema))
print(regex_str)
# '\\{([ ]?"a"[ ]?:[ ]?"([^"\\\\\\x00-\\x1F\\x7F-\\x9F]|\\\\["\\\\])*")?[ ]?\\}'

print(re.match(regex_str, data)
# <re.Match object; span=(0, 12), match='{"a": "aaa"}'>

But:

data = json.dumps({
    "a": "aaa",
    "id": "1"
})

print(re.match(regex_str, data)
# None

The immediate solution is to return an error when additionalProperties is set to True, to avoid surprises. Once that's done we can open an issue to add it to the implementation.

rlouf avatar Jan 28 '25 11:01 rlouf

Okay, so I ran a lot of tests, read the JSON Schema specification, and also looked into this related issue: https://github.com/dottxt-ai/outlines-core/issues/149

AFAIU the spec states that having "additionalProperties": true is redundant since it's the default behavior. A quick fix could have been to simply forbid "additionalProperties": false, adjust a few tests, and be done with it!

However, the issue that @rlouf pointed out is actually different. Even though the current implementation looks correct, there's still a problem that can be demonstrated with this minimal example:

import json
import re
from outlines_core.json_schema import build_regex_from_schema

schema = {
    "type": "object",
    "properties": {
        "a": {
            "type": "string",
        }
        # No need to even mention `additionalProperties`
    },
}
regex_str = build_regex_from_schema(json.dumps(schema))

data = json.dumps({
    "a": "aaa",
})
print(re.match(regex_str, data))

data = json.dumps({
    "a": "aaa",
    "id": "1"
})
print(re.match(regex_str, data))  # Shouldn't be `None`

The issue with this code is that it doesn't even trigger the line of code mentioned... IIUC the library default behavior is wrong, and that's what need to be changed, not just tweaking additionalProperties-related code!

yvan-sraka avatar Feb 19 '25 14:02 yvan-sraka

I think that it makes sense for generation to set the default value to False (as long as it's documented). OpenAI forces you to set it to False manually, but I think it does match people's intuition when it comes to generation.

rlouf avatar Feb 19 '25 17:02 rlouf

@yvan-sraka So to_regex is an entry point, then it branches out depending on the key. If both type & properties are defined, only properties will be executed (because it's a first match, type is almost the last): https://github.com/dottxt-ai/outlines-core/blob/main/src/json_schema/parsing.rs#L54-L62

And additionalProperties logic belongs to parse_type branch only, while parse_properties doesn't have additionalProperties logic at all.

In other words, if you change @rlouf example by dropping properties it will work:

schema = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    },
    "required": [
      "marathonJson"
    ],
    "additionalProperties": True
}

So that's what this bug is about, parse_properties needs to include this additionalProperties logic from parse_type => parse_type_string => parse_object_type

torymur avatar Feb 19 '25 17:02 torymur

I think that it makes sense for generation to set the default value to False (as long as it's documented). OpenAI forces you to set it to False manually, but I think it does match people's intuition when it comes to generation.

I would also prefer an error that forces the user to set it to False explicitly; otherwise, it’s a bit unclear and counterintuitive! If we choose a different default than the JSON Schema specification, where should we document it to ensure users don’t miss it?

yvan-sraka avatar Feb 19 '25 17:02 yvan-sraka

"Unintuitive" if you know the JSON specification inside and out. We are not using the spec for what it was intended to do (parsing), so we can take small liberties when a default of the original spec doesn't match users' common use cases or mental models.

I am yet to meet anyone who complained that outlines does not generate properties that they did not specify in the schema. I bet most people's mental model correspond to Zod and Pydantic's "if it's not explicitly defined it's not allowed".

rlouf avatar Feb 19 '25 17:02 rlouf