Refactor `json_schema.py`, implement JSON Schema to YAML
- Fixes https://github.com/dottxt-ai/outlines/issues/923
Overview
Refactor `json_schema.py` to be more coherent and extensible, then use that extensibility to implement JSON Schema to YAML.
Changes
- Convert the large function `to_regex` into a class `JSONSchemaRegexGenerator` with visitors, which implement JSON Schema rules, and formatters, which implement pattern construction.
- Implement `YAMLRegexGenerator` by subclassing `JSONSchemaRegexGenerator` and overriding some formatters.
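A minimal sketch of the visitor/formatter split described above. Only the two class names come from this PR; the method names (`visit`, `visit_integer`, `visit_boolean`, `visit_object`, `format_object`) and the tiny feature set are assumptions made for illustration and are not the actual implementation.

```python
import re


class JSONSchemaRegexGenerator:
    """Toy stand-in: visitors apply schema rules, formatters build pattern text."""

    def visit(self, schema):
        # Dispatch on the schema's "type" to a visit_<type> method (hypothetical).
        visitor = getattr(self, f"visit_{schema.get('type')}", None)
        if visitor is None:
            # Unsupported features fail loudly instead of yielding a wrong pattern.
            raise NotImplementedError(f"unsupported schema: {schema!r}")
        return visitor(schema)

    def visit_integer(self, schema):
        return r"-?(0|[1-9][0-9]*)"

    def visit_boolean(self, schema):
        return r"(true|false)"

    def visit_object(self, schema):
        # Rule (simplified): every listed property appears, in declaration order.
        members = [(name, self.visit(sub)) for name, sub in schema["properties"].items()]
        return self.format_object(members)

    def format_object(self, members):
        # Formatter: JSON syntax for an object.
        body = ", ".join(f'"{re.escape(name)}": {pat}' for name, pat in members)
        return r"\{" + body + r"\}"


class YAMLRegexGenerator(JSONSchemaRegexGenerator):
    def format_object(self, members):
        # Only the formatter changes; the schema rules are inherited.
        return "\n".join(f"{re.escape(name)}: {pat}" for name, pat in members)


schema = {
    "type": "object",
    "properties": {"age": {"type": "integer"}, "ok": {"type": "boolean"}},
}
json_pattern = JSONSchemaRegexGenerator().visit(schema)
yaml_pattern = YAMLRegexGenerator().visit(schema)
```

Here `json_pattern` fully matches `{"age": 3, "ok": true}` and `yaml_pattern` fully matches the two-line text `age: 3` / `ok: true`, while the rule logic in the visitors is shared between both generators.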
Tests:
- Update `test_json_schema.py` so its existing tests also apply to YAML.
  - No changes to these tests otherwise, to ensure stable behavior (with the exception of fixes to `anyOf` and `allOf`).
- Enable the previously disabled `test_generate.py::test_generate_json`, and test both json and yaml modes.
- Incorporate the json-schema.org test suite (~1250 tests).
Behavioral Changes
The only behavior changes are:
- Convert features which result in incorrect patterns to `NotImplementedError`
- Fix `anyOf`, `allOf`, `oneOf`
  - `anyOf`: Previously broken, now ORs sub-patterns
  - `allOf`: Previously broken, now ANDs sub-patterns via positive lookahead
  - `oneOf`: Warns user that it's using `anyOf` instead, and calls `anyOf`
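The combinators above can be illustrated with plain Python `re` patterns. This is a sketch under assumptions: the helper names `any_of`, `all_of` and `one_of` are hypothetical, and the `\Z` anchor inside each lookahead only works when the combined pattern covers the whole input, which a nested sub-schema would not.

```python
import re
import warnings


def any_of(patterns):
    # OR: the text must match at least one sub-pattern.
    return "(?:" + "|".join(f"(?:{p})" for p in patterns) + ")"


def all_of(patterns):
    # AND: every sub-pattern but the last is asserted with a positive
    # lookahead that must reach the end of input; the last one consumes
    # the text. Assumes a non-empty list of patterns.
    *constraints, last = patterns
    lookaheads = "".join(rf"(?=(?:{p})\Z)" for p in constraints)
    return f"{lookaheads}(?:{last})"


def one_of(patterns):
    # "Exactly one" is not enforced here: warn, then fall back to anyOf.
    warnings.warn("oneOf is approximated by anyOf", UserWarning)
    return any_of(patterns)


digits = r"[0-9]+"
three_chars = r".{3}"
both = all_of([digits, three_chars])    # digits AND exactly three characters
either = any_of([digits, r"true|false"])
```

With these definitions, `re.fullmatch(both, "123")` succeeds while `"12"` and `"abc"` are rejected, and `either` accepts `"42"` or `"false"` but not `"maybe"`.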
The rules are now much closer to the JSON Schema spec than on main; however, strict adherence to the JSON Schema spec isn't always desirable. Users can allow the JSON Schema compliant validation rules via `strict_json_schema_subset=False`, resulting in:
- `items`: If unspecified, allow additional items without constraints
- `properties`: If unspecified, allow additional properties without constraints
json-schema.org test suite
This is a large change-set. To verify correctness, in addition to ensuring current tests pass, test_json_schema_full.py tests compliance with JSON Schema by retrieving 1,245 test cases from the official json-schema.org test suite.
| | On this Branch | On main |
|---|---|---|
| Pattern invalid: FP & FN (bad: invisible) | 38 | 246 |
| Raise NotImplementedError (acceptable: visible) | 944 | 693 |
| Pattern valid | 263 | 306 |
Raising NotImplementedError makes it clear to the user why a schema would fail during generation, and it does so before generation.
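As a rough illustration of how a test-suite case could be sorted into the three rows of the table, the sketch below classifies one case at a time. The case layout (`schema`, `tests`, `data`, `valid`) follows the json-schema.org test suite files; `classify` and `toy_to_pattern` are hypothetical names, and serializing with `json.dumps` checks only one of many valid serializations of each instance.

```python
import json
import re


def classify(to_pattern, case):
    """Return the table row a single test-suite case falls into."""
    try:
        pattern = to_pattern(case["schema"])
    except NotImplementedError:
        # Visible failure: the user learns about it before generation starts.
        return "not_implemented"
    for test in case["tests"]:
        matched = re.fullmatch(pattern, json.dumps(test["data"])) is not None
        if matched != test["valid"]:
            # False positive or false negative: an invisible failure.
            return "pattern_invalid"
    return "pattern_valid"


def toy_to_pattern(schema):
    # Hypothetical stand-in for the real schema-to-regex conversion.
    if schema == {"type": "integer"}:
        return r"-?(0|[1-9][0-9]*)"
    if schema == {"type": "boolean"}:
        return r".*"  # deliberately too permissive, to show a false positive
    raise NotImplementedError(f"unsupported schema: {schema!r}")


cases = [
    {"schema": {"type": "integer"},
     "tests": [{"data": 1, "valid": True}, {"data": "a", "valid": False}]},
    {"schema": {"type": "boolean"},
     "tests": [{"data": 3, "valid": False}]},
    {"schema": {"type": "string"},
     "tests": [{"data": "a", "valid": True}]},
]
results = [classify(toy_to_pattern, case) for case in cases]
```

For the three toy cases, `results` is `["pattern_valid", "pattern_invalid", "not_implemented"]`.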
test_json_schema_to_yaml_compliance
For each of the 263 tests which pass in test_json_schema_to_json_compliance, we test to verify their corresponding yaml pattern is also correct.
TODO
- [x] Refactor `json_schema` so it's clean and extensible
- [x] Validate refactor integrity through extensive json-schema test suite
- [x] Implement JSON Schema to YAML
- [x] Apply `test_json_schema_full.py` to yaml
- [x] Finish integrating patrice's patterns
- [x] Improve doc-strings
- [x] ~~Update docs to reflect new behaviour surrounding JSON Schema spec-compliant implementation~~
  - No longer needed thanks to `strict_json_schema_subset`
Further Work
- Enable additional YAML subsets. This implementation can be easily extended to allow parameterization of how the YAML subset is defined. (e.g. quotes around the string or not, indentation rules, etc)
- Refactor because `json_schema.py` does too much. This new structure makes separation of concerns clear, easing a refactor.
- Implement `JSONSchemaRegexGenerator.to_automata(...)`. Not using a pattern intermediate would simplify things.
- Implement `NotImplemented` components based on users opening issues.