Refactor `json_schema.py`, implement JSON Schema to YAML

Open lapp0 opened this issue 1 year ago • 1 comments

  • Fixes https://github.com/dottxt-ai/outlines/issues/923

Overview

Refactor json_schema.py to be more coherent and extensible, then use that extensibility to implement JSON Schema to YAML.

Changes

  • Convert large function to_regex into a class JSONSchemaRegexGenerator with visitors which implement JSON Schema rules, and formatters which implement pattern construction.
  • Implement YAMLRegexGenerator by subclassing JSONSchemaRegexGenerator and overriding some formatters.
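The split between visitors and formatters can be illustrated with a minimal sketch. Only the class names `JSONSchemaRegexGenerator` and `YAMLRegexGenerator` come from this PR; the method names (`visit`, `visit_integer`, `format_object`, ...) and the tiny set of supported types are hypothetical and chosen for illustration, not taken from the actual implementation.

```python
import re


class JSONSchemaRegexGenerator:
    """Toy generator: visitors apply schema rules, formatters build patterns."""

    def visit(self, schema):
        # Dispatch on the "type" keyword to a visit_<type> method (hypothetical naming).
        visitor = getattr(self, f"visit_{schema.get('type')}", None)
        if visitor is None:
            raise NotImplementedError(f"unsupported schema: {schema!r}")
        return visitor(schema)

    def visit_integer(self, schema):
        return r"(0|-?[1-9][0-9]*)"

    def visit_boolean(self, schema):
        return r"(true|false)"

    def visit_object(self, schema):
        # Simplification: every declared property is required, in declared order.
        members = [(name, self.visit(sub)) for name, sub in schema["properties"].items()]
        return self.format_object(members)

    def format_object(self, members):
        # JSON formatter: quoted keys, comma-separated, wrapped in braces.
        body = ", ".join(f'"{re.escape(name)}": {pattern}' for name, pattern in members)
        return r"\{" + body + r"\}"


class YAMLRegexGenerator(JSONSchemaRegexGenerator):
    def format_object(self, members):
        # YAML formatter: unquoted keys, one "key: value" pair per line.
        return "\n".join(f"{re.escape(name)}: {pattern}" for name, pattern in members)


schema = {
    "type": "object",
    "properties": {"id": {"type": "integer"}, "ok": {"type": "boolean"}},
}
json_pattern = JSONSchemaRegexGenerator().visit(schema)
yaml_pattern = YAMLRegexGenerator().visit(schema)
```

In this sketch the YAML subclass reuses every visitor unchanged and overrides a single formatter, which is the kind of reuse the refactor is meant to enable.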

Tests:

  • Update test_json_schema.py so its existing tests also apply to YAML.
    • No changes to these tests otherwise to ensure stable behavior (with the exception of fixes to anyOf and allOf)
  • Enable previously disabled test_generate.py::test_generate_json, test both json and yaml modes.
  • Incorporate json-schema.org test suite (~1250 tests)

Behavioral Changes

The only behavior changes are:

  • Features which previously resulted in incorrect patterns now raise NotImplementedError
  • Fix anyOf, allOf, oneOf
    • anyOf: Previously broken, now ORs sub-patterns
    • allOf: Previously broken, now ANDs sub-patterns via positive lookahead
    • oneOf: Warns user that it's using anyOf instead, and calls anyOf
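A minimal sketch of how these three combinators can be expressed on regex strings follows. The function names `any_of`, `all_of`, and `one_of` are hypothetical, and anchoring each lookahead with `\Z` is an assumption that only holds when the combined pattern is matched against a whole string; the PR's own implementation may differ.

```python
import re
import warnings


def any_of(patterns):
    # OR the sub-patterns with alternation.
    return "(?:" + "|".join(f"(?:{p})" for p in patterns) + ")"


def all_of(patterns):
    # AND the sub-patterns: every pattern except the last becomes a positive
    # lookahead that must match the entire remaining input (hence \Z), and the
    # last pattern consumes the input.
    *guards, last = patterns
    lookaheads = "".join(f"(?=(?:{p})\\Z)" for p in guards)
    return lookaheads + f"(?:{last})"


def one_of(patterns):
    # A regex cannot easily express "exactly one", so fall back to anyOf and warn.
    warnings.warn("oneOf is approximated by anyOf", UserWarning)
    return any_of(patterns)


lowercase_3_to_5 = all_of([r"[a-z]+", r".{3,5}"])
digits_or_true = any_of([r"[0-9]+", r"true"])
```

Here `lowercase_3_to_5` accepts only strings that are both all lowercase letters and three to five characters long, while `digits_or_true` accepts a string matching either alternative.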

The rules are now much closer to the JSON Schema spec than on main; however, strict adherence to the JSON Schema spec isn't always desirable. Users can enable the JSON Schema compliant validation rules via strict_json_schema_subset=False, resulting in:

  • items: If unspecified, allow additional items without constraints
  • properties: If unspecified, allow additional properties without constraints

json-schema.org test suite

This is a large change-set. To verify correctness, in addition to ensuring current tests pass, test_json_schema_full.py tests compliance with JSON Schema by retrieving 1,245 test cases from the official json-schema.org test suite.

| Outcome | On this branch | On main |
| --- | --- | --- |
| Pattern invalid: FP & FN (bad: invisible) | 38 | 246 |
| Raise NotImplementedError (acceptable: visible) | 944 | 693 |
| Pattern valid | 263 | 306 |

Raising NotImplementedError makes it clear to the user why a schema would fail during generation, and it does so before generation.
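The three outcome categories in the table can be sketched as a small classifier. This is an illustration only: `classify` and `toy_build_pattern` are hypothetical names, the toy builder stands in for the real schema-to-regex conversion, and the three-case suite is made up rather than drawn from the json-schema.org test suite.

```python
import re
from collections import Counter


def classify(build_pattern, schema, text, expected_valid):
    """Classify one test case: a schema, a serialized instance, and its expected validity."""
    try:
        pattern = build_pattern(schema)
    except NotImplementedError:
        return "not_implemented"  # visible failure, raised before generation
    matched = re.fullmatch(pattern, text) is not None
    if matched != expected_valid:
        return "pattern_invalid"  # false positive or false negative
    return "pattern_valid"


def toy_build_pattern(schema):
    # Stand-in for the real generator, supporting only two schemas.
    if schema == {"type": "boolean"}:
        return r"true|false"
    if schema == {"type": "integer"}:
        return r"[0-9]+"  # deliberately incomplete: rejects negative integers
    raise NotImplementedError(schema)


suite = [
    ({"type": "boolean"}, "true", True),
    ({"type": "integer"}, "-3", True),  # valid instance the toy pattern rejects
    ({"not": {}}, "1", False),  # keyword the toy builder does not support
]
counts = Counter(classify(toy_build_pattern, *case) for case in suite)
```

A false negative such as the `-3` case surfaces only at generation time, which is why the table treats an invalid pattern as worse than an explicit NotImplementedError.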

test_json_schema_to_yaml_compliance

For each of the 263 tests which pass in test_json_schema_to_json_compliance, we verify that the corresponding YAML pattern is also correct.

TODO

  • [x] Refactor json_schema so it's clean and extensible
  • [x] Validate refactor integrity through extensive json-schema test suite
  • [x] Implement JSON Schema to YAML
  • [x] Apply test_json_schema_full.py to yaml
  • [x] Finish integrating patrice's patterns
  • [x] Improve doc-strings
  • [x] ~~Update docs to reflect new behaviour surrounding JSON Schema spec-compliant implementation~~
    • No longer needed thanks to strict_json_schema_subset

Further Work

  • Enable additional YAML subsets. This implementation can easily be extended to parameterize how the YAML subset is defined (e.g. whether strings are quoted, indentation rules, etc.).
  • Refactor further, because json_schema.py does too much. The new structure makes the separation of concerns clear, which eases such a refactor.
  • Implement JSONSchemaRegexGenerator.to_automata(...). Not using a pattern as an intermediate would simplify things.
  • Implement components that currently raise NotImplementedError, based on the issues users open.

lapp0 · Sep 30 '24 01:09