Translate JSON Schema to YAML regex
This is a tentative implementation of a regex generator for arbitrary YAML given a JSON Schema. This PR relates to #923.
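For readers new to the idea, here is a minimal sketch of what "a regex for YAML given a JSON Schema" can mean. It is not the code in this PR: the function `toy_yaml_regex` and the `TYPE_PATTERNS` table are hypothetical names made up for illustration, and the sketch only handles a flat mapping of scalar properties, all treated as required and emitted in schema order.

```python
import re

# Hypothetical per-type patterns, for illustration only (not taken from the PR).
TYPE_PATTERNS = {
    "integer": r"-?(?:0|[1-9][0-9]*)",
    "boolean": r"(?:true|false)",
    "string": r"[A-Za-z0-9_]+",  # unquoted and deliberately restrictive
}


def toy_yaml_regex(schema: dict) -> str:
    """Build a regex matching a flat YAML mapping described by `schema`.

    Assumptions: only top-level "properties" with scalar types are handled,
    every property is required, and keys appear in schema order.
    """
    lines = []
    for name, subschema in schema["properties"].items():
        value_pattern = TYPE_PATTERNS[subschema["type"]]
        lines.append(re.escape(name) + ": " + value_pattern)
    # One "key: value" pair per line, separated by a newline.
    return r"\n".join(lines)


schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "active": {"type": "boolean"},
    },
}
pattern = toy_yaml_regex(schema)
valid = re.fullmatch(pattern, "name: alice\nage: 30\nactive: true") is not None
invalid = re.fullmatch(pattern, "name: alice\nage: thirty\nactive: true") is not None
```

A real generator also has to deal with nesting, arrays, optional keys and string escaping, which is where most of the open questions in this thread come from.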
There are still some issues:
- The NUMBER regex can be revisited. I have taken it from the official pyyaml repo. The INTEGER regex can also be changed.
- I don't address the refactoring suggested in #923. This results in some code duplication between both files.
- The regex generator is not very strict regarding the indentation levels, which might make it impractical in some use cases.
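To make the NUMBER / INTEGER point above concrete, here is one possible stricter pair of patterns. These are assumptions for illustration only: they are neither the patterns used in this PR nor the ones from pyyaml, and the helper names `is_integer` and `is_number` are hypothetical.

```python
import re

# Illustrative patterns only; NOT the ones used in the PR or in pyyaml.
# INTEGER: optional sign, then either a lone zero or digits with no leading zero.
INTEGER = r"[-+]?(?:0|[1-9][0-9]*)"
# NUMBER: an INTEGER, an optional fractional part, and an optional exponent.
NUMBER = INTEGER + r"(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?"


def is_integer(text: str) -> bool:
    """Return True if the whole string matches the sketch INTEGER pattern."""
    return re.fullmatch(INTEGER, text) is not None


def is_number(text: str) -> bool:
    """Return True if the whole string matches the sketch NUMBER pattern."""
    return re.fullmatch(NUMBER, text) is not None
```

This sketch deliberately rejects forms such as `.5`, `1.` or `1_000`. Whether the YAML regex should accept those is exactly the kind of decision that still needs to be made here.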
@rlouf @lapp0
Thank you for the PR! It looks like many tests related to these changes are failing…
I have already copied some tests from test_json_schema.py and updated them for the YAML use case, but I am open to reusing the code as you have done it (or simply add more tests from the test_json file and edit them).
Given that you're refactoring some of the code on your branch, what is the easiest way to go forward to minimize the amount of duplication / deprecation of code?
I've refactored json_schema.py, but unless you're interested in incorporating those changes, we can just focus on test_json_schema.py for now. Simply replacing test_json_schema.py on your branch with my branch's version and ensuring it works is sufficient. The new module simply ensures the tested behavior in json_schema.py is matched by yaml_schema.py.
Please let me know if you have any other questions.
@patricebechard is there anything I can do to help with this?
Sorry, I was quite busy lately, but I can work on it this week. I will let you know if I need help with anything.
No worries at all, thanks for your continued work!
I was finally able to make some changes including support for indentation.
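As a rough illustration of what indentation support in the regex can look like, here is a minimal sketch. It is not the implementation in this PR: `nested_mapping_regex` is a hypothetical helper, two spaces per nesting level is an assumption, and only nested mappings with integer leaves are handled.

```python
import re

INDENT = "  "  # assumption: exactly two spaces per nesting level


def nested_mapping_regex(schema: dict, depth: int = 0) -> str:
    """Regex for nested YAML mappings whose leaves are integers.

    Every key must appear at exactly `depth` levels of indentation.
    """
    prefix = re.escape(INDENT * depth)
    parts = []
    for name, subschema in schema["properties"].items():
        key = prefix + re.escape(name) + ":"
        if subschema.get("type") == "object":
            # A nested mapping starts on the next line, one level deeper.
            parts.append(key + r"\n" + nested_mapping_regex(subschema, depth + 1))
        else:
            parts.append(key + r" -?[0-9]+")
    return r"\n".join(parts)


schema = {
    "type": "object",
    "properties": {
        "point": {
            "type": "object",
            "properties": {"x": {"type": "integer"}, "y": {"type": "integer"}},
        },
        "id": {"type": "integer"},
    },
}
pattern = nested_mapping_regex(schema)
well_indented = re.fullmatch(pattern, "point:\n  x: 1\n  y: -2\nid: 7") is not None
badly_indented = re.fullmatch(pattern, "point:\n x: 1\n  y: -2\nid: 7") is not None
```

Fixing the indent width keeps the pattern simple. Allowing any consistent width would need the generator to enumerate the allowed widths, since a plain regex cannot count.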
Some caveats:
- I am currently skipping some tests because the behavior differs between YAML and JSON in some cases (e.g. datetimes without quotes).
- the implementation differs from what @lapp0 has on his branch, which would mean we would have to do some refactoring at some point.
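To illustrate the datetime caveat above, here is a small sketch of how the same value can be constrained differently in the two formats. The `DATETIME` pattern is a simplified assumption (UTC timestamps only), and the names are hypothetical rather than taken from either module.

```python
import re

# Simplified, hypothetical pattern: UTC timestamps such as 2024-01-01T00:00:00Z.
DATETIME = r"[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z"

# In JSON the value is a string, so the surrounding quotes are mandatory.
JSON_DATETIME = '"' + DATETIME + '"'

# In YAML the same value may appear with or without the quotes.
YAML_DATETIME = '(?:"' + DATETIME + '"|' + DATETIME + ")"

quoted = '"2024-01-01T00:00:00Z"'
unquoted = "2024-01-01T00:00:00Z"

json_accepts_unquoted = re.fullmatch(JSON_DATETIME, unquoted) is not None
yaml_accepts_unquoted = re.fullmatch(YAML_DATETIME, unquoted) is not None
```

A test copied from the JSON suite that expects the unquoted form to be rejected would therefore fail against a YAML pattern like this one.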
I am also making sure that we support both quoted and unquoted strings for YAML. One of the main advantages of using YAML for guided generation is that it needs fewer tokens. If we add double quotes around every string, we end up with a generation that is almost as long as the one obtained in JSON, so transitioning to YAML would not make sense.
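Here is a small sketch of the quoted / unquoted idea, with a rough size comparison. The patterns and names are assumptions for illustration, not the ones in this PR, and character counts are only a crude stand-in for token counts, which depend on the tokenizer.

```python
import json
import re

# Illustrative patterns only (not the PR's).
QUOTED = r'"[^"\\\n]*"'  # double-quoted, no escapes or newlines inside
UNQUOTED = r"[A-Za-z_](?:[A-Za-z0-9_ ]*[A-Za-z0-9_])?"  # deliberately conservative
YAML_STRING = f"(?:{QUOTED}|{UNQUOTED})"

record = {"name": "Ada Lovelace", "role": "engineer"}

# The same record serialized three ways.
as_json = json.dumps(record)
as_yaml_quoted = "\n".join(f'{key}: "{value}"' for key, value in record.items())
as_yaml_unquoted = "\n".join(f"{key}: {value}" for key, value in record.items())

sizes = {
    "json": len(as_json),
    "yaml_quoted": len(as_yaml_quoted),
    "yaml_unquoted": len(as_yaml_unquoted),
}
```

For this record the unquoted YAML form is the shortest and JSON the longest, with quoted YAML in between.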
We may want to smoke test qualitative generation performance when using YAML. Out of scope for this PR, but disallowing quotes may, in some cases, confuse the model.
> the implementation differs from what @lapp0 has on his branch, which would mean we would have to do some refactoring at some point.
This is fine for now.
Thanks for getting this working! Please let me know if this PR is ready for review.
Not exactly sure how the coverage is computed here. It says the coverage for yaml_schema.py is ~1% although it should be higher. Any idea how to remedy this? This did not happen previously from what I understand.
Change in coverage may be related to https://github.com/outlines-dev/outlines/pull/1089
I've created a separate issue to address this problem https://github.com/outlines-dev/outlines/issues/1105
Hi @patricebechard! We recently released v1 of Outlines, and Outlines has changed a bit since you last worked on this PR. In particular, more operations are now handled by outlines-core. What you want to do in this PR is very similar to the function build_regex_from_schema in outlines_core. If you're still interested in this topic, it could be worth looking at how that function works.