Fix Infinite Repetition in JSON Schemas Using Integer and String
Overview
Language models are prone to repetition. Combined with patterns that allow infinite-length fields, this results in broken JSON Schema outputs.
This was previously addressed for infinite whitespace by making a safe whitespace pattern the default. This PR extends the same safety to Integer and String patterns.
Behavior
json_schema.to_regex now includes a kwarg safe_subset=True.
safe_subset=False
- Whitespace: r"[\n\t ]*"
- Integer: any number
- String: any string
safe_subset=True (default)
- Whitespace: r"[ ]?"
- Integer: (-1e19, 1e19)
- String: Any string of length (0, 256)
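To make the difference between the two modes concrete, here is a minimal sketch using only the standard library. The whitespace patterns are the ones listed above; the integer and string patterns are illustrative assumptions (not the regexes from this PR), and the string pattern ignores escape sequences and treats 256 as an inclusive upper bound.

```python
import re

# Whitespace patterns as listed in the Behavior section.
UNSAFE_WHITESPACE = r"[\n\t ]*"
SAFE_WHITESPACE = r"[ ]?"

# Assumed, simplified patterns for illustration only.
UNSAFE_INTEGER = r"(-)?(0|[1-9][0-9]*)"
SAFE_INTEGER = r"(-)?(0|[1-9][0-9]{0,18})"  # at most 19 digits, so |n| < 1e19
UNSAFE_STRING = r'"[^"\\\x00-\x1f]*"'
SAFE_STRING = r'"[^"\\\x00-\x1f]{0,256}"'  # at most 256 characters


def accepts(pattern: str, text: str) -> bool:
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# A model stuck in a repetition loop can keep emitting digits or characters.
runaway_int = "1" * 40
runaway_str = '"' + "a" * 1000 + '"'

print(accepts(UNSAFE_INTEGER, runaway_int))  # True: nothing forces the field to end
print(accepts(SAFE_INTEGER, runaway_int))    # False: the bound cuts the loop off
print(accepts(UNSAFE_STRING, runaway_str))   # True
print(accepts(SAFE_STRING, runaway_str))     # False
```

With a bounded pattern, the generation is forced to close the field once the limit is reached, so the surrounding JSON can still terminate.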
Fixes
Safe Integer
- Fixes https://github.com/outlines-dev/outlines/issues/1110
- Fixes https://github.com/dottxt-ai/outlines/issues/1099
Safe String
- Fixes https://github.com/dottxt-ai/outlines/issues/1075
- Addresses https://github.com/dottxt-ai/outlines/issues/985 (doesn't fix, they requested a non_strict mode)
- Fixes https://github.com/outlines-dev/outlines/issues/1106
Further Work
- Important: In the resolved issues, the incorrect outputs are often caused by not applying a chat template. Let's help users get great completions. Examples should include chat templates, or user response quality will suffer. https://github.com/outlines-dev/outlines/issues/987
- Make code more failsafe: https://github.com/outlines-dev/outlines/issues/985
- number has no safe_subset implementation. It's likely the only unsafe primitive remaining without a safe_subset implementation. However, there aren't any open issues for an error caused by number.
@cpfiffer fyi
We might want to hold off on this one actually. I did some profiling on get_str_pattern. Length-constrained strings produce a large number of FSM states and take a long time to compile.
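>>> import interegular
>>> # assumes STRING_INNER is importable from outlines.fsm.json_schema at this commit
>>> from outlines.fsm.json_schema import STRING_INNER
>>> len(interegular.parse_pattern(STRING_INNER + "{,256}").to_fsm().reduce().states)
513
>>> len(interegular.parse_pattern(STRING_INNER + "*").to_fsm().reduce().states)
2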
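For intuition on where those counts come from, here is a rough, hand-built model of the automaton. It is not what interegular does internally. It assumes STRING_INNER matches either one plain character or a two-character escape (a backslash followed by a quote or backslash), and it counts only live states. The function name count_states and the three symbol classes are hypothetical.

```python
from collections import deque


def count_states(max_items=None):
    """Count reachable live states of a hand-built DFA for STRING_INNER
    repeated up to max_items times (None means unbounded, i.e. '*')."""

    def step(state, symbol):
        count, in_escape = state
        if in_escape:
            # After a backslash, only a quote or backslash completes the item.
            if symbol in ("backslash", "quote"):
                return (count if max_items is None else count + 1, False)
            return None  # dead: invalid escape
        if max_items is not None and count >= max_items:
            return None  # dead: length budget exhausted
        if symbol == "plain":
            return (count if max_items is None else count + 1, False)
        if symbol == "backslash":
            return (count, True)
        return None  # dead: bare quote inside the string body

    start = (0, False)
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for symbol in ("plain", "backslash", "quote"):
            nxt = step(state, symbol)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen)


print(count_states(256))   # 513
print(count_states(None))  # 2
```

In this model the bounded pattern has to remember how many items it has consumed, so the state count grows as 2 * limit + 1, while the unbounded pattern needs no counter at all. Under the same assumptions, a limit of 20 gives 41 states.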
The better alternative is to
- now: Update this PR, enable safe_subset as an optional, non-default parameter
- soon: Reduce failure rate via
  - update docs, recommend pydantic constr by default, not str
  - https://github.com/dottxt-ai/outlines/issues/987
  - https://github.com/dottxt-ai/outlines/issues/985
- later: integrate this functionality into an automaton which has a state stack, CFG or otherwise.
Alternatively, we could reduce the maximum length of the safe_subset string to something like 20 instead of 256.
Let me know if this makes sense