
Fix Infinite Repetition in JSON Schemas Using Integer and String

Open lapp0 opened this issue 1 year ago • 2 comments

Overview

Language models tend to fall into repetition loops; combined with patterns that allow fields of unbounded length, this produces broken JSON Schema outputs.

This was previously addressed for infinite whitespace by making a safe whitespace pattern the default. This PR extends the same safety to the Integer and String patterns.

Behavior

json_schema.to_regex now includes a kwarg safe_subset=True.

safe_subset=False

  • Whitespace: r"[\n\t ]*"
  • Integer: any number
  • String: any string

safe_subset=True (default)

  • Whitespace: r"[ ]?"
  • Integer: (-1e19, 1e19)
  • String: Any string of length (0, 256)
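
To make the bounds above concrete, here is a minimal sketch using only Python's `re` module. The three patterns are illustrative assumptions written to mirror the `safe_subset=True` limits; they are not the patterns shipped in this PR, and the string pattern deliberately ignores escape sequences.

```python
import re

# Hypothetical patterns (not taken from outlines) mirroring the safe_subset=True bounds.
SAFE_WHITESPACE = r"[ ]?"                   # at most one space
SAFE_INTEGER = r"-?(0|[1-9][0-9]{0,18})"    # at most 19 digits, so |n| < 1e19
SAFE_STRING = r'"[^"\\\x00-\x1f]{0,256}"'   # at most 256 unescaped characters


def matches(pattern: str, text: str) -> bool:
    """Return True if the whole text is accepted by the pattern."""
    return re.fullmatch(pattern, text) is not None


# A runaway digit repetition is rejected once it exceeds 19 digits.
print(matches(SAFE_INTEGER, "1" * 30))  # False
```

Because every quantifier is bounded, a model that keeps repeating a digit or character eventually has no legal continuation other than closing the field.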

Fixes

Safe Integer

  • Fixes https://github.com/outlines-dev/outlines/issues/1110
  • Fixes https://github.com/dottxt-ai/outlines/issues/1099

Safe String

  • Fixes https://github.com/dottxt-ai/outlines/issues/1075
  • Addresses https://github.com/dottxt-ai/outlines/issues/985 (does not fix it; the reporter requested a non_strict mode)
  • Fixes https://github.com/outlines-dev/outlines/issues/1106

Further Work

  • Important: In the resolved issues, the incorrect outputs are often caused by not applying a chat template. Let's help users get great completions. Examples should include chat templates, or user response quality will suffer. https://github.com/outlines-dev/outlines/issues/987

  • Make code more failsafe: https://github.com/outlines-dev/outlines/issues/985

  • number has no safe_subset implementation. It's likely the only unsafe primitive remaining without a safe_subset implementation. However, there aren't any open issues for an error caused by number.

lapp0 avatar Sep 16 '24 02:09 lapp0

@cpfiffer fyi

rlouf avatar Sep 17 '24 13:09 rlouf

We might want to hold off on this one, actually. I did some profiling on get_str_pattern. Length-constrained strings produce an FSM with many states and take a long time to compile.

>>> len(interegular.parse_pattern(STRING_INNER + "{,256}").to_fsm().reduce().states)
513
>>> len(interegular.parse_pattern(STRING_INNER + "*").to_fsm().reduce().states)
2

The better alternative is to

  • now: Update this PR to make safe_subset an optional, non-default parameter
  • soon: Reduce failure rate via
    • update docs, recommend pydantic constr by default, not str
    • https://github.com/dottxt-ai/outlines/issues/987
    • https://github.com/dottxt-ai/outlines/issues/985
  • later: integrate this functionality into an automaton that has a state stack (CFG or otherwise).
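
As a rough illustration of the "soon" item, the bound can live in the schema itself rather than in a global default. The sketch below builds a hypothetical schema by hand with the standard JSON Schema `maxLength` keyword, which is what pydantic's `constr(max_length=...)` emits. The field names are made up, and it is an assumption here that the schema-to-regex conversion honors `maxLength`.

```python
import json

# Hypothetical schema: the string field carries its own length bound,
# so it does not rely on any safe_subset default.
bounded_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "maxLength": 20},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

# Serialized form, as it would be handed to a schema-to-regex converter.
schema_str = json.dumps(bounded_schema)
print(schema_str)
```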

Alternatively we could reduce the size of safe_subset str to something like 20 instead of 256.
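
To see why the bound drives the state count, here is a hand-built model rather than interegular output. It assumes STRING_INNER accepts either one plain character or a backslash followed by one escaped character, and it counts only live (non-dead) states. Under those assumptions it reproduces the 513 and 2 reported above, and suggests what a limit of 20 would cost.

```python
from collections import deque
from typing import Optional


def count_live_states(max_items: Optional[int]) -> int:
    """Count reachable non-dead DFA states for (plain | backslash escaped){,max_items}.

    max_items=None models the unbounded `*` repetition.
    """
    start = (0, False)  # (completed items, currently inside an escape sequence)
    seen = {start}
    queue = deque([start])
    while queue:
        count, in_escape = queue.popleft()
        # Unbounded repetition needs no counter, so the count never advances.
        done = count if max_items is None else count + 1
        if in_escape:
            successors = [(done, False)]  # the escaped character completes the item
        elif max_items is None or count < max_items:
            successors = [(done, False), (count, True)]  # plain character, or backslash
        else:
            successors = []  # limit reached: no further item is allowed
        for state in successors:
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return len(seen)


print(count_live_states(256))   # 513
print(count_live_states(None))  # 2
print(count_live_states(20))    # 41
```

In this model the count grows as 2n + 1 in the bound n, because the automaton has to remember how many items it has consumed and whether it is mid-escape.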

Let me know if this makes sense.

lapp0 avatar Sep 17 '24 17:09 lapp0