Is there a way to parse JSON in non-strict mode?

Open voberoi opened this issue 10 months ago • 4 comments

Is your feature request related to a problem? Please describe.

Anthropic's models regularly emit control characters in their strings, producing invalid JSON that causes validation to fail.

Describe the solution you'd like

I'd like the option to parse JSON by passing strict=False, as the docs over here indicate I can.

[Screenshot: CleanShot 2024-04-17 at 23 09 51@2x]

Describe alternatives you've considered

I can't figure out how to do this, or whether this is functionality you intend the library to have (or that it once had but no longer does).

voberoi avatar Apr 18 '24 03:04 voberoi

I second

sigren avatar Apr 19 '24 19:04 sigren

It looks like instructor uses Pydantic's model_validate_json. It has a strict param, but that doesn't have the same semantics as strict in json.loads(..., strict=False).
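To make the difference concrete, here is a minimal stdlib-only sketch of what strict controls in json.loads: whether raw control characters (such as a literal newline) are accepted inside strings. Pydantic's strict, by contrast, governs type coercion during validation. The sample payload is made up for illustration.

```python
import json

# A payload with a raw (unescaped) newline inside a string value,
# similar to what a model might emit. The content is illustrative.
raw = '{"summary": "line one\nline two"}'

# The default (strict=True) rejects control characters inside strings.
try:
    json.loads(raw)
    strict_ok = True
    strict_error = ""
except json.JSONDecodeError as exc:
    strict_ok = False
    strict_error = exc.msg

# strict=False accepts them and keeps the character in the parsed value.
lenient = json.loads(raw, strict=False)

print(strict_ok, strict_error)
print(lenient["summary"].splitlines())
```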

It's possible to do model_validate(json.loads(..., strict=False)) -- is that something you'd consider allowing?

It's less performant and it won't work in a streaming context, but it gives clients an out when models misbehave with control characters.
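A rough sketch of that two-step approach, assuming Pydantic v2 is installed. The Note model and the payload are hypothetical and only there to illustrate the idea; this is not how instructor itself is implemented.

```python
import json

from pydantic import BaseModel, ValidationError


class Note(BaseModel):
    """Hypothetical response model, used only for this illustration."""

    title: str
    body: str


# Raw newline inside a JSON string: invalid under strict JSON parsing.
raw = '{"title": "todo", "body": "first\nsecond"}'

# Validating the JSON text directly is expected to fail on the raw newline.
try:
    Note.model_validate_json(raw)
    direct_ok = True
except ValidationError:
    direct_ok = False

# Workaround: parse leniently with the stdlib, then validate the dict.
note = Note.model_validate(json.loads(raw, strict=False))

print(direct_ok, repr(note.body))
```

The trade-off is the one described above: the whole payload has to be available before parsing, so this does not fit partial or streaming responses.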

My proposal is something like:

  • Clients can pass in control_characters_allowed=True or something to instructor.from_{client}
  • If that param is True:
    • Streaming responses are not allowed.
    • JSON is parsed using json from the stdlib using strict=False before Pydantic validation.

Some other ideas:

  • Only do this as a fallback when the validation error is because JSON is invalid due to control characters in strings.
  • Give clients the option to fail the first time there's an error due to control characters (vs. going through all the retries) so they can recover with: https://github.com/jxnl/instructor/commit/339c22ec58abec1d425fe1d0556406c66721a5f5.
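The first idea in that list could look roughly like the sketch below, using only the stdlib. loads_with_fallback is a hypothetical helper name, and matching on the error message is an assumption about CPython's wording for this error, not a documented contract.

```python
import json
from typing import Any


def loads_with_fallback(raw: str) -> Any:
    """Parse strictly; retry leniently only for control-character errors.

    Hypothetical helper for illustration, not part of instructor.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        # Assumption: CPython reports this case with a message that
        # starts with "Invalid control character".
        if not exc.msg.startswith("Invalid control character"):
            raise  # any other JSON error is still surfaced unchanged
        return json.loads(raw, strict=False)


# A raw tab inside a string triggers the lenient retry.
print(loads_with_fallback('{"a": "x\ty"}'))
```

Other malformed input (for example, a truncated object) still raises json.JSONDecodeError, so the lenient path does not hide unrelated errors.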

Or maybe your thought is to let clients handle this particular error. I'm doing that now, but I think it would be nice to bake handling this into instructor instead.

Links:

  • https://docs.pydantic.dev/2.7/concepts/json/
  • https://docs.pydantic.dev/latest/concepts/performance/#in-general-use-model_validate_json-not-model_validatejsonloads

voberoi avatar Apr 21 '24 02:04 voberoi

I'd actually rather expose strict=False to the create call's patch. Let me try to spend 10 minutes on this; if you don't hear back, I'd take a PR too!

jxnl avatar Apr 21 '24 19:04 jxnl

https://github.com/jxnl/instructor/pull/618 please take a look

jxnl avatar Apr 21 '24 19:04 jxnl

Hey! Thanks for your help on this.

Continuing the discussion here since #618 is merged.

I'm happy to provide a patch for this functionality.

What do you think of merging Pydantic's and json.loads' "strict" semantics, as in this example here: https://github.com/jxnl/instructor/pull/618#issuecomment-2069506854

Or would you prefer to split these up?

voberoi avatar Apr 29 '24 13:04 voberoi

I think it makes sense, especially since control characters are an issue.

jxnl avatar Apr 29 '24 23:04 jxnl

Great -- will merge those semantics. I'll take a crack at a patch.

voberoi avatar May 01 '24 20:05 voberoi

Fixed in #644.

voberoi avatar May 01 '24 22:05 voberoi