jsonschema icon indicating copy to clipboard operation
jsonschema copied to clipboard

JSON Schema Regex issues

Open utx0 opened this issue 6 years ago • 7 comments

I am currently working with a JSON Schema that is shared with a frontend application for building and validating a data structure that then get's passed to my backend go service which also uses the schema for data validation. Now the issue I'm having is the schema in question is using the regex pattern keyword for data validation and this works fine when its in the js world, however when using this schema with this jsonschema package I get the following error:

panic: error unmarshaling properties from json: error unmarshaling items from json: json: cannot unmarshal object into Go value of type []*jsonschema.Schema

goroutine 1 [running]:
github.com/qri-io/jsonschema.Must(0xc000166000, 0x776, 0x776)
        /Users/metta/golang/pkg/mod/github.com/qri-io/[email protected]/schema.go:24 +0xbf

Now from my research go uses a different regex engine under the hood. And I am wondering how other deal with this when sharing a schema between different languages such as go and js?

utx0 avatar Apr 27 '19 02:04 utx0

Thankfully the spec has guidance on regex interop. Specifically:

given the high disparity in regular expression constructs support, schema authors SHOULD limit themselves to the following regular expression tokens:

individual Unicode characters, as defined by the JSON specification [RFC7159];
simple character classes ([abc]), range character classes ([a-z]);
complemented character classes ([^abc], [^a-z]);
simple quantifiers: "+" (one or more), "*" (zero or more), "?" (zero or one), and their lazy versions ("+?", "*?", "??");
range quantifiers: "{x}" (exactly x occurrences), "{x,y}" (at least x, at most y, occurrences), {x,} (x occurrences or more), and their lazy versions;
the beginning-of-input ("^") and end-of-input ("$") anchors;
simple grouping ("(...)") and alternation ("|").

So, basically, when writing a schema, try to keep the regex simple. But we can get a little more specific b/c you're dealing in go-to-js regex conversion. Javascript is supposed to use ECMA-262 regex, and go's regexp library is built to satisfy RE2. Wikipedia has a great comparison between the two:

  • https://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines#Part_1
  • https://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines#Part_2

Looks like the two biggest difference on the "basic features" are backreferences and look-ahead. So for your purposes it would be worth keeping those in mind. JS doesn't support any of the fancy (scary?) features on the second table, so we don't need to worry about those.

I consider it a bug that you didn't get a good, human-readable error when regex parsing fails due to these limitations. If you have any interest in filing a PR that adds a test and a better error message, I'd be deeply appreciative. A great place to start would be returning a better error message here: https://github.com/qri-io/jsonschema/blob/094d15abc20e286f43f992a2f032c05704fa5ab6/keywords_format.go#L258

b5 avatar Apr 27 '19 15:04 b5

The implementation uses package regexp to validate pattern properties. But the regexp package does not implement the full regexp syntax of ECMA-262 used in the JSON Schema spec. This means that jsonschema is not compliant.

This problem is not even documented in the README or in package documentation.

dolmen avatar Apr 28 '20 10:04 dolmen

I agree that this package is not 100% compliant with the spec, but not for this reason. The section from the spec you're quoting:

This string SHOULD be a valid regular expression, according to the ECMA 262 regular expression dialect.

I interpret the use of the word SHOULD in the above sentence to mean that both json schema implementers and spec authors SHOULD aim for ECMA-262. If fully implementing ECMA-262 were required to be considered part of the spec, I would expect the above sentence to use MUST instead of SHOULD.

Doing a full ECMA-262 spec implementation would require an extra import that isn't in the go standard library. We haven't had anyone ask for it yet, and I prefer to stick to standard library imports whenever possible, but if this is a major sticking point for people using this package, I'd be happy to talk through it.

This problem is not even documented in the README or in package documentation.

We'd gladly accept a pull request that documents the difference in regex parsing.

b5 avatar Apr 28 '20 22:04 b5

Yeah, regexes are a bit of an interoperability nightmare. As an example, you'll note that . (dot) is not included in that list of tokens that people should limit themselves to, yet it's one of the most commonly used regex tokens.

There's no great solution, the language in the spec is intended to indicate that JSON Schema inherits the same interoperability problems, and we pick one recommended standard with the understanding that none of the standards are universally supported.

handrews avatar Apr 29 '20 20:04 handrews

I also ran into this issue when using negative lookbehind in the regex pattern, and it took me a while to understand what the problem was.

Any chance we could get clearer error message when this occurs?

joshuacc avatar Sep 01 '22 22:09 joshuacc

The Otto project (a JavaScript interpreter) has a function that fixes some incompatibilities between Go regexp and JS regexp.

But that will not add lookbehind.

dolmen avatar Apr 26 '23 09:04 dolmen

Module github.com/dlclark/regexp2 has an ECMAScript mode. But it is important for safety (protect against DoS) to set timeouts, so the jsonschema API must allow to set regexp time limits.

dolmen avatar Apr 26 '23 11:04 dolmen