json-path-comparison icon indicating copy to clipboard operation
json-path-comparison copied to clipboard

Multiplication and division in Proposal A

Open glyn opened this issue 5 years ago • 31 comments

Proposal A lists multiplication and division as "to dos". I'm nervous that this is the start of scripting and once these operations are introduced, the proposal may be drawn into implementing some not particularly well scoped subset of JavaScript.

Can we instead agree that scripting is out of scope and drop multiplication and division? If not, how/where do we draw the line?

glyn avatar Jun 02 '20 15:06 glyn

I agree that multiplication and division seem a bit out of place and open an enticing door. It's already hard enough to verify that something isn't Turing-complete for safety purposes. Any operation that adds choice or change makes it more likely that unintended completeness will be achieved.

jchesterpivotal avatar Jun 02 '20 16:06 jchesterpivotal

@jchesterpivotal Perhaps that particular horse has already bolted since JSONPath filters include logical expressions. See accidentally Turing complete.

glyn avatar Jun 02 '20 20:06 glyn

Even now Turing's ghost taunts us. It's enough to drive a man to Church.

jchesterpivotal avatar Jun 02 '20 22:06 jchesterpivotal

Full agree here.

The TODO part I think points to the fact that Proposal A does return a valid response (an empty list) instead of NOT_SUPPORTED. The problem is that @.key/10 is parsed as a valid dot notation, which I think it shouldn't.

Hi @jchesterpivotal, thanks for chiming in!

cburgmer avatar Jun 03 '20 19:06 cburgmer

Thanks @glyn, please keep those issues coming!

cburgmer avatar Jun 03 '20 19:06 cburgmer

The problem is that @.key/10 is parsed as a valid dot notation, which I think it shouldn't.

Hmmm, I'm not convinced, for three reasons.

Firstly, the same would then have to apply to $.key/10 for consistency. Jayway and Goessner (online here), as well as yaml-jsonpath (currently online here) match $.key/10 in:

{"key/10": "x"}

Going back to @.key/10 the results aren't so consistent. yaml-jsonpath matches $[?(@.key/10)] in:

[{"key/10": "x"}]

whereas Jayway fails to parse the path with Could not find matching close for / when parsing regex in : $[?(@.key/10)] (!) and Goessner parses the path but fails to match it.

That said, the above could be handled using bracket child notation which presumably should accept arithmetic operators.

Secondly, it seems somewhat arbitrary to reject just the basic four arithmetic operators.

Thirdly, limiting the check to these four operators effectively acknowledges the existence of implementations which support these operators, but not of other implementations with a broader set of operators.

glyn avatar Jun 04 '20 05:06 glyn

Yes, I see the problem here. Maybe this one is best tackled by not looking at the consensus too strictly, but come up with a good reasoning to which characters are allowed, and which ones are not (and why, of course).

The initial Regex was trying to exclude characters that would interfere with the existing parsing rules, but then I felt being too generous was making it harder for future additions to the syntax.

cburgmer avatar Jun 04 '20 06:06 cburgmer

Yes, I see the problem here. Maybe this one is best tackled by not looking at the consensus too strictly, but come up with a good reasoning to which characters are allowed, and which ones are not (and why, of course).

+1

The initial Regex was trying to exclude characters that would interfere with the existing parsing rules, but then I felt being too generous was making it harder for future additions to the syntax.

I have precisely the same feeling and yet JSON is pretty permissive. Thankfully, we have bracket child notation as a fall-back if we end up being a bit too restrictive on the dot child front. I would have liked to keep the child name syntax uniform between bracket child and dot child, but I've come to realise this is unrealistic and that the bracket child notation was almost certainly introduced precisely to support a broader character set in child names.

As you observe, it's easier to relax restrictions in future than to tighten them up, so as long as we don't rule out a big chunk of current usages, we can probably be pretty restrictive.

How about starting with identifier syntax?

glyn avatar Jun 04 '20 08:06 glyn

We have a consensus in non-ASCII symbols, so we should probably start from supporting Unicode. And Unicode has a good standard for identifiers: https://unicode.org/reports/tr31/

remorhaz avatar Jun 04 '20 09:06 remorhaz

I have something like this in my implementation's lexer for a NAME token, for example:

/[\p{ID_Start}_][\p{ID_Continue}\$\-@]*/

remorhaz avatar Jun 04 '20 09:06 remorhaz

That's amazing @remorhaz I still need to read through the entirety of this document, but my understanding is that we could rely solely on Unicode to define this for us?

I've applied this locally to the Proposal A's grammar without your modifications to the two classes:

DotChildName
  = $(char:.&{ return /\p{ID_Start}/u.test(char) })
    $((char:.&{ return /\p{ID_Continue}/u.test(char) })*)
    { return text(); }

This will fail some of the queries:

  • $.key-dash
  • $.$
  • $.2
  • $.-1
  • $.2
  • $[?(@.key-50==-100)]

cburgmer avatar Jun 04 '20 16:06 cburgmer

I've come to realise this is unrealistic and that the bracket child notation was almost certainly introduced precisely to support a broader character set in child names

+1

cburgmer avatar Jun 04 '20 16:06 cburgmer

is that we could rely solely on Unicode to define this for us? I'd say, we can rely on it (and probably other Unicode classes) as a basis.

See EcmaScript identifier description for an example. They combine UnicodeLetter non-terminal (combined of several Unicode properties) with some selected ASCII chars to construct their own IdentifierStart. We can do the same using either ID_Start/ID_Continue properties or using another property set. In fact, the latter could be better because JsonPath "identifiers" are not exactly what identifiers are in most languages. My point is just using any Unicode properties we need if possible instead of inveniting our own character sets from scratch.

remorhaz avatar Jun 08 '20 12:06 remorhaz

Following on from the idea of only allowing some kind of identifier as a dotted child name:

This will fail some of the queries:

  • $.key-dash
  • $.$
  • $.2
  • $.-1
  • $[?(@.key-50==-100)]

The child names like key-dash and 2 look likely to crop up in the wild.

We might be able to allow leading digits so that $.2 is valid.

But I don't see how we can allow $.key-dash without also allowing $.key-50 (and therefore allowing @.key-50 too).

Also, I don't see how we can allow $.-1 without also allowing $.key-50.

Not sure about $ in dotted child names.

glyn avatar Jun 08 '20 13:06 glyn

We might be able to allow leading digits so that $.2 is valid.

Do note that in the current implementation of Proposal A, $.2 would only return a value for {"2": 42}, not for [0, 1, 42].

cburgmer avatar Jun 08 '20 16:06 cburgmer

Not sure about $ in dotted child names.

There's no consensus on it, though it's not difficult to implement and I'd rather vote for it.

remorhaz avatar Jun 08 '20 17:06 remorhaz

I feel we should try to allow all common identifier patterns found across the programming languages that currently implement JSONPath (if that's possible without overlapping with syntax elements too much). $ and - for example seem to be used a lot in some languages, it might make sense to go look for some more.

cburgmer avatar Jun 08 '20 19:06 cburgmer

Maybe it's worth trying to use /\p{ID_Continue}+/ as a basis? At least it will allow such identifiers as $.2. Since we don't really need to distinguish between numeric literals and identifiers here, we can avoid using ID_Start for first character of identifier, and probably we can use the same pattern for any character of identifier.

remorhaz avatar Jun 09 '20 06:06 remorhaz

Sounds reasonable if ID_Continue includes -. Does it?

glyn avatar Jun 09 '20 07:06 glyn

Sounds reasonable if ID_Continue includes -. Does it?

No, it doesn't. We should add any additional symbols explicitly to the symbol class, like /[\p{ID_Continue}\-]+/

remorhaz avatar Jun 09 '20 11:06 remorhaz

What's important to know about ID_* - these properties already exclude lots of symbols, that are technically letters, but are either narrow-specific (historic alphabets) either not a good choice for an ID (vertical scripts like Mongolian, etc.).

Unicode standard describes identifiers in the following way:

...identifier consists of a string of characters beginning with a letter or an ideograph, and followed by any number of letters, ideographs, digits, or underscores.

In JsonPath we probably don't need to restrict the first symbol, so we can just consider identifier as "non-empty sequence of letters, ideographs, digits, uderscores AND $, -, etc.". Unicode standard allows us to act this way:

...(implementation) shall declare that it uses a profile and define that profile with a precise specification of the characters that are added to or removed from the sets of code points defined by... properties.

remorhaz avatar Jun 09 '20 12:06 remorhaz

But I don't see how we can allow $.key-dash without also allowing $.key-50 (and therefore allowing @.key-50 too).

Maybe we should instead define a simple set of characters which are not allowed in an identifier, like $ @ + - * / & | ! < = > [ ] ( ) because they have a special meaning in jsonpath and thus any identifier containg those should be referred to using bracket notation? It's always easier to look though a short list of "forbidden" characters than to google a set of /\p{ID_Continue}+/ and try to figure out if your identifier matches it or not.

bhmj avatar Aug 03 '20 01:08 bhmj

But I don't see how we can allow $.key-dash without also allowing $.key-50 (and therefore allowing @.key-50 too).

Maybe we should instead define a simple set of characters which are not allowed in an identifier, like $ @ + - * / & | ! < = > [ ] ( ) because they have a special meaning in jsonpath and thus any identifier containg those should be referred to using bracket notation? It's always easier to look though a short list of "forbidden" characters than to google a set of /\p{ID_Continue}+/ and try to figure out if your identifier matches it or not.

That approach would allow letters, which according to a comment above:

are not a good choice for an ID (vertical scripts like Mongolian, etc.)

glyn avatar Aug 03 '20 05:08 glyn

But maybe that's not a great problem. The more I think, the more I agree that we should avoid Unicode properties in standard. I know what I'm talking about - I have implemented Unicode properties manually and the solution includes seeking ~600 ranges of symbols (which may require pre-building a b-tree index!)

Mongolian letters and cuneiform poorly fit in strings, too; yet JSON allows them and no one seems to suffer.

remorhaz avatar Aug 03 '20 09:08 remorhaz

This approach leads us to /\w+/ character set for dot notated identifiers and the rest is covered by brackets. Problem solved )

bhmj avatar Aug 03 '20 11:08 bhmj

Which definition of \w do you have in mind? Python's definition depends on the current locale, for example.

glyn avatar Aug 03 '20 11:08 glyn

The classic [a-zA-Z0-9_] as with re.ASCII parameter.

bhmj avatar Aug 03 '20 12:08 bhmj

I really like the idea of keeping this simple rather than introducing a "cottage industry" of identifier parsing. But I suspect that omitting - would break many existing queries. How about using the character set [a-zA-Z0-9_-] instead?

glyn avatar Aug 03 '20 14:08 glyn

Well, adding - to the character set will make it impossible to use arithmetic for filter expressions like $.foo[?(@.key-5 > 10)]. Are we ready to reject even simplest filter arithmetic at all? Alternatively we could oblige to use spaces to separate math operators and identifiers, but that is pretty tough and will bring A LOT of confusion: @.key-5 > 10 vs @.key - 5 > 10. So I suggest sticking with [a-zA-Z0-9_].

bhmj avatar Aug 03 '20 14:08 bhmj

I agree the alternative of using spaces to separate math operators and identifiers isn't desirable.

So there is a choice between allowing - in dotted child names and allowing arithmetic expressions in filters.

I guess Proposal A could go with [a-zA-Z0-9_] for now and adjust this later if necessary.

@cburgmer How would you like to play this? Should Proposal A try to lead the standard or follow it later?

glyn avatar Aug 03 '20 15:08 glyn