Parsing zero-padded numbers

Open kklingenberg opened this issue 11 months ago • 6 comments

While parsing zero-padded numbers I came across this minor issue. This is a minimal example:

$ echo "0012" | jaq .
0
0
12

Whereas jq yields just 12.

This is serde_json at work, which in turn is probably following the JSON spec (that's my guess). Here is another view of the issue:

$ echo "0012" | jaq -R fromjson 
Error: cannot parse 0012 as JSON: end of file expected

jaq's lexer also rejects these numbers (which is fine, and consistent with its JSON parser). jq, in turn, is consistent with its own lenient parser:

$ jaq -n '0012'
Error: Unexpected token, expected as, *, +=, /=, %=, >=, /, ?, %, and, =, or, +, -, |, [, end of input, ==, -=, |=, //, *=, <=, ., !=, ,, >, <
   ╭─[<unknown>:1:2]
   │
 1 │ 0012
   │  ┬  
   │  ╰── Unexpected token 0
───╯

$ jq -n '0012'
12

Anyway, while attempting to work with these numbers one could hope to use the tonumber filter, but that's also implemented in terms of fromjson, so no luck there.

My suggestion is to either:

  • document the non-leniency of the JSON parser, and the difference with jq's
  • provide a tonumber filter that's more tolerant
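
To make the second option concrete, here is a minimal sketch of what a more tolerant conversion could look like. It is written in Python rather than as a jaq filter, and `lenient_tonumber` is a hypothetical name, not something jaq or jq provides. The assumption is that "tolerant" means only "accept leading zeros"; everything else still goes through a strict JSON parse.

```python
import json

def lenient_tonumber(text: str):
    """Hypothetical helper: parse a number, tolerating leading zeros only."""
    sign = "-" if text.startswith("-") else ""
    digits = text[len(sign):]
    if not digits:
        raise ValueError(f"not a number: {text!r}")
    # Drop leading zeros, but put one back when nothing numeric is left in
    # front (covers "0", "000", "00.5", and "0e5").
    stripped = digits.lstrip("0")
    if not stripped or not stripped[0].isdigit():
        stripped = "0" + stripped
    # json.loads raises json.JSONDecodeError (a ValueError) on malformed input.
    value = json.loads(sign + stripped)
    # A full JSON parser also accepts objects, strings, and so on; reject those.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"not a number: {text!r}")
    return value

print(lenient_tonumber("0012"))  # 12
print(lenient_tonumber("-007"))  # -7
```

With this definition, inputs such as `"{}"` or `" 12 "` raise an error instead of being accepted.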

kklingenberg, Mar 17 '24 15:03

An example of another side-effect of the current implementation of tonumber:

$ echo '"{}"' | jaq tonumber
{}

kklingenberg, Mar 17 '24 17:03

Related: https://github.com/jqlang/jq/pull/3055. jq used to allow whitespace in tonumber, but no longer does.

wader, Mar 17 '24 17:03

Interesting. So yet another side effect of tonumber just being fromjson is that it tolerates whitespace:

$ echo ' 12 ' | jaq -Rc '[., tonumber]'
[" 12 ",12]
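
Both side effects follow directly from delegating to a full JSON parser. The following sketch models that in Python, with the standard `json` module standing in for jaq's parser; `tonumber_via_fromjson` is a hypothetical name used only for illustration.

```python
import json

def tonumber_via_fromjson(text: str):
    # Models a tonumber that simply hands its input to a JSON parser:
    # any valid JSON document is accepted, not just numbers.
    return json.loads(text)

print(tonumber_via_fromjson("{}"))    # {}
print(tonumber_via_fromjson(" 12 "))  # 12
```

The parser accepts the object because it is valid JSON, and it accepts the padded number because JSON allows whitespace around a value.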

kklingenberg, Mar 17 '24 17:03

@kklingenberg - Good catch re jaq -n '"{}"|tonumber'. That's a bug that needs fixing.

Since different dialects of jq have and will probably continue to have very different implementations of tonumber, I think it would be good if jaq could lead the way with respect to a non-strict version, and in that spirit I'd like to propose that tonumber(regex) be defined using match/1, perhaps along the following lines:

def tonumber(regex): match(regex).string | sub("^00*"; "0") | strict_tonumber;

it being understood that strict_tonumber is a strict version of tonumber, i.e. it would result in an error if its string input does not conform to the JSON specification of a number.
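
The shape of this proposal can be sketched outside of jq as well. The Python version below is only an illustration under stated assumptions: `strict_tonumber` validates against the number grammar of RFC 8259, and `tonumber` takes the regex as a second argument. The normalisation step here strips leading zeros only when another digit follows, which is a choice made for this sketch and not the exact `sub` call from the definition above.

```python
import json
import re

# JSON number grammar (RFC 8259): optional minus, integer part without
# leading zeros, optional fraction, optional exponent.
JSON_NUMBER = re.compile(r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?")

def strict_tonumber(text: str):
    """Accept only strings that are exactly one JSON number."""
    if JSON_NUMBER.fullmatch(text) is None:
        raise ValueError(f"not a JSON number: {text!r}")
    return json.loads(text)

def tonumber(text: str, regex: str):
    """Extract the first match of regex, normalise it, then parse strictly."""
    found = re.search(regex, text)
    if found is None:
        raise ValueError(f"no match for {regex!r} in {text!r}")
    # Assumption of this sketch: remove leading zeros that precede a digit.
    normalised = re.sub(r"^(-?)0+(?=[0-9])", r"\1", found.group(0))
    return strict_tonumber(normalised)

print(tonumber("0012", r"[0-9]+"))  # 12
print(tonumber(" 12 ", r"[0-9]+"))  # 12
```

The caller decides how lenient the conversion is through the regex, while `strict_tonumber` on its own rejects `"0012"`, `" 12 "`, and `"{}"`.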

pkoppstein, Mar 17 '24 19:03

Regarding the "weird" number parsing behaviour for "0012": this is unfortunate, I agree, but it stems from the fact that sequences of JSON values are not standardised (I believe). First, JSON numbers cannot have multiple leading 0s, as the JSON spec shows, so as soon as a leading 0 is not followed by [.eE], we know that we are dealing with just the number 0, and everything else is part of a new value. Next, jq allows values to be concatenated without whitespace, such as [1][2]. So I generalised this to allow concatenation of any JSON values without whitespace. That includes numbers, and this is responsible for the behaviour seen when parsing "0012". I'm not saying that this behaviour is very intuitive, but I think that it is consistent.
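
The same reasoning can be reproduced with any strict JSON parser that is asked for one value at a time. The sketch below uses Python's `json.JSONDecoder.raw_decode` as a stand-in for jaq's parser (an assumption for illustration; jaq itself is written in Rust), and `parse_values` is a hypothetical helper name.

```python
import json

def parse_values(text: str) -> list:
    """Parse concatenated JSON values, with or without whitespace between them."""
    decoder = json.JSONDecoder()
    values, pos = [], 0
    while True:
        # Skip optional JSON whitespace between values.
        while pos < len(text) and text[pos] in " \t\r\n":
            pos += 1
        if pos == len(text):
            return values
        # raw_decode parses exactly one value and reports where it ended.
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)

print(parse_values("0012"))    # [0, 0, 12]
print(parse_values("[1][2]"))  # [[1], [2]]
```

Each leading 0 is a complete number under the strict grammar, so the parser stops after it and starts a new value at the next character.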

01mf02, Apr 09 '24 10:04

Regarding tonumber, I still have to think a bit about how to do this best ...

01mf02, Apr 09 '24 10:04