jaq
Parsing zero-padded numbers
While parsing zero-padded numbers I came across this minor issue. This is a minimal example:
$ echo "0012" | jaq .
0
0
12
Whereas jq yields just 12.
This is serde_json at work, which in turn is probably following the JSON spec (that's my guess). Here is another view of the issue:
$ echo "0012" | jaq -R fromjson
Error: cannot parse 0012 as JSON: end of file expected
The lexer rejects these numbers too (which is fine, and consistent with the JSON parser). jq's lexer accepts them, which is likewise consistent with its lenient parser:
$ jaq -n '0012'
Error: Unexpected token, expected as, *, +=, /=, %=, >=, /, ?, %, and, =, or, +, -, |, [, end of input, ==, -=, |=, //, *=, <=, ., !=, ,, >, <
╭─[<unknown>:1:2]
│
1 │ 0012
│ ┬
│ ╰── Unexpected token 0
───╯
$ jq -n '0012'
12
Anyway, when working with these numbers one might hope to use the tonumber filter, but that is also implemented in terms of fromjson, so no luck there.
My suggestion is to either:
- document the non-leniency of the JSON parser, and the difference with jq's
- provide a tonumber filter that's more tolerant
An example of another side effect of the current implementation of tonumber:
$ echo '"{}"' | jaq tonumber
{}
Related: https://github.com/jqlang/jq/pull/3055. jq used to allow whitespace in the input to tonumber, but not anymore.
Interesting. So yet another side effect of tonumber just being fromjson is that it tolerates whitespace:
$ echo ' 12 ' | jaq -Rc '[., tonumber]'
[" 12 ",12]
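For readers who want to see why all of these side effects share one cause, here is a minimal Python sketch. It is only a model, not jaq's code: tonumber_via_fromjson is a hypothetical name, and Python's json module stands in for jaq's JSON parser on the assumption that both follow the same number grammar.

```python
import json

def tonumber_via_fromjson(text):
    # Hypothetical model of a tonumber that simply delegates to a full
    # JSON parser and never checks that the result is a number.
    return json.loads(text)

for sample in ['{}', ' 12 ', '0012']:
    try:
        print(repr(sample), '->', tonumber_via_fromjson(sample))
    except json.JSONDecodeError as err:
        print(repr(sample), '-> rejected:', err.msg)
```

In this model the object and the whitespace-padded number are both accepted, because any valid JSON text passes, while the zero-padded number is rejected.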
@kklingenberg - Good catch re jaq -n '"{}"|tonumber'. That's a bug that needs fixing.
Since different dialects of jq have, and will probably continue to have, very different implementations of tonumber, I think it would be good if jaq could lead the way with respect to a non-strict version, and in that spirit I'd like to propose that tonumber(regex) be defined using match/1, perhaps along the following lines:
def tonumber(regex): match(regex).string | sub("^00*"; "0") | strict_tonumber;
it being understood that strict_tonumber is a strict version of tonumber, i.e. it would result in an error if its string input does not conform to the JSON specification of a number.
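To make the strict/lenient split concrete, here is a rough Python sketch of the same idea. The names strict_tonumber and lenient_tonumber, and the zero-stripping rule, are assumptions of this sketch rather than anything jaq or jq provides. The stripping rule removes all redundant leading zeros (so "0012" becomes "12"), which is not the same expression as the sub call in the definition above.

```python
import re

# JSON number grammar: optional minus, integer part without a leading
# zero (unless it is exactly 0), optional fraction, optional exponent.
JSON_NUMBER = re.compile(r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?")

def strict_tonumber(text):
    # Hypothetical strict variant: the whole string must be a JSON number.
    if JSON_NUMBER.fullmatch(text) is None:
        raise ValueError(f"not a JSON number: {text!r}")
    return float(text) if any(c in text for c in ".eE") else int(text)

def lenient_tonumber(text, pattern):
    # Hypothetical lenient variant: take the first regex match, drop
    # redundant leading zeros, then defer to the strict parser.
    found = re.search(pattern, text)
    if found is None:
        raise ValueError(f"no match for {pattern!r} in {text!r}")
    digits = re.sub(r"^(-?)0+(?=[0-9])", r"\1", found.group(0))
    return strict_tonumber(digits)

print(lenient_tonumber("0012", r"-?[0-9]+"))  # 12
print(lenient_tonumber(" 12 ", r"-?[0-9]+"))  # 12
```

Keeping the strict parser as the last step means the lenient variant can only widen what is accepted through the caller's regex, so inputs such as "{}" still fail.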
Regarding the "weird" number parsing behaviour for "0012": This is unfortunate, I agree, but it stems from the fact that sequences of JSON values are not standardised (I believe). First, JSON numbers cannot have a leading 0 followed by further digits, as we can see from the JSON spec, so as soon as a leading 0 is not followed by [.eE], we know that we are dealing with just the number 0, and everything else is part of a new value. Next, jq allows values to be concatenated without whitespace, such as [1][2]. So I generalised this to allowing concatenation of any JSON values without whitespace. That includes numbers, and this is responsible for the behaviour exposed by parsing "0012".
I'm not saying that this behaviour is very intuitive. But I think that it is consistent.
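The reading described above can be reproduced with a small Python sketch. This is not jaq's parser: split_concatenated is a hypothetical helper, and it relies on Python's json.JSONDecoder.raw_decode, which decodes one value starting at a given offset and reports where it stopped.

```python
import json

def split_concatenated(text):
    # Decode JSON values packed back to back, with optional whitespace
    # between them, by repeatedly decoding one value at a time.
    decoder = json.JSONDecoder()
    values = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)
    return values

print(split_concatenated("0012"))    # [0, 0, 12]
print(split_concatenated("[1][2]"))  # [[1], [2]]
```

Each leading 0 ends a number on its own because the grammar does not let further digits follow it, so the remaining digits start a new value.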
Regarding tonumber, I still have to think a bit about how to do this best ...