jq is partially laxer in what it parses when used with `--stream` flag
Describe the bug
When using jq to validate JSON input of arbitrary size for syntax errors, we would like to favor the --stream
flag so that we can validate files whose internal representation would exceed our RAM or other imposed limits.
To Reproduce
% cat /tmp/test-jq.sh
#!/bin/sh
check_JSON_validity() {
    json="$1"; shift
    printf '%s' "$json" | jq "$@" > /dev/null 2>&1
    rc=$?
    if [ "$rc" -eq 0 ]; then
        printf 'valid json'
    else
        printf 'jq returned %d' "$rc"
    fi
}
jq --version || { >&2 echo "jq is not installed."; exit 1; }
compare_validity_checks() {
    INPUT="$1"
    if [ "$(check_JSON_validity "$INPUT")" = "$(check_JSON_validity "$INPUT" --stream)" ]; then
        echo "OK, that version of jq consistently validates input no matter if --stream mode is on or off."
        return 0
    else
        echo "NOT OK, that version of jq inconsistently chokes on invalid input depending on --stream flag."
        return 1
    fi
}
compare_validity_checks '{["invalid", "input"]}'
compare_validity_checks '{"a":["valid", "input"]}'
compare_validity_checks '{"a":b["valid", "input"]}'
me@linux ~ % sh /tmp/test-jq.sh
jq-1.6
NOT OK, that version of jq inconsistently chokes on invalid input depending on --stream flag.
OK, that version of jq consistently validates input no matter if --stream mode is on or off.
OK, that version of jq consistently validates input no matter if --stream mode is on or off.
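For reference, a strict JSON parser agrees with jq's normal mode on these three inputs. A quick cross-check with Python's json module (illustrative, not part of the original report):

```python
import json

samples = [
    '{["invalid", "input"]}',    # object member must be "key": value
    '{"a":["valid", "input"]}',  # well-formed
    '{"a":b["valid", "input"]}', # bare identifier b is not valid JSON
]

def is_valid_json(text):
    """Return True iff text parses as strict JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print([is_valid_json(s) for s in samples])  # [False, True, False]
```

Only the second sample is valid, which is what jq without --stream reports; jq-1.6 with --stream accepts the first sample as well.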
Expected behavior
jq should not be laxer in its interpretation of input in stream mode than it is in normal mode. If the laxness is inherently due to how stream mode works, you might instead consider an --early-free-memory SELECTORS
option that I could set to '.' for mere validation purposes?
Environment (please complete the following information):
- Ubuntu Linux 20.04
- jq 1.6
Additional context
Long-term preservation endeavours will also cover JSON files as they become more and more prominent in data processing across all branches and sciences. We need to validate all that data before we can archive it, without risking running out of memory while digesting large datasets.
Nice find, looks like a bug in jv_parse.c:stream_token. Maybe a workaround for now is something like jq --stream '.[0][] | nulls | error("null path entry")'?
But I'm not sure how much care has been taken to make sure streaming mode does strict JSON validation.
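For context on the suggested workaround: with --stream, jq emits events of the form [path, leaf-value] (plus trailing [path] events that close arrays and objects), so .[0][] iterates over the path components of each event and the filter errors out as soon as a null path entry appears. A small Python sketch of that check over hand-written stream events (events_broken is a made-up event for illustration, not actual jq output):

```python
# Stream events in jq's --stream shape: [path, value] for leaves,
# [path] when a container closes. These model '{"a":["valid","input"]}'.
events_valid = [
    [["a", 0], "valid"],
    [["a", 1], "input"],
    [["a", 1]],
    [["a"]],
]
# Hypothetical event with a null path entry, as the workaround assumes
# jq 1.6 produces for malformed input like '{["invalid","input"]}'.
events_broken = [[[None, 0], "invalid"]]

def has_null_path_entry(events):
    # mirrors the jq filter .[0][] | nulls | error("null path entry")
    return any(entry is None for event in events for entry in event[0])

print(has_null_path_entry(events_valid))   # False
print(has_null_path_entry(events_broken))  # True
```

The jq filter does the same scan event by event, so memory stays bounded even for huge inputs.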
With the proposed selectors applied, the "NOT OK" just moves down a line, so nothing is really gained. ;)
> But I'm not sure how much care has been taken to make sure streaming mode does strict JSON validation.
Yes, syntax validation is admittedly kind of an off-label use of jq. If this issue is therefore closed, I would not reopen it. But then we might need to write a JSON validator of our own. Maybe we could leverage object_hook=replace_whatever_by_null with json.load in Python 3.
> With the proposed selectors applied, the "NOT OK" just moves down a line, so nothing is really gained. ;)
Aha ok, didn't see that. With jq master it seems to be a bit different:
$ echo '{"a":b["valid", "input"]}' | jq --stream .
jq: parse error: Invalid numeric literal at line 1, column 7
$ echo '{"a":b["valid", "input"]}' | jq .
jq: parse error: Invalid numeric literal at line 1, column 7
> Yes, syntax validation is admittedly kind of an off-label use of jq. If this issue is therefore closed, I would not reopen it. But then we might need to write a JSON validator of our own. Maybe we could leverage object_hook=replace_whatever_by_null with json.load in Python 3.
Yes probably a good idea
If jq is not doing what you want, then you might wish to consider using gojq, the Go implementation of jq: it rejects your invalid JSON early, with an error message and an error code of 5.
@wader, jq master accepts the .[0][] | nulls | error("null path entry") selector with the input of the second test without the --stream flag?
On Ubuntu 20.04, jq-1.6 erroneously rejects it, but accepts it with --stream applied:
# jq '.[0][] | nulls | error("null path entry")' <<<'{"a":["valid", "input"]}'
jq: error (at <stdin>:1): Cannot index object with number
# jq --stream '.[0][] | nulls | error("null path entry")' <<<'{"a":["valid", "input"]}'
#
Yes jq master seems to be consistent:
$ sh test-jq.sh
jq-1.6-139-ga9ce724
OK, that version of jq consistently validates input no matter if --stream mode is on or off.
OK, that version of jq consistently validates input no matter if --stream mode is on or off.
OK, that version of jq consistently validates input no matter if --stream mode is on or off.
Actually I think I'm mistaken; I noticed now that jq master exits with 0 on all (?) parse errors. Strange, I wonder what could have changed this. If I add -e
it seems to exit with 4 on parse errors.
I have a feeling PR https://github.com/stedolan/jq/pull/1697 is related to the behaviour change; issue https://github.com/stedolan/jq/issues/2146 also seems to be about the same thing. A bit tricky: if there are multiple JSON inputs, should jq just print the error, move along, and exit with 0 (if -e is not provided), or should it exit with 1 or 4?
Just thought I'd post my workaround Python script for memory-efficient validation of arbitrarily large JSON input. It replaces every object (key-value mapping) by null straight after reading it, effectively freeing the memory of all data within:
diff --git a/path/to/my/json-syntax-validator.py b/path/to/my/json-syntax-validator.py
new file mode 100755
index 0000000..0564fa8
--- /dev/null
+++ b/path/to/my/json-syntax-validator.py
@@ -0,0 +1,17 @@
+#!/usr/bin/env python3
+"""
+json-syntax-validator.py - JSON syntax validator for arbitrarily large JSON data
+"""
+
+import json, sys
+
+infile = open('/dev/fd/0' if len(sys.argv) == 1 else sys.argv[1])
+
+called = 0
+def notify_drop(d):
+ global called
+ called += 1
+ return None
+
+print(json.load(infile, object_hook=notify_drop))
+print("Called: " + str(called))
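To make the object_hook behaviour in the script above concrete: json.load/json.loads invokes the hook once per decoded JSON object, innermost first, and whatever the hook returns replaces the just-parsed mapping. A minimal standalone sketch:

```python
import json

calls = 0

def drop_object(obj):
    # Called once per decoded JSON object, innermost first.
    # Returning None discards the just-parsed mapping immediately.
    global calls
    calls += 1
    return None

result = json.loads('{"outer": {"inner": [1, 2, 3]}}', object_hook=drop_object)
print(calls)   # 2 (inner object first, then the outer one)
print(result)  # None (the top-level object was replaced as well)
```

Note that a parse error still raises json.JSONDecodeError, so the validator script exits non-zero on invalid input via the uncaught exception.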
Just in case you ponder an --early-free-memory-of-unneeded SELECTORS option: also consider lists and strings, which could occupy painfully large amounts of RAM if huge.