jq is partially laxer in what it parses when used with `--stream` flag

Open UmbrellaDish opened this issue 1 year ago • 8 comments

Describe the bug

When jq is used to validate JSON input of arbitrary size for syntax errors, we would like to use the --stream flag so that we can validate files whose internal representation exceeds our RAM or any imposed limits. With --stream, however, jq 1.6 accepts some invalid input that it rejects without the flag.

To Reproduce

% cat /tmp/test-jq.sh 
#!/bin/sh

check_JSON_validity() {
    json="$1"; shift
    printf '%s' "$json" | jq "$@" > /dev/null 2>&1
    rc=$?  # capture jq's exit status before the [ test overwrites $?
    if [ "$rc" -eq 0 ]; then
        printf 'valid json'
    else
        printf 'jq returned %d' "$rc"
    fi
}

jq --version || { >&2 echo "jq is not installed."; exit 1; }

compare_validity_checks() {
    INPUT="$1"
    if [ "$(check_JSON_validity "$INPUT")" = "$(check_JSON_validity "$INPUT" --stream)" ]; then
        echo "OK, that version of jq consistently validates input no matter if --stream mode is on or off."
        return 0
    else
        echo "NOT OK, that version of jq inconsistently chokes on invalid input depending on --stream flag."  
        return 1
    fi
}

compare_validity_checks '{["invalid", "input"]}'
compare_validity_checks '{"a":["valid", "input"]}'
compare_validity_checks '{"a":b["valid", "input"]}'
me@linux ~ % sh /tmp/test-jq.sh
jq-1.6
NOT OK, that version of jq inconsistently chokes on invalid input depending on --stream flag.
OK, that version of jq consistently validates input no matter if --stream mode is on or off.
OK, that version of jq consistently validates input no matter if --stream mode is on or off.

Expected behavior

jq should not be laxer in its interpretation of input in stream mode than it is in normal mode. If the laxness is inherent in how stream mode works, you might instead consider an --early-free-memory SELECTORS option that I could set to '.' for mere validation purposes.

Environment (please complete the following information):

  • Ubuntu Linux 20.04
  • jq 1.6

Additional context

Long-term preservation efforts will also cover JSON files as they become more and more prominent in data processing across all disciplines and sciences. We need to validate all that data before archiving it, without risking running out of memory while digesting large datasets.

UmbrellaDish avatar Aug 10 '22 13:08 UmbrellaDish

Nice find, looks like a bug in jv_parse.c:stream_token. Maybe a workaround for now is something like jq --stream '.[0][] | nulls | error("null path entry")'?

But I'm not sure how much care has been taken to make sure streaming mode does strict JSON validation.
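
To make the suggested workaround more concrete: with --stream, jq emits one event of the form [path, leaf] per leaf value, plus a closing event [path] for each non-empty array or object, and the filter .[0][] | nulls looks for null entries inside those paths. Below is a rough Python sketch of that event shape and of the null-entry check. The names to_stream and has_null_path_entry are made up for illustration, and the sketch only mimics the events jq produces for input that is already valid; it says nothing about how jv_parse.c treats invalid input.

```python
import json


def to_stream(value, path=()):
    """Yield jq-style stream events for an already parsed JSON value."""
    if isinstance(value, dict) and value:
        items = list(value.items())
    elif isinstance(value, list) and value:
        items = list(enumerate(value))
    else:
        # Scalars and empty containers are leaves: [path, leaf]
        yield [list(path), value]
        return
    for key, child in items:
        yield from to_stream(child, path + (key,))
    # Closing event: the path of the last element of this container
    yield [list(path) + [items[-1][0]]]


def has_null_path_entry(event):
    """Rough equivalent of `.[0][] | nulls`: is any path component null?"""
    return any(key is None for key in event[0])


if __name__ == "__main__":
    for event in to_stream(json.loads('{"a":["valid", "input"]}')):
        print(json.dumps(event), has_null_path_entry(event))
```

For the second test input, {"a":["valid", "input"]}, this yields [["a",0],"valid"], [["a",1],"input"], [["a",1]] and [["a"]], none of which has a null path entry.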

wader avatar Aug 10 '22 14:08 wader

With the proposed selector applied, "NOT OK" just moves one line down, so nothing is gained. ;)

> But I'm not sure how much care has been taken to make sure streaming mode does strict JSON validation.

Yes, syntax validation is admittedly kind of an off-label use of jq. If this issue is closed for that reason, I would not reopen it. But we might then need to write a JSON validator of our own. Maybe we could leverage object_hook=replace_whatever_by_null with json.load in python3.

UmbrellaDish avatar Aug 10 '22 15:08 UmbrellaDish

> With the proposed selector applied, "NOT OK" just moves one line down, so nothing is gained. ;)

Aha ok, didn't see that. With jq master it seems to be a bit different.

$ echo '{"a":b["valid", "input"]}' | jq --stream .
jq: parse error: Invalid numeric literal at line 1, column 7
$ echo '{"a":b["valid", "input"]}' | jq .
jq: parse error: Invalid numeric literal at line 1, column 7

> Yes, syntax validation is admittedly kind of an off-label use of jq. If this issue is closed for that reason, I would not reopen it. But we might then need to write a JSON validator of our own. Maybe we could leverage object_hook=replace_whatever_by_null with json.load in python3.

Yes probably a good idea

wader avatar Aug 10 '22 16:08 wader

If jq is not doing what you want, then you might wish to consider using gojq, the Go implementation of jq: it rejects your invalid JSON early, with an error message and an error code of 5.

pkoppstein avatar Aug 10 '22 19:08 pkoppstein

@wader, does jq master accept the .[0][] | nulls | error("null path entry") selector for the input of the second test without the --stream flag?

On Ubuntu 20.04, jq-1.6 erroneously rejects it, but accepts it with --stream applied:

# jq '.[0][] | nulls | error("null path entry")' <<<'{"a":["valid", "input"]}' 
jq: error (at <stdin>:1): Cannot index object with number
# jq --stream '.[0][] | nulls | error("null path entry")' <<<'{"a":["valid", "input"]}'
# 

UmbrellaDish avatar Aug 11 '22 09:08 UmbrellaDish

Yes jq master seems to be consistent:

$ sh test-jq.sh
jq-1.6-139-ga9ce724
OK, that version of jq consistently validates input no matter if --stream mode is on or off.
OK, that version of jq consistently validates input no matter if --stream mode is on or off.
OK, that version of jq consistently validates input no matter if --stream mode is on or off.

wader avatar Aug 11 '22 09:08 wader

Actually I think I'm mistaken: I just noticed that jq master exits with 0 on all (?) parse errors. Strange, I wonder what could have changed this. If I add -e, it seems to exit with 4 on parse errors.

I have a feeling PR https://github.com/stedolan/jq/pull/1697 is related to the behaviour change, and issue https://github.com/stedolan/jq/issues/2146 seems to be about the same thing. It's a bit tricky: if there are multiple JSON inputs, should jq just print the error, move along and exit with 0 (unless -e is provided), or should it exit with 1 or 4?
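
Since a jq build may exit with 0 even though it printed a parse error, a validation wrapper probably should not rely on the exit status alone. The following Python sketch treats a run as failed if the exit status is non-zero or if stderr contains the text "parse error". The function name passes_validation is hypothetical, and matching on that substring is an assumption based only on the error messages quoted earlier in this thread; other builds may word their messages differently.

```python
import subprocess


def passes_validation(command, data):
    """Run command with data (bytes) on stdin.

    Returns True only if the command exits with 0 AND prints no parse error,
    so a build that reports the error but still exits 0 is counted as a failure.
    """
    result = subprocess.run(
        command,
        input=data,
        stdout=subprocess.DEVNULL,  # the output itself is not needed for validation
        stderr=subprocess.PIPE,
    )
    return result.returncode == 0 and b"parse error" not in result.stderr
```

It could be called as passes_validation(["jq", "--stream", "."], b'{"a":["valid", "input"]}'), assuming jq is on the PATH.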

wader avatar Aug 11 '22 09:08 wader

I thought I'd post my workaround Python script for memory-efficient validation of arbitrarily large JSON input. It replaces every object (key-value mapping) with null right after it is read, effectively freeing the memory held by the data within:

diff --git a/path/to/my/json-syntax-validator.py b/path/to/my/json-syntax-validator.py
new file mode 100755
index 0000000..0564fa8
--- /dev/null
+++ b/path/to/my/json-syntax-validator.py
@@ -0,0 +1,17 @@
+#!/usr/bin/env python3
+"""
+json-syntax-validator.py - JSON syntax validator for arbitrary large JSON data
+"""
+
+import json, sys
+
+infile = open('/dev/fd/0' if len(sys.argv) == 1 else sys.argv[1])
+
+called = 0
+def notify_drop(d):
+    global called
+    called += 1
+    return None
+
+print(json.load(infile, object_hook=notify_drop))
+print("Called: " + str(called))

In case you do consider a --early-free-memory-of-unneeded SELECTORS option, also think about lists and strings, which could occupy a painful amount of RAM if huge.
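
The point about lists can be seen in a small variant of the script above. This is only a sketch: the function name validate_json_text and its return shape are invented here. It works on a string rather than a file, returns a result instead of printing a traceback, and additionally rejects NaN, Infinity and -Infinity, which Python's json module accepts by default although they are not valid JSON. Note that object_hook only fires for objects, so a top-level list of scalars is still built in full, and json.load itself reads the whole file as text before parsing.

```python
import json


def _reject_constant(name):
    # Called by the json module for NaN, Infinity and -Infinity
    raise ValueError("non-standard JSON constant: " + name)


def validate_json_text(text):
    """Return (is_valid, objects_dropped) for a JSON document given as str."""
    dropped = 0

    def drop_object(_obj):
        # Replace every decoded object by None so its contents can be freed
        nonlocal dropped
        dropped += 1
        return None

    try:
        json.loads(text, object_hook=drop_object, parse_constant=_reject_constant)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False, dropped
    return True, dropped


if __name__ == "__main__":
    for sample in ('{["invalid", "input"]}',
                   '{"a":["valid", "input"]}',
                   '{"a":b["valid", "input"]}'):
        print(sample, validate_json_text(sample))
```

With the three inputs from the test script, only the second one is reported as valid.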

UmbrellaDish avatar Aug 11 '22 10:08 UmbrellaDish