jq
Make maximum depth configurable via CLI option
Addresses #2846. (I didn't see any contributor guidelines/linting info... please let me know if you'd like me to change anything!)
This PR adds state in two places to track maximum parser depth: in `jq_state` and in `jv_parser`.
It might seem like `jv_parser` would be enough on its own, but many other parts of the code parse values while holding only a `jq_state`. It's not clear to me that these aren't mostly/entirely raw parses (which wouldn't need to track depth), but it seemed easiest to track the state thoroughly.
There are a few helper functions at the end of `jv_parse.c` that have no handle at all---neither a `jq_state` nor a `jv_parser`. I had these just use `DEFAULT_MAX_PARSING_DEPTH`. They seem to be mostly for fuzzing, but `jv_parse` gets called in `main` during option processing. It wouldn't be hard to have this use the current `parser_maxdepth`.
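The depth-tracking idea can be sketched in a few lines. This is an illustrative Python sketch only, not jq's actual C implementation, and the names (`scan_depth`, `DEFAULT_MAX_PARSING_DEPTH` as a Python constant) are my own: the parser bumps a counter at every `[` or `{`, errors out cleanly once the counter passes the limit (with 0 meaning unbounded, as `--depth 0` does), and decrements at `]` or `}`.

```python
# Illustrative sketch only: jq's real parser is written in C; these
# names are hypothetical, not jq's API.
DEFAULT_MAX_PARSING_DEPTH = 256  # the default discussed in this thread

def scan_depth(text, max_depth=DEFAULT_MAX_PARSING_DEPTH):
    """Walk JSON-ish text, tracking bracket nesting depth.

    A max_depth of 0 means unbounded, mirroring --depth 0.
    Raises a clean error instead of letting deep input crash later.
    """
    depth = 0
    deepest = 0
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            # Don't count brackets inside string literals.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch in "[{":
            depth += 1
            if max_depth and depth > max_depth:
                raise ValueError(f"max parsing depth ({max_depth}) exceeded")
            deepest = max(deepest, depth)
        elif ch in "]}":
            depth -= 1
    return deepest
```

For example, `scan_depth('[[0]]')` returns 2, while input nested 300 deep raises under the default limit of 256 but succeeds with `max_depth=0`.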
Implementation looks good to me. There was some discussion in #2846 about what the default should be, unbounded or 256 (or something else?).
Should add CLI documentation to manual.yml
> Should add CLI documentation to manual.yml
In 3b0d520.
> Implementation looks good to me. There was some discussion in #2846 about what the default should be, unbounded or 256 (or something else?).
Thanks!
I would recommend leaving the default maximum parsing depth at 256, as changing the default depth to unbounded opens up a DoS vector. It might make sense to recommend the `--depth` argument in the error message... should I add that to the message?
> Should add CLI documentation to manual.yml
>
> In 3b0d520.
👍
> I would recommend leaving the default maximum parsing depth at 256, as changing the default depth to unbounded opens up a DoS vector.
Ok, let's see what @nicowilliams thinks
> It might make sense to recommend the `--depth` argument in the error message... should I add that in to the message?
Yeap sounds like a good idea
You probably want to do `make jq.1.prebuilt tests/man.test` to regenerate the man page and man tests
BTW https://greenberg.science/ layout is 🥇😄
Do y'all have a sense of a timeline on this merging (and the next release)? (Not asking to rush you, just for planning on our end.)
@mgree Hey, I pinged the other maintainers on Discord yesterday about this PR and some other ones, so hopefully they review it soon. As it adds a new CLI argument, it's probably good that at least 2 others approve it.
Looking at the issue @mgree ran into in https://github.com/MaterializeInc/materialize/pull/25972 with the 256 depth limit, I would say it's a good argument for increasing it or just making it unbounded by default.
For Materialize's use case, we would want no depth limit: we're looking at non-adversarial inputs that we know to be deeply nested (ASTs with a particularly nested JSON representation). But we're happy to use `--depth 0` on all of our invocations---NBD.
A quick experiment shows it's probably pretty safe to have no limit by default from a time-performance perspective, but maybe not from a memory or DoS perspective. Objects nested 1024 * 1024 deep make a 6MB file, but jq allocates 0.5GB before segfaulting; lists nested 1024 * 1024 deep make a 2MB file, but jq allocates 0.3GB before segfaulting. Scaling seems to be linear: if we make an object of depth 1024 * 1024 * 4, the 25MB file leads jq to allocate just under 2GB, segfaulting after 1.7s.
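As a rough sanity check on that linear scaling (using the approximate figures above, so treat these as order-of-magnitude only), the per-level allocation overhead works out to a few hundred bytes:

```python
# Back-of-the-envelope check of the scaling reported above. The inputs
# are the rough figures from the experiment, not exact measurements.
depth = 1024 * 1024
alloc_obj = 0.5 * 2**30    # ~0.5 GB allocated for objects nested 1M deep
alloc_list = 0.3 * 2**30   # ~0.3 GB allocated for lists nested 1M deep

bytes_per_level_obj = alloc_obj / depth    # ~512 bytes per nesting level
bytes_per_level_list = alloc_list / depth  # ~307 bytes per nesting level
```

That is consistent with each nesting level costing a small fixed number of heap allocations, which would explain why quadrupling the depth roughly quadruples the memory.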
At a minimum, I'd guess these segfaults should be caught and turned into proper aborts. At worst, maybe they're bugs? (Maybe with printing? Because a lot of work seems to happen before things crash...)
It might be kindest to users to have a maximum depth by default---that way users don't discover the hard way that adversarial files can use lots of memory and/or crash jq. Not that I would use jq on nasty files willy nilly, but... people do all kinds of things. 😁
Here's a script for making deep JSON objects and lists (EDIT 2024-04-03: noticed a bug in deep_list.sh 🙄):
```sh
#!/bin/sh
# deep_obj.sh
jot -b '{"a":' -s '' $1 >obj$1.json
echo 0 >>obj$1.json
jot -b '}' -s '' $1 >>obj$1.json
```
```sh
#!/bin/sh
# deep_list.sh
jot -b '[' -s '' $1 >list$1.json
echo 0 >>list$1.json
jot -b ']' -s '' $1 >>list$1.json
```
And here's a run on my not fancy macOS laptop (EDIT 2024-04-03: re-ran with the bug fixed, didn't change overall results):
```sh
$ for x in 257 1024 $((1024 * 64)) $((1024 * 256)) $((1024 * 1024))
do
./deep_list.sh $x
./deep_obj.sh $x
echo \# $x
echo \#\# list
command time -l jq --depth 0 . list$x.json >/dev/null
echo \#\# obj
command time -l jq --depth 0 . obj$x.json >/dev/null
done
```
# 257
## list
0.01 real 0.00 user 0.00 sys
2101248 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
722 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
7 involuntary context switches
76583002 instructions retired
35534464 cycles elapsed
1290240 peak memory footprint
## obj
0.01 real 0.00 user 0.00 sys
2007040 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
701 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
14 involuntary context switches
77345793 instructions retired
35134742 cycles elapsed
1191936 peak memory footprint
# 1024
## list
0.01 real 0.00 user 0.00 sys
2322432 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
776 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
20 involuntary context switches
77963756 instructions retired
37622876 cycles elapsed
1273856 peak memory footprint
## obj
0.01 real 0.00 user 0.00 sys
2375680 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
789 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
12 involuntary context switches
79654892 instructions retired
36269047 cycles elapsed
1228800 peak memory footprint
# 65536
## list
0.03 real 0.02 user 0.01 sys
24293376 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
6139 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
37 involuntary context switches
199628394 instructions retired
114804163 cycles elapsed
23146496 peak memory footprint
## obj
0.05 real 0.03 user 0.01 sys
36016128 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
9003 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
32 involuntary context switches
289287462 instructions retired
161012325 cycles elapsed
32870400 peak memory footprint
# 262144
## list
time: command terminated abnormally
0.07 real 0.05 user 0.02 sys
86921216 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
21431 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
63 involuntary context switches
387970009 instructions retired
252783614 cycles elapsed
85774336 peak memory footprint
Segmentation fault: 11
## obj
time: command terminated abnormally
0.13 real 0.09 user 0.04 sys
138883072 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
35396 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
63 involuntary context switches
712225649 instructions retired
442663148 cycles elapsed
126955520 peak memory footprint
Segmentation fault: 11
# 1048576
## list
time: command terminated abnormally
0.26 real 0.17 user 0.08 sys
327397376 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
82700 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
82 involuntary context switches
1311119871 instructions retired
847893675 cycles elapsed
321007616 peak memory footprint
Segmentation fault: 11
## obj
time: command terminated abnormally
0.46 real 0.30 user 0.15 sys
514150400 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
130856 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
199 involuntary context switches
2446990455 instructions retired
1517820450 cycles elapsed
502243328 peak memory footprint
Segmentation fault: 11
Hey, I wanted to check in again. Would you like me to file correctness issues for these segfaults? Is there anything I can do to help with the review of this patch?
@mgree force-pushed a rebase on master that fixes the conflict in manual.yml
@itchyny @emanuele6 looks OK to merge? It fixes https://github.com/jqlang/jq/issues/2846, which is on the 1.8 milestone. The only thing left for me would be whether we should increase `DEFAULT_MAX_PARSING_DEPTH` while we're at it?
Rebased and fixed conflicts again
Hmm, I wonder if `--max-depth` might be a clearer arg name?
I used `--depth` by analogy to find's `-depth`, but clarity is good! LMK if I can do anything to help.
We need to touch up the JSON parser before doing a release anyway because of all the decNum-related now-disclosed vulnerabilities in the parser, so maybe we should hold off on merging a patch like this that may end up not being necessary
But if you think it's useful to have regardless to set a limit, I guess that is fine.
Isn't the (GNU) find option that does something like this also called `-max-depth` though? ^^
`-depth` is about the order of iteration, and I know that BSD has `-depth n`, a check that is true only if the current path is at depth n
You're right---it's `-maxdepth`. I have literally zero feelings about what to call this option.
Whatever changes happen in the parser, there should be a way to control how it behaves on potentially expensive inputs.