jq icon indicating copy to clipboard operation
jq copied to clipboard

Make maximum depth configurable via CLI option

Open mgree opened this issue 1 year ago • 18 comments

Addresses #2846. (I didn't see any contributor guidelines/linting info... please let me know if you'd like me to change anything!)

This PR adds state to two places to track maximum parser depth: in jq_state and in jv_parser.

It might seem like jv_parser is enough on its own, but many other parts of the code parser values holding only a jq_state. It's not clear to me that these aren't mostly/entirely raw parses (which wouldn't need to track depth), but it seemed easiest to track the state thoroughly.

There are a few helper functions at the end of jv_parse.c that have no handle at all---neither a jq_state nor a jv_parser. I had these just use the DEFAULT_MAX_PARSING_DEPTH. They are mostly for fuzzing, it seems, but jv_parse gets called in main during option processing. It wouldn't be hard to have this use the current parser_maxdepth.

mgree avatar Mar 11 '24 15:03 mgree

Implementation looks good to me. There was some discussion in #2846 about what the default should be, unbounded or 256 (or something else?).

wader avatar Mar 12 '24 13:03 wader

Should add CLI documentation to manual.yml

wader avatar Mar 12 '24 13:03 wader

Should add CLI documentation to manual.yml

In 3b0d520.

Implementation looks good to me. There was some discussion in #2846 about what the default should be, unbounded or 256 (or something else?).

Thanks!

I would recommend leaving the default maximum parsing depth at 256, as changing the default depth to unbounded opens up a DoS vector. It might make sense to recommend the --depth argument in the error message... should I add that in to the message?

mgree avatar Mar 12 '24 14:03 mgree

Should add CLI documentation to manual.yml

In 3b0d520.

👍

Implementation looks good to me. There was some discussion in #2846 about what the default should be, unbounded or 256 (or something else?).

Thanks!

I would recommend leaving the default maximum parsing depth at 256, as changing the default depth to unbounded opens up a DoS vector.

Ok, let's see what @nicowilliams thinks

It might make sense to recommend the --depth argument in the error message... should I add that in to the message?

Yeap sounds like a good idea

wader avatar Mar 12 '24 14:03 wader

You probably want to do make jq.1.prebuilt tests/man.test to regenerate the man page and man tests

wader avatar Mar 12 '24 15:03 wader

BTW https://greenberg.science/ layout is 🥇😄

wader avatar Mar 12 '24 15:03 wader

Do y'all have a sense of a timeline on this merging (and the next release)? (Not asking to rush you, just for planning on our end.)

mgree avatar Mar 20 '24 16:03 mgree

@mgree Hey, i pinged the other maintainers on discord yesterday about this PR and some other ones so hopefully they review it soon. As it adds a new cli argument it's probably good that at least 2 others approve it.

wader avatar Mar 20 '24 16:03 wader

Looking at the issue @mgree ran into in https://github.com/MaterializeInc/materialize/pull/25972 with the 256 depth limit i would say it's a good argument for increasing it or just make it unbounded by default.

wader avatar Mar 28 '24 19:03 wader

For Materialize's use case, we would want no depth limit : we're looking at non-adversarial inputs that we know to be deeply nested (ASTs with a particularly nested JSON representation). But we're happy to use --depth 0 on all of our invocations---NBD.

A quick experiment shows it's probably pretty safe to have no limit by default from a time performance perspective, but maybe not a memory or DoS perspective. Objects nested 1024 * 1024 deep make a file of 6MB but jq allocates 0.5GB before segfaulting; lists nested 1024 * 1024 deep make a file of 2MB but jq allocates 0.3GB before segfaulting. Scaling seems to be linear: if make an object of depth 1024 * 1024 * 4, the 25MB file leads jq to allocate just under 2GB, segfaulting after 1.7s.

At a minimum, I'd guess these segfaults should be caught and turned into proper aborts. At worst, maybe they're bugs? (Maybe with printing? Because a lot of work seems to happen before things crash...)

It's might be kindest to users to have a maximum depth by default---that way users don't discover the hard way that adversarial files can use lots of memory and/or crash jq. Not that I would use jq on nasty files willy nilly, but... people do all kinds of things. 😁

Here's a script for making deep JSON objects and lists (EDIT 2024-04-03: noticed a bug in deep_list.sh 🙄):

#!/bin/sh
# deep_obj.sh

jot -b '{"a":' -s '' $1 >obj$1.json
echo 0 >>obj$1.json
jot -b '}' -s '' $1 >>obj$1.json

#!/bin/sh
# deep_list.sh

jot -b '[' -s '' $1 >list$1.json
echo 0 >>list$1.json
jot -b ']' -s '' $1 >>list$1.json

And here's a run on my not fancy macOS laptop (EDIT 2024-04-03: re-ran with the bug fixed, didn't change overall results):

$ for x in 257 1024 $((1024 * 64)) $((1024 * 256)) $((1024 * 1024))
do
  ./deep_list.sh $x
  ./deep_obj.sh $x
  echo \# $x
  echo \#\# list
  command time -l jq --depth 0 . list$x.json >/dev/null
  echo \#\# obj
  command time -l jq --depth 0 . obj$x.json >/dev/null
done
# 257
## list
        0.01 real         0.00 user         0.00 sys
             2101248  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                 722  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                   7  involuntary context switches
            76583002  instructions retired
            35534464  cycles elapsed
             1290240  peak memory footprint
## obj
        0.01 real         0.00 user         0.00 sys
             2007040  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                 701  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  14  involuntary context switches
            77345793  instructions retired
            35134742  cycles elapsed
             1191936  peak memory footprint
# 1024
## list
        0.01 real         0.00 user         0.00 sys
             2322432  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                 776  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  20  involuntary context switches
            77963756  instructions retired
            37622876  cycles elapsed
             1273856  peak memory footprint
## obj
        0.01 real         0.00 user         0.00 sys
             2375680  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                 789  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  12  involuntary context switches
            79654892  instructions retired
            36269047  cycles elapsed
             1228800  peak memory footprint
# 65536
## list
        0.03 real         0.02 user         0.01 sys
            24293376  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                6139  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  37  involuntary context switches
           199628394  instructions retired
           114804163  cycles elapsed
            23146496  peak memory footprint
## obj
        0.05 real         0.03 user         0.01 sys
            36016128  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                9003  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  32  involuntary context switches
           289287462  instructions retired
           161012325  cycles elapsed
            32870400  peak memory footprint
# 262144
## list
time: command terminated abnormally
        0.07 real         0.05 user         0.02 sys
            86921216  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
               21431  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  63  involuntary context switches
           387970009  instructions retired
           252783614  cycles elapsed
            85774336  peak memory footprint
Segmentation fault: 11
## obj
time: command terminated abnormally
        0.13 real         0.09 user         0.04 sys
           138883072  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
               35396  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  63  involuntary context switches
           712225649  instructions retired
           442663148  cycles elapsed
           126955520  peak memory footprint
Segmentation fault: 11
# 1048576
## list
time: command terminated abnormally
        0.26 real         0.17 user         0.08 sys
           327397376  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
               82700  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  82  involuntary context switches
          1311119871  instructions retired
           847893675  cycles elapsed
           321007616  peak memory footprint
Segmentation fault: 11
## obj
time: command terminated abnormally
        0.46 real         0.30 user         0.15 sys
           514150400  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
              130856  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                 199  involuntary context switches
          2446990455  instructions retired
          1517820450  cycles elapsed
           502243328  peak memory footprint
Segmentation fault: 11

mgree avatar Mar 29 '24 18:03 mgree

Hey, I wanted to check in again. Would you like me to file correctness issues for these segfaults? Is there anything I can do to help with the review of this patch?

mgree avatar May 01 '24 18:05 mgree

@mgree force pushed rebase on master that fixes conflict in manual.yml

@itchyny @emanuele6 looks ok to merge? fixes https://github.com/jqlang/jq/issues/2846 that is on the 1.8 milestone. The only thing left for me would be if we should increase DEFAULT_MAX_PARSING_DEPTH while we're at it?

wader avatar Aug 24 '24 15:08 wader

Rebased and fix conflicts again

wader avatar Nov 25 '24 13:11 wader

Hmm i wonder if --max-depth might be a clearer arg name?

wader avatar Nov 25 '24 13:11 wader

I used --depth by analogy to find's -depth, but clarity is good! LMK if I can do anything to help.

mgree avatar Nov 25 '24 14:11 mgree

We need to touch up the JSON parser before doing a release anyway because of all the decNum related now-disclosed vulernabilities in the parser, so maybe we should hold off on merging a patch like this that may end up not being necessary

emanuele6 avatar Nov 25 '24 14:11 emanuele6

But if you think it's useful to have regardless to set a limit, I guess that is fine.


Isn't the (GNU) find option that does something like this also called -max-depth though? ^^ -depth is about the order of iteration, and then I know that BSD has -depth n that is a check that is true only if the current path is at depth n

emanuele6 avatar Nov 25 '24 14:11 emanuele6

You're right---it's -maxdepth. I have literally zero feelings about what to call this option.

Whatever changes happen in the parser, there should be a way to control how it behaves on potentially expensive inputs.

mgree avatar Nov 28 '24 15:11 mgree