Deeply nested binary arithmetic segfaults under native parser
This is definitely not realistic code (though it is still used in projects like this), but I think it might point to an underlying stack overflow or something like that (I have no experience debugging Rust applications, so unfortunately I can't provide any more insight than this).
Here is the example payload:
(
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T' +
'X' + 'Y' + 'Z' + 'Q' + 'T'
)
It works perfectly with the Python parser, but the native one crashes:
$ LIBCST_PARSER_TYPE=native python -m libcst.tool print t.py 444ms
[1] 11295 segmentation fault (core dumped) LIBCST_PARSER_TYPE=native python -m libcst.tool print t.py
Hmmm, I can't reproduce this with the release build. The debug build does throw a stack overflow, but that's kinda expected.
Alright I can repro with 5000 + operators :) We need this to not crash, but it'd be nice to handle arbitrarily nested expressions.
(BTW the original parser works fine even with this many operators, but printing the tree fails with a RecursionError, so I'm not sure we can reasonably support huge expressions like this at the moment)
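In case it helps with reproducing, here is a quick throwaway generator (just a sketch, not anything in the repo) that writes a payload along the lines of the one above, with roughly 5000 + operators:

use std::fs;

fn main() -> std::io::Result<()> {
    // Chain ~5000 `+` operators between string literals, wrapped in parens
    // like the original payload, and write the result to t.py.
    let expr = vec!["'X'"; 5001].join(" + ");
    fs::write("t.py", format!("(\n{}\n)\n", expr))
}

Feeding the generated t.py to the native parser should reproduce the crash described above.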
Ah, I can also confirm. The parser itself works fine (or gives a stack overflow in the debug build), but the problem seems to be in libcst.tool print, which just segfaults (under LIBCST_PARSER_TYPE=native):
$ ./target/release/parse < t.py 3ms
(
[...]
)
$ ./target/debug/parse < t.py 15ms
thread 'main' has overflowed its stack
fatal runtime error: stack overflow
[1] 18897 IOT instruction (core dumped) ./target/debug/parse < t.py
$ python -m libcst.tool print t.py 0ms
[1] 18999 segmentation fault (core dumped) python -m libcst.tool print t.py
Looking at the core file, the bottleneck is not the parsing itself (strictly speaking) but the cloning of the excessively nested data structure, which produces ~3500 stack frames. There are two things we can do:
- Reduce the memory footprint of the Rust-native CST objects. There is some low-hanging fruit here that the Rust linter even warns about: this enum, for example, is huge but doesn't need to be, and it appears on every stack frame of a BinaryOp's clone function (see the sketch after this list).
- Somehow limit the depth of the nesting in the parser itself. I'm not 100% sure how to do this, but we could keep counters that get incremented on recursion.
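To illustrate the first point, here is a minimal sketch of the kind of fix the linter (clippy's large_enum_variant lint) suggests, using made-up types rather than the actual LibCST nodes: boxing the oversized variant keeps the enum itself small, so every stack frame that moves or clones one of these values only copies a pointer instead of the whole payload.

#[derive(Clone)]
struct BigPayload {
    // Stand-in for a node that carries lots of whitespace/token data.
    data: [u64; 32], // 256 bytes
}

// Before: the enum is as large as its biggest variant, so even Small values
// occupy ~256 bytes on the stack and in every clone.
#[derive(Clone)]
enum ExprLarge {
    Small(u8),
    Big(BigPayload),
}

// After: the big payload lives behind a Box, and the enum shrinks to
// roughly pointer size plus a discriminant.
#[derive(Clone)]
enum ExprBoxed {
    Small(u8),
    Big(Box<BigPayload>),
}

fn main() {
    println!("before: {} bytes", std::mem::size_of::<ExprLarge>());
    println!("after:  {} bytes", std::mem::size_of::<ExprBoxed>());
}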
> Somehow limit the depth of the nesting in the parser itself. I'm not 100% sure how to do this, but we could keep counters that get incremented on recursion.
In CPython, that is exactly what we do: we hold an integer in the parser state, increment it before each recursive call (checking it against the limit), and decrement it afterwards. https://github.com/python/cpython/blob/ee60550e9ba3ab94ca43a890cf4313b63ffa1a81/Tools/peg_generator/pegen/c_generator.py#L368-L373
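As a rough sketch of what that could look like on the Rust side (hypothetical code, not LibCST's actual parser internals): the parser state carries a depth counter that is incremented and checked on entry to each recursive rule and decremented on exit, so a pathologically nested input becomes an ordinary parse error instead of a stack overflow.

// Arbitrary limit, purely for illustration.
const MAX_DEPTH: usize = 2000;

#[derive(Debug)]
enum ParseError {
    TooDeeplyNested,
}

struct ParserState {
    depth: usize,
}

impl ParserState {
    // Wrap a recursive rule invocation with the increment/check/decrement dance.
    fn with_depth_guard<T>(
        &mut self,
        rule: impl FnOnce(&mut Self) -> Result<T, ParseError>,
    ) -> Result<T, ParseError> {
        if self.depth >= MAX_DEPTH {
            return Err(ParseError::TooDeeplyNested);
        }
        self.depth += 1;
        let result = rule(self);
        self.depth -= 1;
        result
    }
}

fn main() {
    // A fake "rule" that recurses forever; the guard stops it with a clean error.
    fn recurse(state: &mut ParserState) -> Result<(), ParseError> {
        state.with_depth_guard(|s| recurse(s))
    }

    let mut state = ParserState { depth: 0 };
    println!("{:?}", recurse(&mut state)); // Err(TooDeeplyNested)
}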
#632 should help with any real-world use cases. There's https://github.com/kevinmehall/rust-peg/issues/282 open to address this in the parser framework. I'll see if I can hack something up in the meantime.
Amazing, thank you @zsol!