nom icon indicating copy to clipboard operation
nom copied to clipboard

[wishlist] Option to limit recursion depth

Open kandykoma opened this issue 4 years ago • 10 comments

Issue

Deep recursions trigger a stack overflow, that canot be handled.

Consequence

Cannot use recursive nom parser on untrusted input, as cannot protect against DoS.

Demo

A minimal repro case:

fn cp(i: &str) -> nom::IResult<&str, &str> {
    use nom::branch::alt;
    use nom::bytes::complete::tag;
    use nom::sequence::delimited;

    delimited(tag("("), alt((tag("x"), cp)), tag(")"))(i)
}

fn main() {
    let mut buffer = String::new();
    let stdin = std::io::stdin();
    let mut handle = stdin.lock();

    {
        use std::io::Read;
        handle.read_to_string(&mut buffer).expect("read failed");
    }
    println!("{:?}", cp(&buffer));
}

With input:

perl -e 'print("(" x 10000, "x", ")" x 10000, "\n")' | ./target/debug/main

reports:

thread 'main' has overflowed its stack
fatal runtime error: stack overflow
Aborted

Improvement Option

Intruduce a maximal depth limit, which when reached gracefuly terminates the parsing.

kandykoma avatar Mar 13 '20 01:03 kandykoma

Why don’t you just have something like cp which calls my_cp(0) and my_cp(n) which calls my_cp(n+1) in the alt?

This way you can throw an error when n is too big without any modification in nom

Keruspe avatar Mar 23 '20 16:03 Keruspe

Thanks @Keruspe, it works!

Two follow ups:

  • is there a better way to trigger an error?
  • is there a way to not only signal a matching failure, but trigger an abort of the parsing process?
fn cp(n: u32) -> impl Fn(&str) -> nom::IResult<&str, &str> {
    use nom::branch::alt;
    use nom::bytes::complete::tag;
    use nom::sequence::delimited;
    move |i| {
        if n > 10 {
            tag("never matching string")(i)
        } else {
            delimited(tag("("), alt((tag("x"), cp(n+1))), tag(")"))(i)
        }
    }
}

fn main() {
    let mut buffer = String::new();
    let stdin = std::io::stdin();
    let mut handle = stdin.lock();

    {
        use std::io::Read;
        handle.read_to_string(&mut buffer).expect("read failed");
    }
    println!("{:?}", cp(0)(&buffer));
}

kandykoma avatar Mar 23 '20 17:03 kandykoma

I'm not sure a custom input type would be the solution, as it would not handle having multiple parsers doing recursion (like hashes and arrays in JSON). And the suggestion from @Keruspe works only if the recursion is right in the same function, not multiple levels deeper.

I was experimenting with an ugly hack, using a static variable:

use nom::branch::alt;
use nom::bytes::complete::tag;
use nom::sequence::delimited;
use nom::{Parser, IResult, error::ParseError, error::ErrorKind};

fn cp(i: &str) -> nom::IResult<&str, &str> {
    delimited(tag("("), alt((tag("x"), rec(cp))), tag(")"))(i)
}

fn rec<I, O, E: ParseError<I>, F>(mut parser: F) -> impl FnMut(I) -> IResult<I, O, E>
    where F: Parser<I, O, E>
    {
        static mut REC: u32 = 0;
        move |input: I| {
            unsafe {
                println!("calling rec(), REC = {}", REC);

                REC += 1;
                if REC > 10 {
                    return Err(nom::Err::Failure(E::from_error_kind(input, ErrorKind::TooLarge)));
                }
            }
            let res = parser.parse(input);
            unsafe {
                REC -= 1;
            }
            res
        }

}

#[test]
fn recursion_limit() {
    let mut input: String = std::iter::repeat('(').take(10000).collect();
    input += "x";
    input.extend(std::iter::repeat(')').take(10000));
    println!("{:?}", cp(input.as_str()));
    println!("try again:");
    println!("{:?}", cp(input.as_str()));
}

I believe that could be a good approach if we can solve the following issues:

  • use a safer solution than the static mut
  • have a different counter for each parser that recurses (in JSON, hash and array would each have their own)
  • the recursing parser can be declared as a function: in the example I pass rec(cp) as argument, but if the recursion is several level deepers and done from multiple parsers, each calling point should use the same counter. The easiest would be to enerate a function that wraps it

I think that can only be done in a macro, but maybe there's a better way

Geal avatar Oct 11 '20 14:10 Geal

it would not handle having multiple parsers doing recursion

If you're referring to #1125, then I don't see why not? As long as the parsers share the same input instance, they have a piece of shared auxiliary data they can all look at to decide to stop the recursions.

kandykoma avatar Oct 11 '20 19:10 kandykoma

you might want to have different recursion limits depending on the parser, like some heavy top level objects have a depth of 3-4 max, but then lower objects get different limits

Geal avatar Oct 12 '20 07:10 Geal

Before I go on, I'd like to make it clear, that I'm not arguing that #1125 is and elegant, or a "good" solution. My original design goal was to keep the nom changes minimal, but still allow me to keep track of the recursion depth. Then @Keruspe showed me a better, non-intrusive solution :).

But, on the powerful-vs-useful spectrum[1] #1125 is on the powerful side. I can see why you would not want to use it, but you can implement arbitrarily complex algorithms with it. In the parser test I added a simple integer as aux data, but if you want you can add a pair of integers, one keeping count of the "heavy" object depth, and another counting the "light" object depths.

1: http://blog.higher-order.com/blog/2014/12/21/maximally-powerful/

kandykoma avatar Oct 12 '20 09:10 kandykoma

In order to avoid StackOverflow one might want to look at https://lib.rs/crates/stacker

homersimpsons avatar Oct 20 '20 17:10 homersimpsons

@homersimpsons Well, that turns the SO into an OOM kill, which may or may not be an improvement. When processing potentially hostile data though, I'd still prefer a limit on recursion depth and and explicit parsing failure I can feed to my IDS.

kandykoma avatar Oct 21 '20 06:10 kandykoma

I think that can only be done in a macro, but maybe there's a better way

Conditional compilation requires cfg-macro to define the data layout, so there is no way around. Even const-generics, which ideally can be optimised away in structures need at some point to conditionally set the const value.

I would not bet on the compiler to optimise away the counter in each function though, when there is no concept to set a const value to unused (in zig you have the value undefined for that during comptime evaluation).

matu3ba avatar Jan 14 '21 20:01 matu3ba

This would be great. It's tough to work around it right now, as it's also hard to thread state through the nom combinators. In our project we had to resort to a thread-local to keep track the recursion count.

arthurprs avatar Mar 17 '22 08:03 arthurprs