arcsecond icon indicating copy to clipboard operation
arcsecond copied to clipboard

Expression Parse Bomb

Open poteat opened this issue 3 years ago • 5 comments

The recommended code in the Cookbook for parsing arithmetic expressions introduces an exponential-complexity parse bomb, which I have found does actually become extremely relevant when using Arcsecond for parsing realistic expressions.

Demonstration code (converted to Typescript)

Suffice to say, the example input "9 + ((((((((((5)))))))))) - 4 * 4 / 3" takes twenty seconds on my machine. A contrived example, but again, I have seen unbounded parsing time for reasonable inputs as well.

I have not yet been unable to wrap my head around where the inefficiency is coming from, or how to structure the parser to avoid the bomb. I really enjoy using this flavor of parser, and would be disappointed at having to write a parser manually to avoid this.

If I find out how to fix this, I'll update here so that we might update the Cookbook. Thanks again Francis for putting this library together - I'm super excited about your rewrite in Typescript.

poteat avatar Aug 09 '21 01:08 poteat

@IMax153 I'm aware you're the original source of the example Cookbook code - I'll take a look myself in your upstream work, but did you ever find a solution for this?

poteat avatar Aug 09 '21 01:08 poteat

Update: It seems this is likely a fundamental issue with parser combinators that is resolved by using packrat parsing - a potential fix is quite straight-forward, inside of the Parser constructor:

this.p = memoize(p, (state) => JSON.stringify(state));

poteat avatar Aug 09 '21 02:08 poteat

@poteat - if I remember correctly, I did not guard well against infinite recursion, which is likely the problem here.

IMax153 avatar Aug 09 '21 02:08 IMax153

@IMax153 Well, that may be true, although I never ran into any infinite recursion issues with your example code. I am confident that your code is very correct actually, for this reason:

I implemented packrat parsing on our downstream fork, and it leads to sizeable performance improvements, and makes parse time linear with respect to input size.

https://github.com/Volley-Inc/arcsecond/commit/73d60d138ee846a15c4ba4d7fdec79e5b646e812

We see orders-of-magnitude improvement actually - it would be really cool if a similar fix was introduced in the base package. I'll wait for a maintainer to vet the idea though before I put a PR together.

poteat avatar Aug 09 '21 04:08 poteat

Ahh this is really interesting @poteat! Having just read a bit about packrat parsing, the optimisation seems like a definite win. Took a quick look at memoize-clear - it's only about 40 lines and not doing anything wildly complicated. In the interest of keeping arcsecond dependency free, I think it would be good to have our own implementation.

francisrstokes avatar Aug 10 '21 08:08 francisrstokes