Performance problem

Open russellw opened this issue 9 years ago • 9 comments

There seems to be a performance problem, though I'm not sure whether it's intrinsic or something I'm doing wrong:

http://stackoverflow.com/questions/36572997/any-way-to-speed-up-instaparse

russellw avatar Apr 12 '16 14:04 russellw

There are a bunch of performance notes here: https://github.com/Engelberg/instaparse/blob/master/docs/Performance.md

Usually an instaparse parser performs linearly with respect to the size of the input up to some point at which it exhausts memory and starts thrashing the garbage collector. Usually this point where it breaks down is when the input gets somewhere between 20k-200k, depending on the complexity of the grammar. (If the parser performance isn't linear on small inputs, then the culprit is usually ambiguity in the grammar.)

So a 700k file is really pushing the limits of what instaparse is good for. The reason is that instaparse has really robust backtracking, so it has to consider the possibility that the very last character of the file may force it to reinterpret the way that the entire file is matched against the grammar, conceivably backtracking all the way back to the beginning if necessary. This means maintaining a history of every decision point made for the entire 700k of parsing, which is quite a bit to track.

If the 700k file is comprised of a bunch of individual records, and it is easy to identify the boundaries, the best thing to do is to chop them up into individual records and pass these smaller strings to your instaparse parser.
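
A minimal sketch of that chopping strategy, assuming one record per line; the grammar and names here are placeholders, not from the original question:

(require '[instaparse.core :as insta]
         '[clojure.string :as str])

;; Hypothetical per-record grammar; substitute your own.
(def record-parser
  (insta/parser
    "record = field (<','> field)*
     field = #'[^,]+'"))

;; Split the big input into small strings first, then parse each one.
(defn parse-records [big-string]
  (->> (str/split-lines big-string)
       (remove str/blank?)
       (map record-parser)))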

You can also try to use the :optimize :memory flag (see the Performance doc for more info), which attempts to automate the above strategy, throwing away backtracking history each time a chunk is successfully parsed. That strategy doesn't work on all grammars, but I think with your grammar, it probably would. It's certainly worth a try.
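
For reference, a sketch of how the flag is passed at parse time (my-parser and big-input-string are placeholders):

(require '[instaparse.core :as insta])

;; The parser definition is unchanged; the flag goes on the parse call.
(insta/parse my-parser big-input-string :optimize :memory)

;; Parsers are also callable directly, so this is equivalent:
(my-parser big-input-string :optimize :memory)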

If you get a chance to try the :optimize :memory flag, I'd be interested to know whether it helped. When I first released instaparse, I was mainly thinking in terms of parsing small-ish strings and source-code-sized files (which tend to be under 20k). So it is useful for me to get good test samples of large files in the wild, so I can identify and implement strategies which allow instaparse to perform well for those sizes. :optimize :memory was my first foray in that direction.

Engelberg avatar Apr 13 '16 03:04 Engelberg

BTW, here's a simple example illustrating my point about backtracking. Imagine the following grammar:

S = Option1 | Option2
Option1 = 'a'+
Option2 = 'a'+ 'b'

If you imagine matching this grammar against a 700k string of all a's up until the final 'b', instaparse would first try matching using Option1, and it would hum along merrily until it hit the final 'b', at which point it has to backtrack all the way to the beginning and try Option2.
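
A runnable version of that grammar, with a small input standing in for the 700k string (output abbreviated):

(require '[instaparse.core :as insta])

(def ambiguous
  (insta/parser
    "S = Option1 | Option2
     Option1 = 'a'+
     Option2 = 'a'+ 'b'"))

;; A run of a's ending in 'b' defeats Option1 only at the very last character,
;; forcing the parse to be reinterpreted from the beginning via Option2.
(ambiguous (str (apply str (repeat 10 "a")) "b"))
;; => [:S [:Option2 "a" "a" ... "a" "b"]]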

This is why most parsing engines require you to write an LL(k) grammar, which can determine with certainty what rule to use by looking ahead at most k characters.

Beyond :optimize :memory, one of my ideals would be to allow the programmer to say "I know this is an LL(1) grammar" or "I know this can be parsed correctly with a packrat strategy", and use an even more efficient underlying strategy. This would allow someone to use the same convenient grammar description and toggle instantly between a bunch of different engines to see what works, with the current engine representing the most flexible parsing strategy.

Engelberg avatar Apr 13 '16 03:04 Engelberg

Ah! okay, I'll switch to handwritten parsers then - I have some files up to hundreds of megabytes to deal with. But can you put a note maybe near the start of the readme that instaparse is only intended for small files?


russellw avatar Apr 13 '16 11:04 russellw

I'm pretty sure I've hit the limit. One more rule and running the parser thrashes my machine possibly forever, never returning. Using :optimize :memory fixed it. However, because this option is only available for insta/parse (not insta/parses), I now lose the ability to detect multiple possibilities as soon as they happen. I think the solution I'll end up coming to is what you recommend - doing a high/root-level parse at the beginning and handing off various bits to more specific parsers. Some of these more specific parsers will be used multiple times.
I've embarked on that root-level parser strategy a bit - the problem I'm finding is that it is difficult for me to parse through 'everything except "END_ROUTINE"'. Is it advisable to break up the original source file using hand-written code that just finds the locations of particular text markers, say using index-of and subs?

I used index-of and subs and was able to break it up. Smaller parsers will be easier to maintain.
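
A sketch of that index-of/subs approach, assuming BEGIN_ROUTINE/END_ROUTINE markers; the function name is just illustrative, not from the thread:

(require '[clojure.string :as str])

;; Returns the text between every BEGIN_ROUTINE .. END_ROUTINE pair, in order.
(defn all-routines [s]
  (loop [from 0, acc []]
    (if-let [begin (str/index-of s "BEGIN_ROUTINE" from)]
      (let [body-start (+ begin (count "BEGIN_ROUTINE"))]
        (if-let [end (str/index-of s "END_ROUTINE" body-start)]
          (recur (+ end (count "END_ROUTINE"))
                 (conj acc (subs s body-start end)))
          acc))
      acc)))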

chrismurrph avatar Aug 28 '16 00:08 chrismurrph

My instinct would be to use re-seq with an appropriate regex to break it apart. In any case, this strategy of making a high level parse and then handing chunks off to more specific parsers is something I'd like to make easier to do in instaparse. If you have any tips that you've learned from going through it, let me know.
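
A sketch of the re-seq idea; source-text and routine-parser are placeholders, and the pattern relies on the non-greedy trick discussed just below:

;; Each non-overlapping BEGIN_ROUTINE .. END_ROUTINE chunk comes back as a string.
(def routines
  (re-seq #"BEGIN_ROUTINE[\s\S]*?END_ROUTINE" source-text))

;; Hand each chunk off to a more specific instaparse parser.
(map routine-parser routines)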

Engelberg avatar Aug 31 '16 09:08 Engelberg

I looked at all the regex functions and none of them tell you the index in the String. I need indexes because I simply want to find 'BEGIN_BLOCK' and 'END_BLOCK' - and then find the whole of the middle in between. Perhaps the real problem is that I'm not good enough at regular expressions to match ['BEGIN_BLOCK' - anything but 'END_BLOCK' - 'END_BLOCK'].

chrismurrph avatar Sep 01 '16 02:09 chrismurrph

#"BEGIN_BLOCK[\s\S]*?END_BLOCK" would do the trick.

The key is the *? combination which is non-greedy, meaning it stops at the first END_BLOCK it gets to, rather than the last.
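
A quick illustration of the difference, using lowercase markers for brevity:

(re-find #"begin[\s\S]*end"  "begin one end begin two end")
;; => "begin one end begin two end"   ; greedy: runs to the last end

(re-find #"begin[\s\S]*?end" "begin one end begin two end")
;; => "begin one end"                 ; non-greedy: stops at the first end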

Also, if you want indexes, you can call Clojure's re-matcher, which returns an instance of the Matcher class; then use re-find to advance the matcher and call the .start and .end methods on it to get the indexes.

=> (def m (re-matcher #"begin[\s\S]*?end" "begin this is my text end begin and more text end"))
#'interleave.core/m
=> (re-find m)
"begin this is my text end"
=> (.start m)
0
=> (.end m)
25
=> (re-find m)
"begin and more text end"
=> (.start m)
26
=> (.end m)
49
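
Building on that, a sketch that loops the matcher to collect every [start end] pair, which can then drive subs-based splitting (match-indexes is an illustrative name):

(defn match-indexes [re s]
  (let [m (re-matcher re s)]
    (loop [acc []]
      (if (re-find m)
        ;; re-find advances the matcher; .start/.end report the latest match.
        (recur (conj acc [(.start m) (.end m)]))
        acc))))

(match-indexes #"begin[\s\S]*?end" "begin this is my text end begin and more text end")
;; => [[0 25] [26 49]]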


Engelberg avatar Sep 01 '16 07:09 Engelberg

One thing you might be interested in (in case you don't already know) is the concept of Cuts from http://www.lihaoyi.com/fastparse/. It seems to me to be like :optimize :memory, but not right from the outset - rather from wherever the grammar author decides to draw a line under the parsing done so far.

chrismurrph avatar Sep 02 '16 04:09 chrismurrph