
Info on the fastest Earley impl

Open · pczarn opened this issue 4 years ago · 16 comments

I made an Earley engine that is much faster in benchmarks than YAEP: gearley. Here are the results for 88248 tokens (10K lines of C).

gearley: bench_parse_c ... bench:  53,125,029 ns/iter (+/- 3,652,318)
gearley with CompactBocage: bench_parse_c ... bench:  87,122,025 ns/iter (+/- 2,744,845)
YAEP: parse time 0.38 (seconds)

Memory use

gearley: all:12669488b forest:12664888b
gearley with CompactBocage: all:2.217MB forest:2.212MB
YAEP: memory=12.724MB

So the Earley sets take up only 4600 bytes (12669488 - 12664888): one byte per ~19 tokens. Gearley with the compact forest uses only 17.4% of the memory YAEP does, but the compact forest slows the benchmark down by 34 ms. With the normal forest, the parse takes 14% of the time of a parse with YAEP, and the same amount of memory.

The compact forest takes 25 bytes per token; gearley with the default forest takes 602 ns per token.
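As a sanity check, these per-token figures follow directly from the numbers above (plain arithmetic; the constants are just the results reported):

```rust
fn main() {
    let tokens = 88_248.0_f64;
    // Earley sets = total minus forest: 12669488 - 12664888 = 4600 bytes.
    let earley_set_bytes = 12_669_488.0 - 12_664_888.0;
    let compact_total_bytes = 2.217e6; // "all: 2.217MB" with the compact forest
    let default_parse_ns = 53_125_029.0; // ns/iter with the default forest

    println!("tokens per Earley-set byte: {:.1}", tokens / earley_set_bytes); // ~19.2
    println!("compact bytes per token:    {:.1}", compact_total_bytes / tokens); // ~25.1
    println!("default ns per token:       {:.0}", default_parse_ns / tokens); // ~602
}
```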

I invite you to make your own benchmarks.

pczarn avatar Dec 01 '19 12:12 pczarn

Thank you for letting me know about your implementation. I'll play with it when I have some spare time. Right now I have a couple of questions:

  • which algorithms does your implementation use (the practical Earley parser, Joop Leo's approach, maybe something else)?
  • how do you approach error recovery in your parser?

vnmakarov avatar Dec 01 '19 23:12 vnmakarov

Sorry, I cannot build your parser because the Rust compiler reports errors in your sources.

Also, according to my benchmarking, YAEP is 2.5-6 times slower than YACC on parsing big C files (60K and 600K C lines). That means your parser would be faster than YACC, which makes me suspect that something is wrong with your measurements or your measurement approach.

vnmakarov avatar Dec 01 '19 23:12 vnmakarov

My algorithm has three passes, like Jeffrey Kegler's. I eliminate nullable symbols, and I added some highly effective optimizations based on a binary heap; beyond that, it is similar to Jeffrey Kegler's. I am not caching Earley sets the way YAEP does, and my algorithm does not have Leo's optimization; if I implement it, it will be by rewriting right-recursive rules into left-recursive ones. I haven't implemented error recovery in my parser yet.

You can build my parser using the nightly compiler; it probably does not work on stable. You can switch with rustup default nightly.

I believe my results are correct, because 602 ns per token is not implausibly low.

pczarn avatar Dec 02 '19 08:12 pczarn

Do the YAEP/YACC measurements include defining and processing the grammar, e.g. building the LALR(1) tables? My benchmarks include processing of the grammar.

pczarn avatar Dec 02 '19 19:12 pczarn

My algorithm has three passes, like Jeffrey Kegler's. I eliminate nullable symbols, and I added some highly effective optimizations based on a binary heap; beyond that, it is similar to Jeffrey Kegler's. I am not caching Earley sets the way YAEP does, and my algorithm does not have Leo's optimization; if I implement it, it will be by rewriting right-recursive rules into left-recursive ones. I haven't implemented error recovery in my parser yet.

Thank you for your parser description.

You can build my parser using the nightly compiler; it probably does not work on stable. You can switch with rustup default nightly.

OK.

I believe my results are correct, because 602 ns per token is not implausibly low.

I just ran the same test you use, on an i7-8700K under Fedora Core 28:

test bench_parse_c ... bench:  34,597,306 ns/iter (+/- 131,300) # your parser
./a.out 1 0 < /home/vmakarov/GITHUB/gearley/benches/part_gcc_test.i # YAEP with static lookahead
scanner time 0.0040, memory=2508.0kB
parse time 0.0314, memory=12728.0kB

So the speed of the parsers is pretty much the same (31 ms vs. 35 ms).

vnmakarov avatar Dec 02 '19 19:12 vnmakarov

Do the YAEP/YACC measurements include defining and processing the grammar, e.g. building the LALR(1) tables? My benchmarks include processing of the grammar.

I don't build any LA(LR)(k) tables; the grammar is taken as a yacc-like file. The measurement I've just posted does not include grammar processing (yaep_parse_grammar).

Actually, after running the test 10 times, the minimal time I get is:

time ./a.out 1 0 < /home/vmakarov/GITHUB/gearley/benches/part_gcc_test.i
scanner time 0.0037, memory=2508.0kB
parse time 0.0274, memory=12728.0kB

real    0m0.034s
user    0m0.027s
sys     0m0.006s

In any case, the speed can vary a bit depending on what we take into consideration, but it is not the 7x difference you posted.

vnmakarov avatar Dec 02 '19 19:12 vnmakarov

Oh okay. The similar times and memory usage make a lot of sense. I will try to figure out why YAEP was so slow in my runs. To optimize, I would have to cut down on the number of completions; that's the only way. My code spends about 45% of its time doing completions.

I also have an idea of writing a parser in assembly. I wonder if SIMD could help with performance.

Could you try my benchmark without grammar processing, so that it is an apples-to-apples comparison? You can do that by moving the line let cfg = InternalGrammar::from_grammar(&external); up, out of bench.iter.
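Roughly like this; the stub types below are placeholders of mine, and only InternalGrammar::from_grammar and bench.iter come from the actual benchmark:

```rust
#![feature(test)] // nightly-only, matching the nightly requirement above
extern crate test;
use test::Bencher;

// Stub types standing in for gearley's real ones; only the shape of the
// benchmark matters here.
struct Grammar;
struct InternalGrammar;
impl InternalGrammar {
    fn from_grammar(_g: &Grammar) -> InternalGrammar {
        InternalGrammar
    }
}
fn parse_tokens(_cfg: &InternalGrammar) {
    // recognition + forest building would happen here
}

#[bench]
fn bench_parse_c(bench: &mut Bencher) {
    let external = Grammar;
    // Hoisted out of `bench.iter`, so grammar processing is not timed:
    let cfg = InternalGrammar::from_grammar(&external);
    bench.iter(|| parse_tokens(&cfg));
}
```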

pczarn avatar Dec 02 '19 21:12 pczarn

I didn't mean to say you use LA(LR)(k); I meant that YACC does.

If I incorporated LL(1) into my parser, would you consider it faster, or would you consider that cheating?

I guess I might cache completions, because I detected 1000 unique completions for these 10K lines.
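A sketch of what such a cache could look like; the key and value types are hypothetical, the only given being that completions repeat heavily:

```rust
use std::collections::HashMap;

// Hypothetical representation: a completion is keyed by whatever identifies
// it uniquely (say, the completed symbol plus the dotted rules it advances),
// and the cached value is the resulting advanced items.
type CompletionKey = Vec<u32>;
type CompletedItems = Vec<u32>;

#[derive(Default)]
struct CompletionCache {
    map: HashMap<CompletionKey, CompletedItems>,
}

impl CompletionCache {
    // With ~1000 unique completions over 10K lines, most calls hit the
    // cache and skip `compute` entirely.
    fn complete<F: FnOnce() -> CompletedItems>(
        &mut self,
        key: CompletionKey,
        compute: F,
    ) -> &CompletedItems {
        self.map.entry(key).or_insert_with(compute)
    }
}
```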

pczarn avatar Dec 03 '19 20:12 pczarn

I didn't mean to say you use LA(LR)(k); I meant that YACC does.

Sorry, I probably did not understand you (your messages seem too vague to me). YAEP uses YACC to generate a parser for the YAEP grammar description. This grammar parser's work is not included in parse time 0.0274, memory=12728.0kB, but it is included in the real 0m0.034s I reported above.

You can see how the YAEP ./a.out mentioned above is built, and how the time is measured, in compare_parser.tst.in.

If I incorporated LL(1) into my parser, would you consider it faster, or would you consider that cheating?

I don't understand what you are trying to say. I have no intention of competing with you. Let users decide what to use: gearley, MARPA, or YAEP. The more implementations, the better. I will probably add your implementation to the comparison script once it is stabilized and released, as I did for MARPA.

I guess I might cache completions, because I detected 1000 unique completions for these 10K lines.

Good, that is what my parser does. It can speed up your implementation too.

vnmakarov avatar Dec 03 '19 22:12 vnmakarov

My attempts at optimization have failed so far. I would like to optimize by caching all the computations for an Earley set: I will hash all Earley sets individually and compare the hashes of the sets in use. Are you doing this in YAEP?
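A sketch of the hashing idea, with a hypothetical item representation (a real implementation would also compare the sets themselves on a hash match, to rule out collisions):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Hypothetical item: (dotted-rule ID, origin). A set's signature is a hash
// over its items, so identical sets can be recognized without comparing
// them element by element.
fn set_signature(items: &[(u32, u32)]) -> u64 {
    let mut h = DefaultHasher::new();
    items.hash(&mut h);
    h.finish()
}

// signature -> index of the set whose computations we can reuse
type SetCache = HashMap<u64, usize>;
```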

pczarn avatar Feb 01 '20 13:02 pczarn

My attempts at optimization have failed so far. I would like to optimize by caching all the computations for an Earley set: I will hash all Earley sets individually and compare the hashes of the sets in use. Are you doing this in YAEP?

I wrote YAEP about 20 years ago, so my current knowledge may not be accurate, and I don't remember the terminology well either.

I tried to cache full LR-sets, but it did not work. The problem is the distances: they vary a lot.

So I split LR-sets into LR-situations and, separately, a vector of distances. LR-sets of situations without distances are repeated many times, so in the working set I keep only a reference to the LR-set and a vector of distances for the corresponding LR-situations. Moreover, I keep only a minimal part of the LR-situations, as the other LR-situations can be built from them. Zero distances for the built LR-situations can be omitted too.
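Schematically, the layout is something like this (a sketch only; the names are not the ones used in YAEP):

```rust
use std::rc::Rc;

// The deduplicated part: kernel LR-situations without distances. Identical
// cores recur many times during a parse, so each is stored once and shared.
struct CoreSet {
    kernel_situations: Vec<u32>, // minimal part; the rest can be built from these
}

// One entry of the working list: a reference to a shared core plus the
// distance vector for its kernel situations (distances for the built,
// zero-distance situations are omitted).
struct WorkingSet {
    core: Rc<CoreSet>,
    distances: Vec<u32>, // distances[i] corresponds to core.kernel_situations[i]
}
```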

For detailed info, you probably need to look at the source code. The code for building the working set should be pretty straightforward; the code for building parse trees is more complicated.

I hope this info can help you.

vnmakarov avatar Feb 02 '20 16:02 vnmakarov

Thank you.

I'm currently implementing this new optimization with set hashing and a Patricia trie.

I found another small optimization: Earley items can be shrunk by a third by tying the ordering of SPPF links to the ordering of item origins (I sort items by origin in two places). Then you can remove the origins and use the SPPF links instead. Earley items then take 8 bytes. It remains to be seen whether this yields a performance and/or memory improvement.
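In code, the shrunken item could look like this (a sketch; the field names are hypothetical):

```rust
// A plausible layout for the 8-byte item described above. The origin field
// is gone; because SPPF links are kept in the same order as item origins,
// the link is enough to recover the origin.
#[derive(Clone, Copy)]
struct Item {
    dotted_rule: u32, // rule + dot position, as a single ID
    sppf_link: u32,   // index into the forest; its ordering mirrors the origins
}

const _: () = assert!(std::mem::size_of::<Item>() == 8);
```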

pczarn avatar Mar 14 '20 18:03 pczarn

What you call distances, I call item origins, after Jeffrey Kegler. I found that distances can be removed with very little added complication to the algorithm.

In a small test, I found that I can retrieve 7000 out of 8500 Earley items from the cache; the other 1500 require unique computation.

pczarn avatar Mar 14 '20 18:03 pczarn

Sorry for the slow response. I was (and still am) a bit busy.

Thank you for the update. Although I have not worked on this topic for a long time, it is still interesting to me.

I am not familiar with Jeffrey Kegler's implementation, although he has done a lot for the recent popularization of Earley parsers, and I should have paid more attention to his approach.

Considering the meaning of origin, I suspect distance is a bit different. An origin is probably the absolute number of the set (in the working list), while distances are relative. Since a parsed string may have quite similar substrings, the distances can coincide while the origins don't. So a vector of distances has a high probability of occurring many times and can be stored as a single instance. But maybe I am wrong and his origins are the same as my distances.
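In code, my conjecture is the following (a small sketch; the names are mine):

```rust
// An origin is the absolute index of the Earley set where a rule started;
// a distance is relative to the current set.
fn distance(current_set: usize, origin: usize) -> usize {
    current_set - origin
}

fn main() {
    // Similar substrings at different positions: the origins differ, the
    // distances coincide, so one distance vector can serve both places.
    assert_eq!(distance(10, 7), 3);
    assert_eq!(distance(25, 22), 3);
}
```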

vnmakarov avatar Mar 16 '20 18:03 vnmakarov

New measurements show that only 610 out of 12600 Earley sets can be cached on the C benchmark. The cache is huge, so it is probably not worth using.

I will try to cache completions.

pczarn avatar Mar 23 '20 13:03 pczarn

My benchmarks include processing of the grammar.

The new measurements were wrong; I can skip processing for 7000 out of 8500 Earley items. I'm having some issues with the Patricia trie implementation, but I will resolve them and get new benchmarks running within a few days.

pczarn avatar Apr 03 '20 16:04 pczarn