ddlog icon indicating copy to clipboard operation
ddlog copied to clipboard

DDlog parser needs a rewrite using a faster parser combinator library

Open netj opened this issue 9 years ago • 10 comments

I first thought it was a JVM booting latency issue and optimized that a bit, but didn't help much (1-2 seconds saving after first launch). I just did a dumb profiling, and it shows the parser is unreasonably slow, taking more than 99% of the time (10.70s/10.75s and 11.37s/11.48s), and unfortunately done twice from deepdive's compiler.

2016-01-22 04:45:29.792846 Parsing DeepDive application (/afs/cs.stanford.edu/u/netj/lfs1/dd-genomics) to generate:
2016-01-22 04:45:29.792864  run/compiled/schema.json
2016-01-22 04:45:29.792882   from app.ddlog
2016-01-22 04:45:29.792895 parsing
2016-01-22 04:45:40.497900 parsed
2016-01-22 04:45:40.540621 handled
2016-01-22 04:45:40.541898  run/compiled/deepdive.conf
2016-01-22 04:45:40.542026   from app.ddlog
2016-01-22 04:45:40.560282 parsing
2016-01-22 04:45:51.934837 parsed
2016-01-22 04:45:52.046044 handled

The ddlog code is not ridiculously large, ~1411 lines including comments:

$ wc app.ddlog 
 1411  4710 55034 app.ddlog

According to FastParse, another parser combinator library, maybe that's an expected speed for Scala's parser combinator we're currently using.

Up to 1/5 the speed of a hand-written parser, 100x faster than scala-parser-combinators, comparable (though slightly slower than) Parboiled2

So, it's definitely the right time to rewrite/migrate the DDlog parser with a new library. With 100x speedup, most things should be done under a second, and my first JVM optimization will become worthwhile.

netj avatar Jan 22 '16 13:01 netj

@feiranwang Do you have free cycles to take a crack at this asap? @HazyResearch/genomics guys are heavily affected by this.

netj avatar Jan 22 '16 13:01 netj

yikes... this is insane. We need to fix this asap...

chrismre avatar Jan 22 '16 13:01 chrismre

Sry might this be because of a bloated run/ directory? One thing I noticed is that on "old" dd directories, the run can take ages, but if I clone the thing and compile, it's fast enough for normal operation by far

Colossus avatar Jan 22 '16 15:01 Colossus

Whoever is interesting in profiling should go on raiders7 and copy one of my dd-genomics directories /lfs/raiders7/0/jbirgmei/dd-genomics2 and try around there

Colossus avatar Jan 22 '16 17:01 Colossus

Looking into this. Parsing this file takes ~8s locally. I'll check out FastParse.

feiranwang avatar Jan 22 '16 19:01 feiranwang

I'm almost sure this is not an issue with the parser itself. There is no way to write an ANTLR grammar that translates to something that slow

Colossus avatar Jan 22 '16 19:01 Colossus

The combinators don't produce the more traditional LR or LALR parser but just give a recursive descendent parser, hence the slowness is very plausible.

netj avatar Jan 22 '16 20:01 netj

I rewrote the parser using parboiled2, and tried on the genomics. It takes about ~400ms to parse the app.ddlog, compared to ~8s using scala parser combinator.

feiranwang avatar Jan 27 '16 02:01 feiranwang

@feiranwang You can probably also plug a bit about what happened when the input was repeated 10x?

netj avatar Jan 27 '16 02:01 netj

When the input was repeated for 10 times, the new parser finished in ~1s, but the scala parser combinator took more than 5min...

feiranwang avatar Jan 27 '16 03:01 feiranwang