wikibase-dump-filter

Try simdjson to speed up parsing

Open nichtich opened this issue 5 years ago • 3 comments

See https://github.com/luizperes/simdjson_nodejs

nichtich avatar Jun 10 '19 13:06 nichtich

Did you try it? I made a test branch but couldn't find much of a performance boost: on my machine, parsing 100000 entities took 64s with simdjson in non-lazy mode versus 55s with JSON.parse. That's a bit short to call it a benchmark, but it's not as encouraging as the module description promises ^^
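
For anyone who wants to repeat this kind of comparison, here is a minimal timing sketch. It is not the code from the test branch: `timeParse` is a hypothetical helper, and the sample lines are made up rather than taken from a real dump. Another parser's parse function could be passed in place of `JSON.parse` to compare the two on the same input.

```javascript
// Hypothetical helper: measure how long parseFn takes to parse every line
function timeParse (lines, parseFn) {
  const start = process.hrtime.bigint()
  let count = 0
  for (const line of lines) {
    parseFn(line)
    count++
  }
  // hrtime.bigint() returns nanoseconds; convert the difference to milliseconds
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6
  return { count, elapsedMs }
}

// Made-up sample data standing in for real entity lines
const lines = Array.from({ length: 1000 }, (_, i) => JSON.stringify({
  id: `Q${i}`,
  type: 'item',
  labels: { en: { language: 'en', value: `label ${i}` } }
}))

const result = timeParse(lines, JSON.parse)
console.log(`${result.count} lines parsed in ${result.elapsedMs.toFixed(1)}ms`)
```

With inputs this small the numbers are mostly noise, so a real comparison should run over a large slice of an actual dump.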

maxlath avatar Jun 10 '19 17:06 maxlath

The performance gain comes in lazy mode, which requires some deeper changes to the filter function. The basic idea is to minimize copying objects, so that only selected parts of the JSON structure are actually converted to JavaScript objects. This further requires moving the splitting of the input stream into lines and the removal of trailing commas into simdjson (similar to ja2l).
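
To make the line handling concrete, here is a minimal sketch in plain JavaScript of what would have to move into the parser. It assumes the usual Wikibase dump layout: one JSON array with one entity per line, each line but the last ending with a comma. `cleanDumpLine` and `parseDumpLines` are hypothetical names, not functions from wikibase-dump-filter or simdjson.

```javascript
// Hypothetical helper: turn one raw dump line into a standalone JSON string,
// or return null if the line carries no entity
function cleanDumpLine (line) {
  const trimmed = line.trim()
  // The dump is one big JSON array: skip its opening and closing brackets
  if (trimmed === '' || trimmed === '[' || trimmed === ']') return null
  // Drop the trailing comma that separates array elements
  return trimmed.endsWith(',') ? trimmed.slice(0, -1) : trimmed
}

// Hypothetical helper: fully parse only the lines that hold an entity
function parseDumpLines (text) {
  const entities = []
  for (const line of text.split('\n')) {
    const json = cleanDumpLine(line)
    if (json !== null) entities.push(JSON.parse(json))
  }
  return entities
}

const sample = '[\n{"id":"Q1","type":"item"},\n{"id":"Q2","type":"item"}\n]'
console.log(parseDumpLines(sample).map(entity => entity.id))
```

Here every entity is still fully converted by `JSON.parse`; the point of lazy mode would be to replace that step with access to only the fields the filter needs.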

nichtich avatar Jun 10 '19 18:06 nichtich

I pointed the simdjson developers to our use case. This likely requires extensions to simdjson (and its Node binding), but should definitely result in a speedup.

nichtich avatar Jun 12 '19 07:06 nichtich