GH-600 parallel parsing of NQUADS and N-Triples

Open hmottestad opened this issue 8 months ago • 3 comments

Issue resolved (if any): #600

Description of this pull request:


Please check all the lines before posting the pull request:

  • [ ] I've created tests for all my changes
  • [ ] My pull request isn't fixing or changing multiple unlinked elements (please create one pull request for each element)
  • [ ] I've applied the code formatter (mvn formatter:format on the backend, npm run format on the frontend) before posting my pull request, mvn formatter:validate to validate the formatting on the backend, npm run validate on the frontend
  • [ ] All my commits have relevant names
  • [ ] I've squashed my commits (if necessary)

hmottestad avatar Mar 20 '25 14:03 hmottestad

I timed the conversion of latest-lexemes.nt.gz from https://dumps.wikimedia.org/wikidatawiki/entities/ on an M3 Max with 16 cores. It originally took 11 minutes and now takes 7 minutes.

Before

[Screenshot 2025-03-20 at 14 38 55]

After

[Screenshot 2025-03-20 at 14 55 11]

hmottestad avatar Mar 20 '25 14:03 hmottestad

A few of the tests assumed that the RDF parser would return statements in a fixed and predictable order.

I fixed a couple of them, but then concluded that it is probably best to have a way to enable or disable parallel parsing.
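To illustrate the ordering issue and the enable/disable switch, here is a minimal sketch rather than the code in this pull request. The class `ParallelLineParsingSketch`, its `parse` method, and the `parallel` flag are hypothetical names, and the line splitter is deliberately naive. With the flag off, statements arrive in file order; with it on, the arrival order is not guaranteed, so a test should compare statements as a set instead of by position.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.stream.Stream;

// Hypothetical sketch; none of these names come from qEndpoint or RDF4J.
class ParallelLineParsingSketch {

    // Naive split of one simple N-Triples line into [subject, predicate, object].
    // A real parser must also handle escapes, blank nodes, and validation.
    static List<String> parseLine(String line) {
        String trimmed = line.trim();
        if (trimmed.endsWith(".")) {
            // Drop the statement terminator.
            trimmed = trimmed.substring(0, trimmed.length() - 1).trim();
        }
        String[] parts = trimmed.split("\\s+", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Malformed line: " + line);
        }
        return List.of(parts[0], parts[1], parts[2]);
    }

    // Parses all non-empty, non-comment lines. When 'parallel' is true the
    // lines are handled by several threads and the result order may vary.
    static List<List<String>> parse(List<String> lines, boolean parallel) {
        // Thread-safe sink, because forEach on a parallel stream calls it concurrently.
        Queue<List<String>> sink = new ConcurrentLinkedQueue<>();
        Stream<String> stream = parallel ? lines.parallelStream() : lines.stream();
        stream.filter(l -> !l.trim().isEmpty() && !l.trim().startsWith("#"))
              .forEach(l -> sink.add(parseLine(l)));
        return new ArrayList<>(sink);
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "<http://example.org/s1> <http://example.org/p> \"a\" .",
                "# a comment line",
                "<http://example.org/s2> <http://example.org/p> <http://example.org/o> .");
        List<List<String>> sequential = parse(lines, false);
        List<List<String>> concurrent = parse(lines, true);
        // Order-insensitive comparison, as an order-agnostic test would do.
        boolean same = new HashSet<>(sequential).equals(new HashSet<>(concurrent));
        System.out.println("Same statements regardless of order: " + same);
    }
}
```

A set comparison ignores duplicate statements; if duplicates matter to a test, sorting both lists before comparing them would be one alternative.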

All the tests are passing now, but I still need to double-check that the performance is as good as expected.

Can you start testing it, @ate47?

hmottestad avatar Mar 22 '25 09:03 hmottestad

You may also want to take a look at the ExceptionThread class. I wrote it to bind threads together while keeping track of their exceptions.
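For readers unfamiliar with the idea, the sketch below shows one generic way to run a group of threads together and collect their exceptions. It is not the ExceptionThread API: `LinkedThreadsSketch`, `ThrowingTask`, and `runAll` are hypothetical names, and the design (interrupt the siblings on failure, then rethrow with the failures attached as suppressed exceptions) is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch; this is not qEndpoint's ExceptionThread.
class LinkedThreadsSketch {

    // A task that is allowed to throw checked exceptions.
    interface ThrowingTask {
        void run() throws Exception;
    }

    // Runs every task on its own thread, waits for all of them, and throws a
    // single exception carrying each failure as a suppressed exception.
    static void runAll(List<ThrowingTask> tasks) throws Exception {
        List<Throwable> failures = new CopyOnWriteArrayList<>();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < tasks.size(); i++) {
            ThrowingTask task = tasks.get(i);
            threads.add(new Thread(() -> {
                try {
                    task.run();
                } catch (Throwable t) {
                    failures.add(t);
                    // Ask the sibling threads to stop. Threads that have not
                    // started yet may not observe this interrupt.
                    for (Thread other : threads) {
                        if (other != Thread.currentThread()) {
                            other.interrupt();
                        }
                    }
                }
            }, "linked-worker-" + i));
        }
        // The list is complete before any thread starts, so workers can read it safely.
        for (Thread t : threads) {
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }
        if (!failures.isEmpty()) {
            Exception combined = new Exception("One or more linked threads failed");
            failures.forEach(combined::addSuppressed);
            throw combined;
        }
    }

    public static void main(String[] args) {
        ThrowingTask ok = () -> System.out.println("worker finished");
        ThrowingTask failing = () -> {
            throw new IllegalStateException("simulated parse error");
        };
        try {
            runAll(List.of(ok, failing));
        } catch (Exception e) {
            System.out.println(e.getMessage() + ", failures: " + e.getSuppressed().length);
        }
    }
}
```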

ate47 avatar May 15 '25 12:05 ate47