
The parsing of larger TTL files seems to take a big performance hit from v1.2.x on

HugaertsDries opened this issue 4 years ago · 6 comments

When trying to upgrade from v1.1.x to v1.2.x, I noticed a big performance hit when parsing files larger than 100 kB (122.9 kB, to be exact). Did something change in how files should be parsed?

The code used is a variant of the following:

import { graph as rdflibGraph, parse as rdflibParse } from 'rdflib';

const SOURCE_GRAPH = 'http://data.lblod.info/graphs/submission';

export function parse(sourceTtl) {
    let store = rdflibGraph();
    rdflibParse(sourceTtl, store, SOURCE_GRAPH, 'text/turtle');
    return store;
}
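
A minimal sketch of how the slowdown can be measured across versions (the measureParse helper is hypothetical, and the stub parser below only stands in for rdflib's parse so the sketch is self-contained):

```javascript
// Hypothetical timing helper: runs a parse function on a TTL string and
// reports the elapsed wall-clock time in milliseconds.
function measureParse(parseFn, sourceTtl, label) {
  const start = Date.now();
  parseFn(sourceTtl);
  const elapsed = Date.now() - start;
  console.log(`${label}: ${elapsed} ms`);
  return elapsed;
}

// Stub parser used here for illustration only; substitute rdflib's parse.
// It pretends each non-empty line is one triple.
function stubParse(sourceTtl) {
  return sourceTtl.split('\n').filter((line) => line.trim() !== '').length;
}

measureParse(stubParse, 'a b c .\nd e f .', 'stub');
```

Running the same harness against the real parse under v1.1.x and v1.2.x on the same file should make the regression easy to quantify.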

Thx in advance!

HugaertsDries avatar May 13 '20 14:05 HugaertsDries

Hmm, I suspect this might be tied to my changes in https://github.com/linkeddata/rdflib.js/commit/6d6284f2a18a98b8fad38a3ad812650f074507d2 =\ I don't know if you're able to test?

@timbl Maybe you have capacity?

megoth avatar May 13 '20 15:05 megoth

The additional call to canon() you suggested sounds like it could be it.

timbl avatar May 15 '20 07:05 timbl

@megoth any suggestions on how I could test it?

HugaertsDries avatar May 15 '20 15:05 HugaertsDries

You could comment out the calls to canon and the update to this.index in src/store.ts in this function:

  add (
    subj: Quad_Subject | Quad | Quad[] | Statement | Statement[],
    pred?: Quad_Predicate,
    obj?: Term | string,
    why?: Quad_Graph
  ): Quad | null | IndexedFormula ...

If you want to play around directly with the JS (avoid the babel step), you could look in lib/store.js for

    key: "add",
    value: function add(subj, pred, obj, why) ...

If you're in a browser, you may want to disable minification by adding this to webpack.config.js:

optimization: {minimize: false},
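
As an alternative to editing the source, the cost of add could also be isolated by wrapping the method with a timer at runtime. A rough sketch, assuming a store-like object with an add method (the stubStore and timeMethod names here are illustrative, not rdflib API):

```javascript
// Rough sketch: wrap an object's method so every call is timed and counted.
// In practice you would wrap an rdflib store's `add`; `stubStore` below is
// only a placeholder so the sketch runs on its own.
function timeMethod(obj, name) {
  const original = obj[name];
  let totalMs = 0;
  let calls = 0;
  obj[name] = function (...args) {
    const start = Date.now();
    const result = original.apply(this, args);
    totalMs += Date.now() - start;
    calls += 1;
    return result;
  };
  // Returns a reporter so the totals can be read after a parse run.
  return () => ({ calls, totalMs });
}

const stubStore = { add(s, p, o) { return [s, p, o]; } };
const report = timeMethod(stubStore, 'add');
stubStore.add('a', 'b', 'c');
stubStore.add('d', 'e', 'f');
console.log(report());
```

Comparing the reported total against the overall parse time should show how much of the regression is spent inside add (and, by extension, in the canon calls it makes).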

ericprud avatar May 15 '20 17:05 ericprud

Hi everyone, any update on this? I'm also experiencing significant performance hits (up to 10x slower) when parsing large RDF/XML files (tens of MB).

For instance, the NAL thesaurus (https://agclass.nal.usda.gov/downloads/NAL_Thesaurus_2020_SKOS.zip?agree3=on&image.x=45&image.y=15) takes more than 2 minutes to parse on my laptop, while it used to take 20 to 30 seconds on previous versions (I was on 1.0.6 before upgrading to 1.2.2).

TommasoBianchi avatar Jun 05 '20 15:06 TommasoBianchi

Up.

TommasoBianchi avatar Jul 28 '20 13:07 TommasoBianchi