jsonld-context-parser.js
feat: context cache
Creates a context caching mechanism. The use case for this is parsing numerous JSON-LD objects that all share the same context object, where context parsing is often the main bottleneck; see the performance results below:
```
Parse a context that has not been cached, and without caching in place x 108 ops/sec ±0.32% (86 runs sampled)
Parse a list of iri contexts that have been cached x 79,985 ops/sec ±0.44% (90 runs sampled)
Parse a context object that has not been cached x 1,950 ops/sec ±1.36% (90 runs sampled)
Parse a context object that has been cached x 7,637 ops/sec ±0.20% (91 runs sampled)
```
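As a rough illustration of the intended usage (the constructor option shown is hypothetical; as discussed below, caching ends up opt-in rather than on by default):

```typescript
import { ContextParser } from 'jsonld-context-parser';

// Reuse a single parser across many documents so parsed contexts can be
// served from its cache after the first parse.
// NOTE: the option is left as a placeholder; the exact opt-in API is
// discussed further down in this thread.
const parser = new ContextParser({ /* hypothetical: enable context caching */ });

const context = { '@vocab': 'http://schema.org/' };

// The first call parses and caches; the second call with the same context
// can be answered from the cache instead of being re-parsed.
const first = await parser.parse(context);
const second = await parser.parse(context);
```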
Pull Request Test Coverage Report for Build 6666516387
- 53 of 53 (100.0%) changed or added relevant lines in 3 files are covered.
- No unchanged relevant lines lost coverage.
- Overall coverage decreased (-0.1%) to 99.904%
| Totals | |
| --- | --- |
| Change from base Build 6492290900: | -0.1% |
| Covered Lines: | 570 |
| Relevant Lines: | 570 |
💛 - Coveralls
This looks very neat, thanks @jeswr!
Before I do an in-depth review, I have some general questions.
- What is the overall performance impact of this when caching is not very valuable? Might be useful to run the parser perf checks again with and without this cache. I suspect hashing to have some overhead.
- Do you have insights on memory consumption? My fear is that for parsers that are reused, many (potentially large) contexts can be cached. I know there's the LRU cache, but I wonder if we should consider disabling the cache by default (not sure tbh).
- Do all JSON-LD spec tests still pass with this change?
Pinging @sdevalk, as this will interest him regarding https://github.com/rubensworks/rdf-dereference.js/issues/48
> What is the overall performance impact of this when caching is not very valuable? Might be useful to run the parser perf checks again with and without this cache. I suspect hashing to have some overhead.
In the worst case it seems to be an extra 50% overhead. It can be worked around by working by-reference rather than by hash most of the time (see https://github.com/inrupt/solid-client-vc-js/blob/0f8ce276b6ea8a977b9d2ea189bc92385ef44b48/src/parser/jsonld.ts#L70-L111); however, the danger here is that you accidentally consume a large amount of memory by using contexts as keys, so I'm leaving this for a subsequent piece of work once better benchmarking is in place and we have a good way of pruning them quickly.
FYI the custom caching mechanism in https://github.com/inrupt/solid-client-vc-js/blob/0f8ce276b6ea8a977b9d2ea189bc92385ef44b48/src/parser/jsonld.ts reduced the time on e2e tests from 20 minutes to 20 seconds.
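For illustration, a minimal sketch of that by-reference idea (not the solid-client-vc-js code itself); the comment marks exactly the memory risk described above:

```typescript
import { ContextParser, JsonLdContext, JsonLdContextNormalized } from 'jsonld-context-parser';

const parser = new ContextParser();

// Keyed on object identity: a lookup is a pointer comparison, so the
// hot path pays no hashing cost.
const byReference = new Map<JsonLdContext, JsonLdContextNormalized>();

async function parseByReference(context: JsonLdContext): Promise<JsonLdContextNormalized> {
  let parsed = byReference.get(context);
  if (!parsed) {
    parsed = await parser.parse(context);
    // Danger: the Map strongly retains every context used as a key, so
    // distinct-but-equal context objects accumulate until pruned.
    byReference.set(context, parsed);
  }
  return parsed;
}
```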
> Do you have insights on memory consumption? My fear is that for parsers that are reused, many (potentially large) contexts can be cached. I know there's the LRU cache, but I wonder if we should consider disabling the cache by default (not sure tbh).
I've disabled caching by default in https://github.com/rubensworks/jsonld-context-parser.js/pull/70/commits/b48cc072b401547d91af8b87bcbcb883c073b34f, so we will need to update jsonld-streaming-parser to allow custom contexts / caches to be passed in.
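Something along these lines, assuming the parser grows an option for a caller-owned cache (the `contextCache` option name here is hypothetical):

```typescript
import { ContextParser } from 'jsonld-context-parser';
import { LRUCache } from 'lru-cache';

// The caller constructs and owns the cache, bounding how many parsed
// contexts a long-lived parser may retain.
const contextCache = new LRUCache<string, object>({ max: 100 });

// Hypothetical wiring: caching is opt-in, so a parser built without a
// cache behaves exactly as before.
const parser = new ContextParser({ contextCache } as any);
```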
> Do all JSON-LD spec tests still pass with this change?
I'm not seeing rdf-test-suite as a dev dependency of this library, or any commands to run spec testing. Do you have a script somewhere to run spec tests on this lib?
> …the danger here is that you accidentally consume a large amount of memory by using contexts as keys, so I'm leaving this for a subsequent piece of work once better benchmarking is in place and we have a good way of pruning them quickly.
I suspect that caching by reference would consume less memory than hashing, as caching is only done on pointers to shared memory.
> …reduced the time on e2e tests from 20 minutes to 20 seconds.
Ooh, nice!
> I'm not seeing rdf-test-suite as a dev dependency of this library, or any commands to run spec testing. Do you have a script somewhere to run spec tests on this lib?
This will have to be tested by manually plugging this into jsonld-streaming-parser.
> …as caching is only done on pointers to shared memory.
My main concern here is that if we have a poor configuration of the lru-cache, then we are preventing GC on context objects that have already been parsed and may no longer be relevant.
> My main concern here is that if we have a poor configuration of the lru-cache, then we are preventing GC on context objects that have already been parsed and may no longer be relevant.
I would suspect lru-cache to be very well battle-tested by now, but I don't know enough of its internals to make any hard claims :-)
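For completeness, a sketch of a by-reference cache that cannot prevent GC at all, since WeakMap holds its keys weakly (an alternative direction, not what this PR implements):

```typescript
import { ContextParser, JsonLdContext, JsonLdContextNormalized } from 'jsonld-context-parser';

const parser = new ContextParser();

// A WeakMap entry does not keep its key alive: once a context object is
// unreachable elsewhere, its cached value becomes collectable too, so no
// lru-cache sizing is needed to avoid retaining stale contexts.
// (Only object contexts can be keyed this way; IRI strings would still
// need a separate, explicitly pruned map.)
const cache = new WeakMap<object, Promise<JsonLdContextNormalized>>();

function parseWeakly(context: object): Promise<JsonLdContextNormalized> {
  let parsed = cache.get(context);
  if (!parsed) {
    // Caching the promise also deduplicates concurrent parses of the
    // same context object.
    parsed = parser.parse(context as JsonLdContext);
    cache.set(context, parsed);
  }
  return parsed;
}
```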