jsonld-context-parser.js
feat: context cache
Creates a context caching mechanism. The use case for this is parsing numerous JSON-LD objects that all share the same context object, where context parsing is often the main bottleneck; see the performance results below:
```
Parse a context that has not been cached, and without caching in place x 108 ops/sec ±0.32% (86 runs sampled)
Parse a list of iri contexts that have been cached x 79,985 ops/sec ±0.44% (90 runs sampled)
Parse a context object that has not been cached x 1,950 ops/sec ±1.36% (90 runs sampled)
Parse a context object that has been cached x 7,637 ops/sec ±0.20% (91 runs sampled)
```
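As a rough illustration of the intended usage (the constructor option shown is hypothetical; as discussed below, caching ends up opt-in rather than on by default):

```typescript
import { ContextParser } from 'jsonld-context-parser';

// Reuse a single parser across many documents so parsed contexts can be
// served from its cache after the first parse.
// NOTE: the option is left as a placeholder; the exact opt-in API is
// discussed further down in this thread.
const parser = new ContextParser({ /* hypothetical: enable context caching */ });

const context = { '@vocab': 'http://schema.org/' };

// The first call parses and caches; the second call with the same context
// can be answered from the cache instead of being re-parsed.
const first = await parser.parse(context);
const second = await parser.parse(context);
```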
Pull Request Test Coverage Report for Build 6666516387
- 53 of 53 (100.0%) changed or added relevant lines in 3 files are covered.
- No unchanged relevant lines lost coverage.
- Overall coverage decreased (-0.1%) to 99.904%
| Totals | |
| --- | --- |
| Change from base Build 6492290900: | -0.1% |
| Covered Lines: | 570 |
| Relevant Lines: | 570 |
💛 - Coveralls
This looks very neat, thanks @jeswr!
Before I do an in-depth review, I have some general questions.
- What is the overall performance impact of this when caching is not very valuable? Might be useful to run the parser perf checks again with and without this cache. I suspect hashing to have some overhead.
- Do you have insights on memory consumption? My fear is that for parsers that are reused, many (potentially large) contexts can be cached. I know there's the LRU cache, but I wonder if we should consider disabling the cache by default (not sure tbh).
- Do all JSON-LD spec tests still pass with this change?
Pinging @sdevalk, as this will interest him regarding https://github.com/rubensworks/rdf-dereference.js/issues/48
> What is the overall performance impact of this when caching is not very valuable? Might be useful to run the parser perf checks again with and without this cache. I suspect hashing to have some overhead.
In the worst case it seems to be an extra 50% overhead. It can be worked around by working by-reference rather than by hash most of the time (see https://github.com/inrupt/solid-client-vc-js/blob/0f8ce276b6ea8a977b9d2ea189bc92385ef44b48/src/parser/jsonld.ts#L70-L111); however, the danger here is that you accidentally consume a large amount of memory by using contexts as keys, so I'm leaving this for a subsequent piece of work once better benchmarking is in place and we have a good way of pruning them quickly.
FYI the custom caching mechanism in https://github.com/inrupt/solid-client-vc-js/blob/0f8ce276b6ea8a977b9d2ea189bc92385ef44b48/src/parser/jsonld.ts reduced the time on e2e tests from 20 minutes to 20 seconds.
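For illustration, a minimal sketch of that by-reference idea (not the solid-client-vc-js code itself); the comment marks exactly the memory risk described above:

```typescript
import { ContextParser, JsonLdContext, JsonLdContextNormalized } from 'jsonld-context-parser';

const parser = new ContextParser();

// Keyed on object identity: a lookup is a pointer comparison, so the
// hot path pays no hashing cost.
const byReference = new Map<JsonLdContext, JsonLdContextNormalized>();

async function parseByReference(context: JsonLdContext): Promise<JsonLdContextNormalized> {
  let parsed = byReference.get(context);
  if (!parsed) {
    parsed = await parser.parse(context);
    // Danger: the Map strongly retains every context used as a key, so
    // distinct-but-equal context objects accumulate until pruned.
    byReference.set(context, parsed);
  }
  return parsed;
}
```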
> Do you have insights on memory consumption? My fear is that for parsers that are reused, many (potentially large) contexts can be cached. I know there's the LRU cache, but I wonder if we should consider disabling the cache by default (not sure tbh).
I've disabled caching by default in https://github.com/rubensworks/jsonld-context-parser.js/pull/70/commits/b48cc072b401547d91af8b87bcbcb883c073b34f, so we will need to update jsonld-streaming-parser to allow custom contexts / caches to be passed in.
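Something along these lines, assuming the parser grows an option for a caller-owned cache (the `contextCache` option name here is hypothetical):

```typescript
import { ContextParser } from 'jsonld-context-parser';
import { LRUCache } from 'lru-cache';

// The caller constructs and owns the cache, bounding how many parsed
// contexts a long-lived parser may retain.
const contextCache = new LRUCache<string, object>({ max: 100 });

// Hypothetical wiring: caching is opt-in, so a parser built without a
// cache behaves exactly as before.
const parser = new ContextParser({ contextCache } as any);
```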
> Do all JSON-LD spec tests still pass with this change?
I'm not seeing rdf-test-suite as a dev dependency of this library, or any commands to run spec testing. Do you have a script somewhere to run spec tests on this lib?
> …the danger here is that you accidentally consume a large amount of memory by using contexts as keys, so I'm leaving this for a subsequent piece of work once better benchmarking is in place and we have a good way of pruning them quickly.
I suspect that caching by reference would consume less memory than hashing, as caching is only done on pointers to shared memory.
> …reduced the time on e2e tests from 20 minutes to 20 seconds.
Ooh, nice!
> I'm not seeing rdf-test-suite as a dev dependency of this library, or any commands to run spec testing. Do you have a script somewhere to run spec tests on this lib?
This will have to be tested by manually plugging this into jsonld-streaming-parser.
> …as caching is only done on pointers to shared memory.
My main concern here is that if we have a poor configuration of the lru-cache, then we are preventing GC on context objects that have already been parsed and may no longer be relevant.
> My main concern here is that if we have a poor configuration of the lru-cache, then we are preventing GC on context objects that have already been parsed and may no longer be relevant.
I would suspect lru-cache to be very well battle-tested by now, but I don't know enough of its internals to make any hard claims :-)
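For completeness, a sketch of a by-reference cache that cannot prevent GC at all, since WeakMap holds its keys weakly (an alternative direction, not what this PR implements):

```typescript
import { ContextParser, JsonLdContext, JsonLdContextNormalized } from 'jsonld-context-parser';

const parser = new ContextParser();

// A WeakMap entry does not keep its key alive: once a context object is
// unreachable elsewhere, its cached value becomes collectable too, so no
// lru-cache sizing is needed to avoid retaining stale contexts.
// (Only object contexts can be keyed this way; IRI strings would still
// need a separate, explicitly pruned map.)
const cache = new WeakMap<object, Promise<JsonLdContextNormalized>>();

function parseWeakly(context: object): Promise<JsonLdContextNormalized> {
  let parsed = cache.get(context);
  if (!parsed) {
    // Caching the promise also deduplicates concurrent parses of the
    // same context object.
    parsed = parser.parse(context as JsonLdContext);
    cache.set(context, parsed);
  }
  return parsed;
}
```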