Policy on Agent/LLM-written code
Hi,
We've been using this package for a long time in a number of projects. We've found that some methods get quite slow on large documents, especially in jsonld.flatten.
I've had some success profiling the code and getting analysis from GPT Codex, which highlights some good places to work on to optimize this method and other util methods.
I'd be happy to contribute some fixes back, but wanted to check whether you have a policy on accepting code co-written with an agent. If you'd prefer to keep such code out of the project, no problem: I'd be happy to take the profiling output and contribute code written by hand instead.
Here are some of the areas Codex suggested working on when I asked it to look for inefficient parts of the algorithm. They aren't specifically related to the question in this issue, but I'd be happy to break them out into a few separate issues about algorithm optimization if that's useful.
- lib/jsonld.js (lines 476-478) – jsonld.frame expands the frame (expandedFrame) and then immediately walks every original frame key, re-running _expandIri just to see if anything aliases @graph. That repeats context lookups for every property even though the expanded frame already has canonical keywords. Checking expandedFrame (or caching a key→expanded-key map once) would avoid the redundant O(m) re-expansion step on large frames.
- lib/jsonld.js (lines 369-378) – jsonld.flatten always calls jsonld.expand even if the caller already provided expanded input; there's no skipExpansion path like the other APIs expose. On big documents flattened repeatedly or fed from an existing expand pass, we pay the cost (and memory churn) of re-expansion every time. Honor a skipExpansion option and pass the data directly to _flatten when possible (see the first sketch after this list).
- lib/frame.js (lines 524-575) – _filterSubject repeatedly calls util.getValues(frame, key) and related helpers while it examines each subject. getValues clones arrays ([].concat), so we copy the same frame array once per subject and sometimes twice per property (lines 527 and 571). Converting each frame entry to a cached object (or precomputing Sets for items like @id/@type) when the frame is parsed would eliminate those per-subject scans and array copies, turning the matching loop from O(n*m) cloning into O(n+m).
- lib/frame.js (lines 347-360 and 720-732) – Both _cleanupNull and _cleanupPreserve store options.link[id] as an array and use indexOf to test whether a node has already been visited. When the same node is referenced k times, those membership tests become O(k) and the cleanup step degrades toward quadratic. Tracking each id's visited nodes in a Set (or a Map keyed by object identity) removes the repeated scans (see the Set sketch after this list).
- lib/frame.js (lines 735-739) – options.bnodesToClear is left as a plain array and .includes is executed for every property encountered during _cleanupPreserve. If there are many blank nodes, this is another O(n²) hotspot. Converting the list to a Set immediately after jsonld.frame computes it (before passing it into _cleanupPreserve) brings the membership test down to O(1).
- lib/flatten.js (lines 24-35) – _flatten always sorts Object.keys(defaultGraph) before emitting nodes. Sorting gives deterministic output but costs O(n log n) even though JSON-LD flattening does not require order guarantees. On very large graphs the sort dominates runtime; consider leaving the keys in insertion order by default, or making sorting optional (see the last sketch after this list), to keep the operation linear.
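To make the skipExpansion idea concrete, here's a minimal sketch of the shape the change could take. It is not the library's actual code: the _flatten stub below only stands in for the internal helper in lib/flatten.js, and I'm assuming flatten could accept the option the way other APIs already do.

```js
const jsonld = require('jsonld');

// Placeholder for lib/flatten.js's _flatten; the real helper builds a node
// map and merges graphs. Stubbed here so the sketch is self-contained.
function _flatten(expandedInput) {
  return expandedInput;
}

// Hypothetical flatten that honors a skipExpansion option.
async function flattenWithSkip(input, ctx, options = {}) {
  // Trust the caller's already-expanded input instead of re-expanding it.
  const expanded = options.skipExpansion ?
    input : await jsonld.expand(input, options);

  const flattened = _flatten(expanded);
  if(ctx === null || ctx === undefined) {
    return flattened;
  }
  // Compact the flattened output against the supplied context.
  return jsonld.compact(flattened, ctx,
    {...options, graph: true, skipExpansion: true});
}
```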
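For the options.link and bnodesToClear membership tests, the change is essentially swapping arrays scanned with indexOf/includes for Sets. A rough sketch with illustrative names (markVisitedArray/markVisitedSet and bnodesToClearSet are mine, not the library's):

```js
// Current pattern (simplified): every membership test scans the array.
function markVisitedArray(link, id, node) {
  if(!(id in link)) {
    link[id] = [];
  }
  if(link[id].indexOf(node) === -1) { // O(k) scan per repeated reference
    link[id].push(node);
  }
}

// Proposed pattern: constant-time identity check via a Set per id.
function markVisitedSet(link, id, node) {
  let visited = link[id];
  if(!visited) {
    visited = link[id] = new Set();
  }
  visited.add(node); // Set.add ignores duplicates, no scan needed
}

// bnodesToClear: convert once, then use .has() instead of .includes().
const bnodesToClear = ['_:b0', '_:b1'];
const bnodesToClearSet = new Set(bnodesToClear);
console.log(bnodesToClearSet.has('_:b0')); // true, O(1) per test
```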
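And a sketch of what an opt-in sort in the _flatten step might look like; the ordered flag and the function name are hypothetical, and the node filtering is only an approximation of what lib/flatten.js actually does:

```js
// Hypothetical flattening of the default graph with the key sort made opt-in.
function flattenDefaultGraph(defaultGraph, {ordered = false} = {}) {
  // Sorting gives deterministic output but costs O(n log n); skip it unless
  // the caller explicitly asks for ordered output.
  const keys = ordered ?
    Object.keys(defaultGraph).sort() : Object.keys(defaultGraph);

  const flattened = [];
  for(const id of keys) {
    const node = defaultGraph[id];
    // drop subject references that carry nothing but an @id
    if(node && !(Object.keys(node).length === 1 && '@id' in node)) {
      flattened.push(node);
    }
  }
  return flattened;
}
```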
Open questions / risks
If deterministic output is required somewhere else (e.g., tests relying on sorted flatten results), skipping the sort may need to be controlled via an option. Converting the various tracking arrays to Sets also deduplicates entries and changes the access API; confirm nothing depends on array semantics when walking options.link.
And when I asked it to focus specifically on createNodeMap, hasValue, and isObject:
- lib/frame.js (lines 44-64) rebuilds the node map on every jsonld.frame invocation (_createNodeMap + optional _mergeNodeMapGraphs). When the input was already flattened or when multiple frames are applied to the same expanded input, there’s no reuse hook, so the expensive walk happens repeatedly. Accepting a pre-built nodeMap (or caching state.graphMap keyed by the expanded input hash) would avoid this repeated O(n) traversal that shows up in the profiler.
- lib/frame.js (line 53) and again at lib/frame.js (lines 98-100) sort the subject ids for every recursion level. Sorting millions of ids in every nested frame results in O(n log n) overhead on top of the framing work, even though JSON-LD framing does not require deterministic ordering. If ordering is needed only for specific code paths, consider either (1) only sorting once at the root, or (2) making it optional so big documents can stay linear.
- _filterSubject in lib/frame.js (lines 524-633) calls util.getValues twice per key (nodeValues and the frame copy). getValues clones the array ([].concat(...)), so matching each subject involves allocating two new arrays per property before any comparisons happen. Reaching directly into subject[key] / frame[key] (or caching the frame values as immutable structures during expansion) would avoid the quadratic garbage creation that the profiler flags under types.isObject and friends.
- _filterSubject also scans arrays repeatedly: e.g., frame['@id'].includes(...) and frame['@type'] loop linearly for each subject. Precomputing a Set per @id/@type clause when parsing the frame would reduce matching from O(#frameValues × #subjects) down to O(#subjects). The same idea applies to wildcard default tests: we can stash flags once instead of recomputing booleans each time (see the first sketch after this list).
- lib/nodeMap.js (lines 111-214) sorts Object.keys(input) for every subject encountered during _createNodeMap. That guarantees deterministic property order but converts the per-node work from O(p) to O(p log p). On huge graphs this dominates CPU. Unless canonical ordering is explicitly required here, iterating the keys as-is or putting the sort behind an options.sortProperties flag would eliminate a large amount of time spent just ordering object keys.
- Still within _createNodeMap, every edge insertion calls util.addValue(..., {allowDuplicate: false}), which immediately invokes util.hasValue to scan the existing array (lib/util.js lines 208-224). In dense graphs those arrays grow large, so inserting k references into a property becomes O(k²). Tracking an auxiliary Set of already-added ids per subject/property (or deduplicating once at the end) would let insertions stay near O(1) (see the second sketch after this list).
- util.hasValue + util.compareValues (lib/util.js lines 208-260 and 286-318) do repeated structural comparisons every time we insert or test values, and compareValues leans on types.isObject (a profiling hotspot). When the property values are subject references we really only care about @id, so short-circuiting that case (or storing references in a Map keyed by @id) would avoid most of the expensive deep comparisons entirely.
- types.isObject (lib/types.js lines 46-63) uses Object.prototype.toString for every check. It shows up heavily because we call it inside every comparison path. A cheaper inline check (v !== null && typeof v === 'object' && !Array.isArray(v)) would remove the extra function call and string comparison. Consider inlining this fast path where tight loops (framing and node-map building) depend on it (see the last sketch after this list).
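To illustrate the @id/@type matcher idea, here's a sketch of precomputing Sets once per frame and reusing them for every subject. buildFrameMatchers and subjectMatches are hypothetical names, and this deliberately ignores the full matching rules (wildcards, match-none, value patterns) that _filterSubject implements:

```js
// Build Set-based matchers once when the frame is parsed.
function buildFrameMatchers(frame) {
  return {
    ids: new Set([].concat(frame['@id'] || [])),
    types: new Set([].concat(frame['@type'] || []))
  };
}

// Each membership test is now O(1) per subject value instead of a linear
// scan of the frame arrays for every subject in the node map.
function subjectMatches(subject, matchers) {
  const ids = [].concat(subject['@id'] || []);
  const types = [].concat(subject['@type'] || []);
  const idOk = matchers.ids.size === 0 || ids.some(id => matchers.ids.has(id));
  const typeOk = matchers.types.size === 0 ||
    types.some(t => matchers.types.has(t));
  return idOk && typeOk;
}

// usage: build the matchers once, reuse them for every subject
const matchers = buildFrameMatchers({'@type': ['http://example.org/Person']});
console.log(subjectMatches(
  {'@id': 'http://example.org/alice', '@type': ['http://example.org/Person']},
  matchers)); // true
```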
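For the addValue/hasValue hotspot, here is a sketch of tracking already-inserted reference @ids in an auxiliary Set so duplicate insertions skip the structural comparison entirely. addReference and the seen map are illustrative, not the actual util API:

```js
// Insert a subject reference, deduplicating by @id via an auxiliary Set
// instead of rescanning the existing array on every insertion.
function addReference(subject, property, ref, seen) {
  const key = `${subject['@id']}|${property}`;
  let ids = seen.get(key);
  if(!ids) {
    ids = new Set();
    seen.set(key, ids);
  }
  if(ids.has(ref['@id'])) {
    return; // duplicate reference: skip the deep comparison and the push
  }
  ids.add(ref['@id']);
  (subject[property] = subject[property] || []).push(ref);
}

// usage
const seen = new Map();
const node = {'@id': 'http://example.org/alice'};
addReference(node, 'http://example.org/knows', {'@id': '_:b0'}, seen);
addReference(node, 'http://example.org/knows', {'@id': '_:b0'}, seen); // no-op
console.log(node['http://example.org/knows'].length); // 1
```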
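And the inline isObject fast path, with the caveat that it classifies any non-array object (e.g. Date, Map) as an object, which the Object.prototype.toString check does not; whether that difference matters depends on the call sites:

```js
// Proposed inline check: no toString call or string comparison per test.
const isObjectFast = v =>
  v !== null && typeof v === 'object' && !Array.isArray(v);

// The current style of check, for comparison.
const isObjectToString = v =>
  Object.prototype.toString.call(v) === '[object Object]';

console.log(isObjectFast({a: 1}), isObjectToString({a: 1}));         // true true
console.log(isObjectFast([1]), isObjectToString([1]));               // false false
console.log(isObjectFast(new Date()), isObjectToString(new Date())); // true false
```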
Thanks for the analysis. I think we're in the same position as many other projects and people: we're not quite sure how to handle LLM-generated code at the moment, for various reasons. It would be good to write code and patches by hand for now, if possible. Using LLM tooling for analysis is more acceptable, but that often comes with suggestions, which will be hard to ignore. I imagine it will soon be impossible to avoid using those tools for everything.
Every time I work on internal bits of this code I see possible spots for optimization, probably including some of the above. I haven't had time to really check which of those would actually matter. I'm pretty sure there's a lot of duplicate IRI expansion that could be improved by caching or by passing the results around in some cases.
There is some benchmark tooling that was built in anticipation of optimization work. That is, of course, an art of its own. It's a bit bespoke at the moment, but you can benchmark a custom manifest, output EARL with stats, and use the comparison tool to see whether changes have an effect on specific examples and algorithms. Let me know if you need help with that; it wasn't well documented. The intent was to make some of it general across implementations.
It would be great to see some optimization-related patches. Ideally patches would include realistic test cases too, and we could grow that test case collection along with the fixes to avoid future performance regressions.
Thanks @davidlehn for the thoughtful feedback. These comments make sense in our context, and it seems like there are a few areas where we might be able to contribute. We don't have a lot of resources available, but because jsonld is such a key part of our flow we'd like to try to contribute a few improvements.
Can you tell me a bit more about the benchmark code? Is it something public that I can run? We'd be happy to contribute our known bad cases to a test suite so these improvements can be verified.
Given that you've answered our specific question about LLM code, we can probably close this issue (since it was about policy). As a next step, I'll profile some of our known slow code, alongside the LLM analysis, and see if we can land some specific improvements for the problems we're seeing.