Proposal for Enhanced Graph Cloning in Apache Jena
Version
5.0.0-SNAPSHOT
Feature
I've noticed that the requirement to clone a graph arises frequently in our codebase. Currently, this is our approach:
var copy = GraphFactory.createGraphMem();
graph.find().forEachRemaining(copy::add);
While this method is functional, I believe there is room for significant performance improvement. The attached chart illustrates the speed difference between the current method (findAndAddAll) and the proposed deep clone implementation (copy).
In one of my feature branches, I've implemented deep cloning for all three GraphMem2 implementations. Although I'm uncertain about how this could be integrated into the main Graph interface, it has proven to be highly beneficial for GraphMem2.
To enable this feature, I provided copy constructors for all classes under mem2.collections and mem2.store. A notable concern with this approach is the added complexity it introduces to the codebase. While this complexity is manageable in my current context, it's important for the Jena community to evaluate and decide whether this aligns with the broader project goals and standards.
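As a rough illustration of why copy constructors on the underlying collections can beat re-adding every triple, here is a minimal sketch. `TinyHashSet` is a hypothetical, heavily simplified open-addressing set, not the actual code under mem2.collections. It assumes that the saving comes from cloning the backing array instead of rehashing each element, which may or may not match the feature branch.

```java
import java.util.Arrays;

/**
 * Hypothetical, simplified open-addressing hash set (illustration only,
 * not Jena's mem2 collection). Elements must be non-null.
 */
final class TinyHashSet<E> {
    private Object[] slots;
    private int size;

    TinyHashSet() {
        this.slots = new Object[16];
    }

    /**
     * Copy constructor: clones the backing array in one pass.
     * No hashCode() or equals() call is needed, because every element
     * keeps the slot it already occupies. Elements are shared, not cloned.
     */
    TinyHashSet(TinyHashSet<E> other) {
        this.slots = Arrays.copyOf(other.slots, other.slots.length);
        this.size = other.size;
    }

    boolean add(E e) {
        // Keep the load factor at or below 0.5 so probing always terminates.
        if ((size + 1) * 2 > slots.length) {
            grow();
        }
        int i = indexFor(e, slots);
        if (slots[i] != null) {
            return false; // already present
        }
        slots[i] = e;
        size++;
        return true;
    }

    boolean contains(E e) {
        return slots[indexFor(e, slots)] != null;
    }

    int size() {
        return size;
    }

    /** Linear probing: returns the slot holding e, or the first free slot. */
    private static int indexFor(Object e, Object[] table) {
        int i = (e.hashCode() & 0x7fffffff) % table.length;
        while (table[i] != null && !table[i].equals(e)) {
            i = (i + 1) % table.length;
        }
        return i;
    }

    private void grow() {
        Object[] bigger = new Object[slots.length * 2];
        for (Object o : slots) {
            if (o != null) {
                bigger[indexFor(o, bigger)] = o;
            }
        }
        slots = bigger;
    }
}
```

With this shape, `new TinyHashSet<>(existing)` yields an independent set: later additions to `existing` do not show up in the copy, while the stored elements themselves are shared.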
Would this deep cloning feature be of interest for inclusion in the Jena project?
For reference, here is GraphMem2#copy in my feature branch: GraphMem2.java#L157
Are you interested in contributing a solution yourself?
Yes
I implemented the feature in a separate, clean branch. Any comments are welcome.
Could someone take a look at this improvement?
Unfortunately, I don't see a way to implement really fast cloning of graphs as a lightweight sidecar or in a separate module or project. I have extended the underlying collections and then implemented the function within the graphs.
If I want to use fast cloning today, I need to maintain duplicates of the source code of the collections and the graphs in my project. --> Do you see the extension as beneficial for the Jena project? --> Otherwise, I would also be happy if someone had an idea for a less invasive implementation.
> Could someone take a look at this improvement?
> Unfortunately, I don't see a way to implement really fast cloning of graphs as a lightweight sidecar or in a separate module or project. I have extended the underlying collections and then implemented the function within the graphs.
Tbh the branch doesn't look that heavyweight to me.
What I might consider is generalising your approach by adding a new interface Copyable<T> which provides a single T copy() method and having all relevant interfaces/classes extend/implement as appropriate.
You could then consider adding a new utility method Graph G.copy(Graph original) that checks whether the provided instance implements the new interface and, if so, defers to its copy() method; otherwise it falls back to G.copyGraphSrcToDst(original, GraphFactory.createGraphMem()).
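A minimal sketch of that suggestion, so the dispatch logic is concrete. To keep it self-contained it does not use Jena: `MiniGraph`, `PlainGraph`, `FastGraph` and `GraphCopies` are hypothetical stand-ins for Graph, a graph without fast copy, a GraphMem2-style graph and G, and triples are reduced to plain strings. Only the `Copyable<T>` shape (a single `T copy()` method) comes from the proposal above.

```java
import java.util.HashSet;
import java.util.Set;

/** Opt-in interface as proposed above: a single T copy() method. */
interface Copyable<T> {
    T copy();
}

/** Hypothetical stand-in for Jena's Graph, reduced to what the sketch needs. */
interface MiniGraph {
    void add(String triple);
    boolean contains(String triple);
    int size();
    Iterable<String> find();
}

/** Hypothetical stand-in for a graph that has no fast copy. */
class PlainGraph implements MiniGraph {
    protected final Set<String> triples = new HashSet<>();

    public void add(String triple) { triples.add(triple); }
    public boolean contains(String triple) { return triples.contains(triple); }
    public int size() { return triples.size(); }
    public Iterable<String> find() { return triples; }
}

/** Hypothetical stand-in for a GraphMem2-style graph that opts in to Copyable. */
class FastGraph extends PlainGraph implements Copyable<FastGraph> {
    FastGraph() { }

    /** Copy constructor: bulk-copies the internal store. */
    private FastGraph(FastGraph other) {
        triples.addAll(other.triples);
    }

    @Override
    public FastGraph copy() {
        return new FastGraph(this);
    }
}

/** Hypothetical stand-in for the utility class G. */
final class GraphCopies {
    private GraphCopies() { }

    static MiniGraph copy(MiniGraph original) {
        // Prefer the graph's own copy() if it opted in to Copyable.
        if (original instanceof Copyable<?>) {
            Object copy = ((Copyable<?>) original).copy();
            if (copy instanceof MiniGraph) {
                return (MiniGraph) copy;
            }
        }
        // Fallback: find-and-add into a fresh graph, one triple at a time.
        MiniGraph dst = new PlainGraph();
        for (String triple : original.find()) {
            dst.add(triple);
        }
        return dst;
    }
}
```

Callers only ever use `GraphCopies.copy(graph)`. Graphs that implement `Copyable` get the fast path, and everything else keeps today's behaviour, which is what makes the approach opt-in.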
> If I want to use fast cloning today, I need to maintain duplicates of the source code of the collections and the graphs in my project. --> Do you see the extension as beneficial for the Jena project? --> Otherwise, I would also be happy if someone had an idea for a less invasive implementation.
Yes, there's definitely value here.
Generalising it and making it opt-in may make it less invasive and allow us to adopt it over time in other places as appropriate.
For reference, see G.copyGraphSrcToDst(), which is basically equivalent to the copy code in your description.
Thanks for your advice!
I added the Copyable<T> interface in org.apache.jena.atlas.lib. That seemed to be the appropriate location, since the Closeable interface, for example, is also located there.
G#copy is also available.