kuzu
kuzu copied to clipboard
GDS and Recursive Joins TODOs
This issue is intended to keep track of the suite of optimizations, refactoring, enhancements, experimentations, and documentations we plan to implement/write in both the new fast recursive join implementations and the upcoming Kuzu GDS functions.
Recursive Joins TODOs
- [x] Refactor: Implement getParameterTypeIDs() in each rjFunction separately in rec_joins.h. (Xiyang)
- [x] Enhancement: Currently paths return only the intermediate node IDs. Also return edgeIDs. 2 main things need to be done on the backend: (i) ParentIterAndNextPtr* need to change; (ii) My suggestion is to change SingleSPPathsEdgeCompute to also use ParentIterAndNextPtr. The SinglePaths data structure can be thrown away. See my comment in SinglePaths::setParentEdge. (Xiyang)
- [ ] Enhancement: Integrate predicates to recursive joins, e.g.,
(a)-[r:* | WHERE r.since < 2022 AND n.age > 45]->(b)
. (Xiyang) - [x] Enhancement: Make recursive joins work on undirected recursive relationship patterns (e.g.,
(a)-[r:*]-(b)
) as well as when scanning the graph in backwards direction. Currently we only work for scanning the graph in the forward direction. When implementing the undirected patterns, we may need to change the ParentIterAndNextPtr to keep track of the directions of the edges. (Xiyang) - [ ] Optimization: Add early stopping to shortest paths computation when there are few destinations and have reached all.
- [x] Optimization: Fix the isMasked(offset, offset) to have a fast path. We currently use masks in Frontier implementations but we need to check a single location. So we need a fast function here. This should be very quick. (Xiyang).
- [x] Optimization: Currently when writing outputs in SingleSPOutputWriterPaths, we call the
singlePaths->getParent(curIntNode)
function, which usesnodeID_t
and internall does a lookup on the tableID here:parentEdges.at(nodeID.tableID).get()->buffer.data());
. For queries that contain nodes from only a single node table, we can avoid this by "fixing" the node table in SinglePaths. When writing parents (so during the actual recursive join comptuation and not when outputting the paths), we have a similar optimization where we call theSinglePaths::fixNodeTable
to avoid any table lookups in theparentEdges
field. We can have a similar optimization when writing paths if there is a single node table anyways. This may or may not matter since the map will be very small but it can still make a difference. (Xiyang) - [ ] Optimizations: Push down acyclic, trail, and walk constraints to the enumeration so that we don't blindly enumerate and write all paths to FactorizedTable first. Also see if anything smart can be done for variable-length joins during the path computation if we are finding acyclic or trail paths (e.g., can we avoid visiting the same node twice).
- [ ] Enhancement: Implement weighted shortest paths.
- [ ] Enhancement: Currently the
Mask
data structures that are used in the semijoin masks as well as to pass in the possible source and destination node IDs to recursive join algorithms is using OS memory. We should make it use the MemoryManager. (Xiyang) - [ ] Change the LENGTH field to be UINT16_t instead of UINT64_t. This might be minor but leads to unnecessary writes to value vectors as well as FactorizedTable's. (Xiyang)
- [x] Optimization: Reimplement header caching-based fast scans in on_disk_graph. (Ben)
- [x] Optimization: Graph::scan functions should return iterators instead of copying the data from the buffer manager to a value vector. This may requires keeping pages in the buffer manager pinned for longer time but it should be fine as we do very quick work on each edge in edgeCompute() functions. (Ben).
- [ ] Optimization: RJOutputWriters currently write one path at a time to the FactorizedTable. This corresponds to tuple-at-a-time processing. I am guessing this is an important bottleneck. Instead, we should buffer the writes to the Factorized Table. (Ben)
- [ ] OnDiskGraph random scan optimization: Ben also had another optimization he had added earlier to avoid scanning the entire CSR header in randomAccess mode. It apparently should only improve performance if you scanned relatively few offsets from a node group, given the header caching optimization.
- [ ] Optimization: We use virtual functions all over the EdgeCompute and VertexCompute and RJOutputWriter and Frontiers etc. implementations. We should consider templating these and benchmarking the benefits of these on large graphs. (Ben)
- [ ] Optimization: Make VertexCompute run on entire morsels and EdgeCompute run on entire neighbors (to minimize function calls between GDSUtils and these virtual function calls). (Semih)
- [ ] Optimization: Bidirectional and direction-optimizing joins. Bidirectional joins is the optimization for cases when there is a single source and a single destination and implements the heuristic to take a step from the smaller frontier at each step. This can be used both for shortest as well as variable length joins (and will likely have an annoying backtracking algorithm). (Semih)
- [ ] Optimization: We currently materialize the outputs of each recursive join. Ideally, GDSCall (and RJAlgorithm) should either be part of a larger pipeline including the next pipeline P_{next} or take that pipeline as field and instead of calling
GDSUtils::runVertexComputeIteration()
by passing anRJOutputWriterVC
, it should schedule the P_{next} in the task scheduler and P_{next} should be configured to scan the outputs of the RecJoin operator (for this we will need new scan operators that implement the backtracking logic that is currently implemented in theRJOutputWriterVC
VertexCompute functions. (Xiyang) - [ ] Experiment/Optimization: We should make the ALL_PATHS_BLOCK_SIZE configurable and evaluate its performance on recursive joins on large graphs.
- [ ] Optimization: Make the scan of rels be sequential and batched when possible. Rel table scan can take the advantage of bound nodes being sequential or adjacent to cache scanned rels from storage. The idea is to make full use of the output ValueVectors to cache at most 2048 rels when scanning from disk, and output rels for each bound node by changing SelectionVector over the output ValueVectors. Note that the caching is avoided when the input bound node is a single one, which is currently how the access is done in OnDiskGraph, as there is usually no benefit from caching for a single node. The change should be when possible, change the input nodeIDVector to be filled with a sequence of nodeIDs and populate the SelectionVector according to the caller (e.g., whether the nodeID is active or not). (Ben)
- [ ] Optimization: we should keep few elements in
ParentList
depends on the configuration of recursive join. E.g. direction is not needed if edges are all in the same direction.
GDS TODOs
- [ ] Refactor: Reimplement PageRank and Weakly Connected Components to adopt the Ligra API. (Semih/Xiyang)
- [ ] Enhancement: Make GDS algorithms extensions, so the community can develop their own GDS algorithms and share. Some of the core GDS algorithms, such as PageRank and connected components, can be at the core, but others can also be extensions. Ideally, we should develop all GDS algorithms as extensions but have the option to put some of them into our core binaries. (Ziyi and discuss with Chang)
- [ ] Documentation: Write documentation about how the community can implement their own GDS algorithms.
- [ ] Enhancement: Implement Strongly Connected Components, Betweenness Centrality, 1 popular node embedding algorithm, 1 popular community detection algorithm (Semih, Xiyang, Interns)
- [ ] Documentation: Write documentation and tutorials about both how the GDS algorithms work as well as separate documentation for each GDS algorithm we implement. We should try to come up with a template for these algorithms and suggest the community to use our documentation template.
- [ ] Enhancement: Make progress bar work for pipelines working on GDS Algorithms. (Mattias)