duckpgq-extension
Implement PathFindingOperator
This issue tracks progress on the PathFindingOperator.
Work is in progress at https://github.com/cwida/duckpgq-extension/tree/pathfindingoperator and https://github.com/cwida/duckdb-pgq/tree/pathfindingoperator (make sure to be on the correct branch in both repositories).
The idea is to create a path-finding operator with two sinks, similar to the IEJoin. We insert it in this function, in place of the iterativelength() UDF. The binding phase only generates a logical query plan, so a physical path-finding operator cannot be inserted at that point yet; what we need to create here are the two sinks. One side is the (src, dst) pairs (the tasks) and the other side is the CSR. Importantly, the CSR side does not use the CREATE_CSR_EDGE() UDF, because that work will be done in one of the sinks of the new operator.
This can include optimizations such as https://github.com/cwida/duckpgq-extension/issues/23
Plan for now:
- Get the CSR as the first sink to this new operator.
- Get the (src,dst)-pairs as the second sink.
- Implement the path-finding algorithm.
- Look into how to parallelize.
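The first three steps of the plan can be sketched in a few lines. This is only an illustration in Python, not the extension's C++ implementation: the names build_csr and shortest_path_length are hypothetical, vertex ids are assumed to be dense (0..n-1, like the rowid-based dense ids in the query further down), and a plain unweighted BFS stands in for whatever path-finding algorithm the operator ends up using.

```python
from collections import deque


def build_csr(num_vertices, edges):
    """Build a CSR (offsets, targets) from (src, dst) edges over dense ids.

    Hypothetical helper: stands in for the work done by the CSR sink.
    """
    # Count outgoing edges per vertex, shifted by one for the prefix sum.
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1
    # Prefix sum: offsets[v] is where the neighbours of v start.
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]
    # Scatter each destination into the next free slot of its source.
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def shortest_path_length(offsets, targets, src, dst):
    """Unweighted BFS over the CSR; returns the hop count or None."""
    if src == dst:
        return 0
    dist = {src: 0}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        for i in range(offsets[v], offsets[v + 1]):
            w = targets[i]
            if w not in dist:
                dist[w] = dist[v] + 1
                if w == dst:
                    return dist[w]
                queue.append(w)
    return None


# First sink: the CSR. Second sink: the (src, dst) pairs (tasks).
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
offsets, targets = build_csr(5, edges)
pairs = [(0, 3), (3, 0), (0, 0)]
lengths = [shortest_path_length(offsets, targets, s, d) for s, d in pairs]
print(lengths)  # [2, None, 0]
```

Each pair is an independent task over a read-only CSR, which is one reason the pairs side looks like a natural unit for parallelization.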
Potential optimizations:
- Duplicate (src,dst)-pairs: execute each distinct pair only once, and later expand the results again to cover every original row.
- Same src appearing many times: collapse them into one src, then fully explore the graph from it once.
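Both optimizations amount to grouping the tasks by source before searching. A minimal sketch, assuming a plain adjacency dict rather than the CSR and using hypothetical names (bfs_distances, batched_lengths); it also returns the number of searches so the saving is visible:

```python
from collections import defaultdict, deque


def bfs_distances(adjacency, src):
    """Fully explore the graph from src; returns {vertex: hop count}."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        for w in adjacency.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def batched_lengths(adjacency, pairs):
    """Run one search per distinct src, then expand to the original rows."""
    # Collapse duplicates: each src maps to its set of distinct dsts.
    wanted = defaultdict(set)
    for src, dst in pairs:
        wanted[src].add(dst)
    # One full exploration per distinct source.
    unique = {}
    for src, dsts in wanted.items():
        dist = bfs_distances(adjacency, src)
        for dst in dsts:
            unique[(src, dst)] = dist.get(dst)  # None if unreachable
    # "Blow up" again: one output row per input row, in input order.
    rows = [(src, dst, unique[(src, dst)]) for src, dst in pairs]
    return rows, len(wanted)


adjacency = {0: [1, 2], 1: [2], 2: [3], 3: []}
pairs = [(0, 3), (0, 3), (0, 1), (2, 3)]
rows, searches = batched_lengths(adjacency, pairs)
print(rows)      # [(0, 3, 2), (0, 3, 2), (0, 1, 1), (2, 3, 1)]
print(searches)  # 2
```

Four input rows are answered with two searches here. Whether a full exploration per source pays off will depend on how many destinations share that source.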
An example query for what we have for now (initial idea):
SELECT *
FROM pairs AS p
WHERE p.src BETWEEN (
    SELECT csr_id
    FROM (
        SELECT
            0 AS csr_id,
            (SELECT count(a.id) FROM Student a),
            CAST((
                SELECT sum(CREATE_CSR_VERTEX(
                    0,
                    (SELECT count(a.id) FROM Student a),
                    sub.dense_id,
                    sub.cnt))
                FROM (
                    SELECT a.rowid AS dense_id, count(k.src) AS cnt
                    FROM Student a
                    LEFT JOIN Knows k ON k.src = a.id
                    GROUP BY a.rowid
                ) sub
            ) AS BIGINT),
            a.rowid,
            c.rowid,
            k.rowid
        FROM Knows k
        JOIN Student a ON a.id = k.src
        JOIN Student c ON c.id = k.dst
    )
) AND p.dst;
TODO: Figure out how to include the lower and upper bounds in the query.
In the current implementation, the following query first computes all shortest paths and only then filters out the pairs:
FROM GRAPH_TABLE (pg
    MATCH p = ANY SHORTEST (a:Person)-[k:Knows]->{2,3}(b:Person)
    WHERE (a.id, b.id) IN (SELECT (src, dst) FROM pairs)
    COLUMNS (a.id AS id1, b.id AS id2, element_id(p))
) tmp
ORDER BY tmp.id1, tmp.id2;
Ideally, it should first apply the filter on the pairs and only then run the shortest-path function. This could become an optimization rule if we can detect this pattern.