Optimization: Avoid hash join when scanning rel table
Description
Using this database
kuzu> EXPLAIN MATCH (p1:nodes)-[k:edges]->(p2:nodes) RETURN p1.id;
┌───────────────────────────────┐
│┌─────────────────────────────┐│
││ Physical Plan ││
│└─────────────────────────────┘│
└───────────────────────────────┘
┌───────────────────────────────┐
│ RESULT_COLLECTOR[3] │
│ ------------------------- │
│ Expressions: p1.id │
│ ------------------------- │
│ NumOutputTuples: 0 │
│ ------------------------- │
│ ExecutionTime: 0.000000 │
└───────────────┬───────────────┘
┌───────────────┴───────────────┐
│ PROJECTION[2] │
│ ------------------------- │
│ Expressions: p1.id │
│ ------------------------- │
│ NumOutputTuples: 0 │
│ ------------------------- │
│ ExecutionTime: 0.000000 │
└───────────────┬───────────────┘
┌───────────────┴───────────────┐
│ SCAN_REL_TABLE[1] │
│ ------------------------- │
│ Tables: edges │
│ Alias: k │
│ Direction: (p1)-[k]->(p2) │
│ ------------------------- │
│ NumOutputTuples: 0 │
│ ------------------------- │
│ ExecutionTime: 0.000000 │
└───────────────┬───────────────┘
┌───────────────┴───────────────┐
│ SCAN_NODE_TABLE[0] │
│ ------------------------- │
│ Tables: nodes │
│ Alias: p1 │
│ Properties: p1.id │
│ ------------------------- │
│ NumOutputTuples: 0 │
│ ------------------------- │
│ ExecutionTime: 0.000000 │
└───────────────────────────────┘
kuzu> EXPLAIN MATCH (p1:nodes)-[k:edges]->(p2:nodes) RETURN p1.id, p2.id;
┌───────────────────────────────┐
│┌─────────────────────────────┐│
││ Physical Plan ││
│└─────────────────────────────┘│
└───────────────────────────────┘
┌───────────────────────────────┐
│ RESULT_COLLECTOR[6] │
│ ------------------------- │
│ Expressions: p1.id │
│ p2.id │
│ ------------------------- │
│ NumOutputTuples: 0 │
│ ------------------------- │
│ ExecutionTime: 0.000000 │
└───────────────┬───────────────┘
┌───────────────┴───────────────┐
│ PROJECTION[5] │
│ ------------------------- │
│ Expressions: p1.id │
│ p2.id │
│ ------------------------- │
│ NumOutputTuples: 0 │
│ ------------------------- │
│ ExecutionTime: 0.000000 │
└───────────────┬───────────────┘
┌───────────────┴───────────────┐
│ HASH_JOIN_PROBE[4] │
│ ------------------------- │
│ Keys: p2._ID │
│ ------------------------- │─────────────────┐
│ NumOutputTuples: 0 │ │
│ ------------------------- │ │
│ ExecutionTime: 0.000000 │ │
└───────────────┬───────────────┘ │
┌───────────────┴───────────────┐ ┌───────────────┴───────────────┐
│ SCAN_REL_TABLE[1] │ │ HASH_JOIN_BUILD[3] │
│ ------------------------- │ │ ------------------------- │
│ Tables: edges │ │ Keys: p2._ID │
│ Alias: k │ │ Payloads: p2.id │
│ Direction: (p1)-[k]->(p2) │ │ ------------------------- │
│ ------------------------- │ │ NumOutputTuples: 0 │
│ NumOutputTuples: 0 │ │ ------------------------- │
│ ------------------------- │ │ ExecutionTime: 0.000000 │
│ ExecutionTime: 0.000000 │ │ │
└───────────────┬───────────────┘ └───────────────┬───────────────┘
┌───────────────┴───────────────┐ ┌───────────────┴───────────────┐
│ SCAN_NODE_TABLE[0] │ │ SCAN_NODE_TABLE[2] │
│ ------------------------- │ │ ------------------------- │
│ Tables: nodes │ │ Tables: nodes │
│ Alias: p1 │ │ Alias: p2 │
│ Properties: p1.id │ │ Properties: p2.id │
│ ------------------------- │ │ ------------------------- │
│ NumOutputTuples: 0 │ │ NumOutputTuples: 0 │
│ ------------------------- │ │ ------------------------- │
│ ExecutionTime: 0.000000 │ │ ExecutionTime: 0.000000 │
└───────────────────────────────┘ └───────────────────────────────┘
It's not clear why the plan changes to a hash join when p2.id is added to the return value. My understanding based on #5743 is that the fwd/bwd neighbor id is p2.id and is available without having to access to access the node table.
I read through some of the code. Looks like REL tables contain only internalID_t. Probably by design.
Logic is likely that people rarely want p2.id alone. They want some attribute of p2 as well, which forces a join anyway.
But a side-effect of this design decision is that if someone is using kuzu as a triple store, they're going to pay the cost of a hash join even on a simple scan of the rel table for the purpose of running a graph algorithm.
Using internal ids is not an option:
kuzu> EXPLAIN MATCH (p1:nodes)-[k:edges]->(p2:nodes) RETURN p1._ID, p2._ID;
Error: Binder exception: _ID is reserved for system usage. External access is not allowed.