kuzu icon indicating copy to clipboard operation
kuzu copied to clipboard

Optimization: Avoid hash join when scanning rel table

Open adsharma opened this issue 7 months ago • 1 comments

Description

Using this database

kuzu> EXPLAIN MATCH (p1:nodes)-[k:edges]->(p2:nodes) RETURN p1.id;
┌───────────────────────────────┐
│┌─────────────────────────────┐│
││        Physical Plan        ││
│└─────────────────────────────┘│
└───────────────────────────────┘
┌───────────────────────────────┐
│      RESULT_COLLECTOR[3]      │
│   -------------------------   │
│      Expressions: p1.id       │
│   -------------------------   │
│      NumOutputTuples: 0       │
│   -------------------------   │
│    ExecutionTime: 0.000000    │
└───────────────┬───────────────┘
┌───────────────┴───────────────┐
│         PROJECTION[2]         │
│   -------------------------   │
│      Expressions: p1.id       │
│   -------------------------   │
│      NumOutputTuples: 0       │
│   -------------------------   │
│    ExecutionTime: 0.000000    │
└───────────────┬───────────────┘
┌───────────────┴───────────────┐
│       SCAN_REL_TABLE[1]       │
│   -------------------------   │
│         Tables: edges         │
│           Alias: k            │
│   Direction: (p1)-[k]->(p2)   │
│   -------------------------   │
│      NumOutputTuples: 0       │
│   -------------------------   │
│    ExecutionTime: 0.000000    │
└───────────────┬───────────────┘
┌───────────────┴───────────────┐
│      SCAN_NODE_TABLE[0]       │
│   -------------------------   │
│         Tables: nodes         │
│           Alias: p1           │
│       Properties: p1.id       │
│   -------------------------   │
│      NumOutputTuples: 0       │
│   -------------------------   │
│    ExecutionTime: 0.000000    │
└───────────────────────────────┘

kuzu> EXPLAIN MATCH (p1:nodes)-[k:edges]->(p2:nodes) RETURN p1.id, p2.id;
┌───────────────────────────────┐
│┌─────────────────────────────┐│
││        Physical Plan        ││
│└─────────────────────────────┘│
└───────────────────────────────┘
┌───────────────────────────────┐
│      RESULT_COLLECTOR[6]      │
│   -------------------------   │
│      Expressions: p1.id       │
│             p2.id             │
│   -------------------------   │
│      NumOutputTuples: 0       │
│   -------------------------   │
│    ExecutionTime: 0.000000    │
└───────────────┬───────────────┘
┌───────────────┴───────────────┐
│         PROJECTION[5]         │
│   -------------------------   │
│      Expressions: p1.id       │
│             p2.id             │
│   -------------------------   │
│      NumOutputTuples: 0       │
│   -------------------------   │
│    ExecutionTime: 0.000000    │
└───────────────┬───────────────┘
┌───────────────┴───────────────┐
│      HASH_JOIN_PROBE[4]       │
│   -------------------------   │
│         Keys: p2._ID          │
│   -------------------------   │─────────────────┐
│      NumOutputTuples: 0       │                 │
│   -------------------------   │                 │
│    ExecutionTime: 0.000000    │                 │
└───────────────┬───────────────┘                 │
┌───────────────┴───────────────┐ ┌───────────────┴───────────────┐
│       SCAN_REL_TABLE[1]       │ │      HASH_JOIN_BUILD[3]       │
│   -------------------------   │ │   -------------------------   │
│         Tables: edges         │ │         Keys: p2._ID          │
│           Alias: k            │ │        Payloads: p2.id        │
│   Direction: (p1)-[k]->(p2)   │ │   -------------------------   │
│   -------------------------   │ │      NumOutputTuples: 0       │
│      NumOutputTuples: 0       │ │   -------------------------   │
│   -------------------------   │ │    ExecutionTime: 0.000000    │
│    ExecutionTime: 0.000000    │ │                               │
└───────────────┬───────────────┘ └───────────────┬───────────────┘
┌───────────────┴───────────────┐ ┌───────────────┴───────────────┐
│      SCAN_NODE_TABLE[0]       │ │      SCAN_NODE_TABLE[2]       │
│   -------------------------   │ │   -------------------------   │
│         Tables: nodes         │ │         Tables: nodes         │
│           Alias: p1           │ │           Alias: p2           │
│       Properties: p1.id       │ │       Properties: p2.id       │
│   -------------------------   │ │   -------------------------   │
│      NumOutputTuples: 0       │ │      NumOutputTuples: 0       │
│   -------------------------   │ │   -------------------------   │
│    ExecutionTime: 0.000000    │ │    ExecutionTime: 0.000000    │
└───────────────────────────────┘ └───────────────────────────────┘

It's not clear why the plan changes to a hash join when p2.id is added to the return value. My understanding based on #5743 is that the fwd/bwd neighbor id is p2.id and is available without having to access to access the node table.

adsharma avatar Sep 22 '25 21:09 adsharma

I read through some of the code. Looks like REL tables contain only internalID_t. Probably by design. Logic is likely that people rarely want p2.id alone. They want some attribute of p2 as well, which forces a join anyway.

But a side-effect of this design decision is that if someone is using kuzu as a triple store, they're going to pay the cost of a hash join even on a simple scan of the rel table for the purpose of running a graph algorithm.

Using internal ids is not an option:

kuzu> EXPLAIN MATCH (p1:nodes)-[k:edges]->(p2:nodes) RETURN p1._ID, p2._ID;
Error: Binder exception: _ID is reserved for system usage. External access is not allowed.

adsharma avatar Sep 23 '25 00:09 adsharma