GFQL named matcher collision causes runtime KeyError
Summary
GFQL currently assumes matcher names are unique per entity type. When two node matchers (or two edge matchers) reuse the same name=..., execution reaches combine_steps() where the merged DataFrame no longer has the original column name (pandas/cuDF suffixes duplicates). The code then attempts out_df[op._name] and crashes with a raw KeyError. The same happens when the graph already has a column with that name before the matcher runs.
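The suffixing behavior described above can be reproduced with pandas alone. The sketch below is not the combine_steps() code; the two frames are made-up stand-ins for the per-step results of two matchers that share a name, and it only assumes pandas' default merge suffixes.

```python
import pandas as pd

# Stand-ins for two per-step results that both tag rows with a 'dup' column.
left = pd.DataFrame({'node': [0, 1], 'dup': [True, False]})
right = pd.DataFrame({'node': [0, 1], 'dup': [False, True]})

# Both frames carry a non-key 'dup' column, so the merge disambiguates
# them with the default suffixes instead of keeping the original name.
merged = left.merge(right, on='node', how='left')
print(list(merged.columns))  # ['node', 'dup_x', 'dup_y']

# The original name no longer exists, so a plain lookup fails.
try:
    merged['dup']
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)  # True
```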
Minimal Repros
Run with PYTHONPATH=. so the local tree is used.
```python
import pandas as pd
from graphistry.tests.test_compute import CGFull
from graphistry import n, e_forward

g = (
    CGFull()
    .nodes(pd.DataFrame({'node': [0, 1, 2, 3]}))
    .edges(pd.DataFrame({'source': [0, 1, 2], 'target': [1, 2, 3]}))
    .bind(node='node', source='source', destination='target')
)

# 1. Node/node duplicate -> KeyError
chain_node_node = [
    n({'node': 0}, name='dup'),
    e_forward(),
    n({'node': 1}, name='dup'),
]
g.gfql(chain_node_node)

# 2. Edge/edge duplicate -> KeyError
chain_edge_edge = [
    n({'node': 0}),
    e_forward(name='dup'),
    n({'node': 1}),
    e_forward(name='dup'),
    n({'node': 2}),
]
g.gfql(chain_edge_edge)

# 3. Existing column + matcher name -> KeyError
chain_existing = [
    n({'node': 0}),
    e_forward(),
    n({'node': 1}, name='node'),  # graph already has a 'node' column
]
g.gfql(chain_existing)

# FYI node vs edge reuse is fine
chain_node_edge = [
    n({'node': 0}, name='dup'),
    e_forward(name='dup'),
    n({'node': 1}),
]
res = g.gfql(chain_node_edge)
print(res._nodes.columns)  # ['node', 'dup']
print(res._edges.columns)  # ['dup', 'source', 'target']
```
Stack trace excerpt:
```
  File ".../graphistry/compute/chain.py", line 221, in combine_steps
    s = out_df[op._name]
KeyError: 'dup'
```
Expected vs Actual
- Expected: Either duplicate names are rejected up front with a descriptive error (and docs specify uniqueness), or the engine resolves conflicts via a policy (auto-suffix, overwrite, etc.) so execution succeeds.
- Actual: Execution proceeds until pandas/cuDF renames the duplicate columns (dup_x, dup_y), after which combine_steps() raises an unhandled KeyError.
Suggested Fixes
- Validate earlier: Detect duplicate matcher names per entity type (and collisions with existing columns) before executing, raising a GFQLSchemaError with actionable guidance.
- Introduce a conflict policy: Support error/overwrite/suffix behaviors so users can pick how duplicates resolve, while keeping runtime stable.
- At minimum: Catch the collision and re-raise with a helpful error instead of a raw KeyError.
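As a rough illustration of the "validate earlier" option, the check could look something like the sketch below. It is deliberately independent of graphistry: check_matcher_names, MatcherNameError, and the (kind, name) input shape are all hypothetical, and a real fix would read names off the chain's operations and raise the project's own error type.

```python
from collections import Counter
from typing import Iterable, List, Tuple


class MatcherNameError(ValueError):
    """Hypothetical stand-in for the GFQLSchemaError suggested above."""


def check_matcher_names(
    steps: Iterable[Tuple[str, str]],
    node_columns: Iterable[str],
    edge_columns: Iterable[str],
) -> None:
    """Raise if a name repeats within one entity kind or shadows a column.

    steps holds (kind, name) pairs with kind 'node' or 'edge';
    unnamed matchers are simply left out.
    """
    existing = {'node': set(node_columns), 'edge': set(edge_columns)}
    problems: List[str] = []
    for (kind, name), count in sorted(Counter(steps).items()):
        if count > 1:
            problems.append(f"{kind} matcher name {name!r} is used {count} times")
        if name in existing[kind]:
            problems.append(f"{kind} matcher name {name!r} collides with an existing {kind} column")
    if problems:
        raise MatcherNameError("; ".join(problems) + "; pick unique names")


def fails(steps: List[Tuple[str, str]]) -> bool:
    """Helper for the demo: True if validation rejects the steps."""
    try:
        check_matcher_names(steps, ['node'], ['source', 'target'])
    except MatcherNameError:
        return True
    return False


print(fails([('node', 'dup'), ('edge', 'dup')]))  # False: node vs edge reuse is fine
print(fails([('node', 'dup'), ('node', 'dup')]))  # True: node/node duplicate
print(fails([('node', 'node')]))                  # True: shadows existing column
```

Counting per (kind, name) pair keeps the node-vs-edge reuse case from the repro legal while rejecting the three failing cases before any merge runs.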
Handling the pre-existing column case is important as well: users often reuse common column names like node or flag. Without a guard, even a single named matcher can hit the same KeyError if the graph already has that column.
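For the conflict-policy option, one possible shape is a small resolver that decides which column name a matcher may write to. This is only a sketch: resolve_name and its policy strings are hypothetical, mirror the error/overwrite/suffix idea above, and are not an existing GFQL API.

```python
from typing import Iterable


def resolve_name(name: str, taken: Iterable[str], policy: str = 'error') -> str:
    """Return the column name to write under the chosen policy (hypothetical)."""
    taken = set(taken)
    if name not in taken:
        return name  # no conflict, use the name as given
    if policy == 'error':
        raise ValueError(f"matcher name {name!r} already exists as a column")
    if policy == 'overwrite':
        return name  # caller is expected to replace the existing column
    if policy == 'suffix':
        # Find the first numbered variant that is still free.
        i = 1
        while f"{name}_{i}" in taken:
            i += 1
        return f"{name}_{i}"
    raise ValueError(f"unknown policy {policy!r}")


print(resolve_name('flag', ['node'], policy='error'))            # flag
print(resolve_name('node', ['node', 'node_1'], policy='suffix'))  # node_2
```

Defaulting to 'error' keeps the behavior explicit; the other two policies are opt-in so that existing columns are never silently changed.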