GFQL named matcher collision causes runtime KeyError

Open lmeyerov opened this issue 2 months ago • 0 comments

Summary

GFQL currently assumes matcher names are unique per entity type. When two node matchers (or two edge matchers) reuse the same name=..., execution reaches combine_steps(), where the merge has produced two same-named columns and pandas/cuDF has renamed them with suffixes (dup_x, dup_y), so the merged DataFrame no longer has a column under the original name. The code then looks up out_df[op._name] and crashes with a raw KeyError. The same failure occurs when the graph already has a column with that name before the matcher runs.
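The suffixing behavior can be reproduced in isolation with plain pandas. This is a minimal sketch, not the actual combine_steps() code: the frames `left` and `right` are made-up stand-ins for per-step results, and the assumption is only that a merge on the id column happens somewhere along the way.

```python
import pandas as pd

# Hypothetical stand-ins for what two matchers named 'dup' might each produce.
left = pd.DataFrame({'node': [0, 1], 'dup': [True, False]})
right = pd.DataFrame({'node': [0, 1], 'dup': [False, True]})

# Merging on the id column leaves two non-key 'dup' columns, so pandas
# disambiguates them with its default suffixes ('_x', '_y').
merged = pd.merge(left, right, on='node', how='left')
print(list(merged.columns))  # ['node', 'dup_x', 'dup_y']

# The original name is gone, so a plain lookup fails.
try:
    merged['dup']
except KeyError as exc:
    print('KeyError:', exc)  # KeyError: 'dup'
```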

Minimal Repros

Run with PYTHONPATH=. so the local tree is used.

import pandas as pd
from graphistry.tests.test_compute import CGFull
from graphistry import n, e_forward

g = (
    CGFull()
    .nodes(pd.DataFrame({'node': [0, 1, 2, 3]}))
    .edges(pd.DataFrame({'source': [0, 1, 2], 'target': [1, 2, 3]}))
    .bind(node='node', source='source', destination='target')
)

# 1. Node/node duplicate -> KeyError (each repro raises, so run them one at a time)
chain_node_node = [
    n({'node': 0}, name='dup'),
    e_forward(),
    n({'node': 1}, name='dup'),
]
g.gfql(chain_node_node)

# 2. Edge/edge duplicate -> KeyError
chain_edge_edge = [
    n({'node': 0}),
    e_forward(name='dup'),
    n({'node': 1}),
    e_forward(name='dup'),
    n({'node': 2}),
]
g.gfql(chain_edge_edge)

# 3. Existing column + matcher name -> KeyError
chain_existing = [
    n({'node': 0}),
    e_forward(),
    n({'node': 1}, name='node'),  # graph already has a "node" column
]
g.gfql(chain_existing)

# FYI node vs edge reuse is fine
chain_node_edge = [
    n({'node': 0}, name='dup'),
    e_forward(name='dup'),
    n({'node': 1}),
]
res = g.gfql(chain_node_edge)
print(res._nodes.columns)  # ['node', 'dup']
print(res._edges.columns)  # ['dup', 'source', 'target']

Stack trace excerpt:

File ".../graphistry/compute/chain.py", line 221, in combine_steps
    s = out_df[op._name]
KeyError: 'dup'

Expected vs Actual

  • Expected: Either duplicate names are rejected up front with a descriptive error (and docs specify uniqueness), or the engine resolves conflicts via a policy (auto-suffix, overwrite, etc.) so execution succeeds.
  • Actual: Execution proceeds until pandas/cuDF renames the duplicate columns (dup_x, dup_y), after which combine_steps() raises an unhandled KeyError.

Suggested Fixes

  1. Validate earlier: Detect duplicate matcher names per entity type (and collisions with existing columns) before executing, raising a GFQLSchemaError with actionable guidance.
  2. Introduce a conflict policy: Support error / overwrite / suffix behaviors so users can pick how duplicates resolve, while keeping runtime stable.
  3. At minimum: Catch the collision and re-raise with a helpful error instead of a raw KeyError.
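The up-front validation in fix 1 could look roughly like the sketch below. Everything in it is hypothetical: `check_matcher_names`, `MatcherNameCollisionError`, and the `(kind, name)` tuple representation are illustrative and are not part of the graphistry API; a real implementation would read the names off the chain operations and raise the proposed GFQLSchemaError.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


class MatcherNameCollisionError(ValueError):
    """Hypothetical error type standing in for the proposed GFQLSchemaError."""


def check_matcher_names(
    steps: Iterable[Tuple[str, Optional[str]]],
    node_columns: Iterable[str] = (),
    edge_columns: Iterable[str] = (),
) -> None:
    """Reject duplicate matcher names per entity type before execution.

    steps: (kind, name) pairs, kind being 'node' or 'edge'; name may be None.
    """
    existing = {'node': set(node_columns), 'edge': set(edge_columns)}
    counts = Counter((kind, name) for kind, name in steps if name is not None)
    problems = []
    for (kind, name), count in sorted(counts.items()):
        if count > 1:
            problems.append(f"{kind} matcher name {name!r} is used {count} times")
        if name in existing[kind]:
            problems.append(
                f"{kind} matcher name {name!r} collides with an existing {kind} column"
            )
    if problems:
        raise MatcherNameCollisionError('; '.join(problems))


# Node vs edge reuse is allowed, mirroring the "FYI" repro.
check_matcher_names(
    [('node', 'dup'), ('edge', 'dup'), ('node', None)],
    node_columns=['node'],
    edge_columns=['source', 'target'],
)

# Node/node reuse is rejected with a descriptive message.
try:
    check_matcher_names([('node', 'dup'), ('edge', None), ('node', 'dup')])
except MatcherNameCollisionError as exc:
    print(exc)  # node matcher name 'dup' is used 2 times
```

Collecting all problems before raising lets a user fix every collision in one pass instead of discovering them one run at a time.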

Handling the pre-existing column case is important as well: users often reuse common column names like node or flag. Without a guard, even a single named matcher can fail if the graph already has that column.
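For the pre-existing column case, a conflict policy could be as small as a name-resolution helper. The sketch below is an assumption about one possible shape, not existing GFQL behavior: `resolve_column_name` and its `policy` argument are hypothetical, with policy values taken from the error / overwrite / suffix options suggested above.

```python
from typing import Set


def resolve_column_name(name: str, taken: Set[str], policy: str = 'error') -> str:
    """Pick the output column for a named matcher (hypothetical helper)."""
    if name not in taken:
        return name
    if policy == 'error':
        raise ValueError(f"matcher name {name!r} collides with an existing column")
    if policy == 'overwrite':
        # Caller replaces the existing column with the matcher's tag column.
        return name
    if policy == 'suffix':
        # Find the first free numbered variant: name_1, name_2, ...
        i = 1
        while f"{name}_{i}" in taken:
            i += 1
        return f"{name}_{i}"
    raise ValueError(f"unknown policy {policy!r}")


taken = {'node', 'flag', 'flag_1'}
print(resolve_column_name('dup', taken))                # dup
print(resolve_column_name('node', taken, 'overwrite'))  # node
print(resolve_column_name('flag', taken, 'suffix'))     # flag_2
```

Defaulting to 'error' keeps the failure loud and early, while the other two policies are opt-in.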

lmeyerov, Oct 20 '25 00:10