pygraphistry GFQL path support - path naming, path tracking, and path-sensitive predicates

Description

Add native path support to GFQL for path naming, tracking, and path-sensitive predicates.

Motivation

Currently, GFQL doesn't return nested path objects like Cypher's MATCH p = ... RETURN p. Users must manually tag steps (e.g., name='p', path_id) to simulate path tracking. Native path support would make GFQL more expressive for path-centric queries.

Proposed Solution

When a query has path-sensitive subexpressions, the wavefront execution can:

Collect targeted subpaths as extra columns during traversal
Support path-sensitive predicates that operate on these collected paths
Enable path naming/binding similar to other graph query languages

Example Use Cases

# Desired: Native path support
g.gfql([
    n({"id": "Alice"}), 
    e_forward(path="p"),  # Track this path
    n({"id": "Bob"})
]).paths("p")  # Return the paths

# Current workaround: Manual tagging
g.gfql([
    n({"id": "Alice", "p": True}), 
    e_forward(name="p"),
    n({"id": "Bob", "p": True})
])

Priority

P3 - Low priority enhancement

Research Phase 1 Complete ✅

I've completed initial research on path variable support for cross-node predicates. Summary:

Problem Clarification

The issue is about cross-node predicates in multi-hop patterns, not path objects:

-- Want: Compare properties across different nodes in the pattern
MATCH (n1)-[e1]->(n2)-[e2]->(n3) WHERE n1.x == n3.x

Currently impossible in GFQL because wavefront execution only tracks current node IDs, not full path state.

Root Cause

GFQL's wavefront model:

Each hop step operates on a wavefront (set of node IDs)
Only current wavefront accessible to subsequent operations
Previous nodes lost - no way to reference them in predicates

Proposed Solution: Path-Aware Wavefront

Extend wavefront DataFrame to track growing path expressions as columns:

# Step 1: Match n1
wavefront = [{'path_id': 0, 'n1': 'A', 'n1_x': 10}]

# Step 2: Hop to n2
wavefront = [{'path_id': 0, 'n1': 'A', 'n1_x': 10, 'e1': 1, 'n2': 'D'}]

# Step 3: Hop to n3, filter n1_x == n3_x
wavefront = [{'path_id': 0, 'n1': 'A', 'n1_x': 10, 'e1': 1, 'n2': 'D', 'e2': 2, 'n3': 'F', 'n3_x': 10}]
wavefront = wavefront.query('n1_x == n3_x')  # Cross-variable predicate

API Design (Proposed)

g.gfql([
    n({'type': 'person'}, var='n1'),      # Declare variable
    e_forward(var='e1'),
    n(var='n2'),
    e_forward(var='e2'),
    n(var='n3', query='@n1_x == @n3_x')  # Cross-variable predicate
])

Prototype Results

Created working proof-of-concept: docs/research/path_wavefront_prototype.py

$ python docs/research/path_wavefront_prototype.py
Pattern: (n1)-[e1]->(n2)-[e2]->(n3) WHERE n1.x == n3.x
Final Result: 3 paths satisfy predicate
✅ Memory overhead: -11% (more efficient than expected!)

Critical Architecture Constraint

GFQL uses a 3-pass execution model (forward→backward→forward):

Forward: Optimistically build prefix tries
Backward: Prune dead-end paths
Forward: Execute with pruned search space

Path tracking must integrate with this model while preserving:

target_wave_front pruning semantics
Path ID stability across passes
Declarative execution guarantees

Research Documents

Branch: research/gfql-paths

gfql-path-support-summary.md - Executive summary
gfql-path-variables.md - Detailed technical analysis
path_wavefront_prototype.py - Working demo

Next Steps

Design Review - Gather feedback on proposed approach
API Finalization - Decide on variable naming syntax
Implementation Planning - Break into phases:
- Phase 1: PathContext class
- Phase 2: hop() integration
- Phase 3: Predicate syntax
- Phase 4: Optimization

Open Questions

Variable naming: Explicit var='n1' or implicit position-based?
Property selection: Auto-detect or explicit declaration?
Predicate timing: Apply in first forward pass or only final pass?
Return value: Path-aware Plottable or new Path object?

Status: ✅ Research complete, ready for design review

Oct 19 '25 08:10 lmeyerov

Research Update: Test-Driven Validation ✅

Per feedback, replaced toy prototype with rigorous test suites designed to break naive implementations.

Test Suites Created

1. Functional Tests (path_tracking_tests.py)

9 tests covering basic functionality
All tests passing (9/9)
Cross-node equality, path explosion, multiple predicates, etc.

2. Adversarial Tests (path_adversarial_tests.py)

7 tests targeting specific failure modes
All tests passing (7/7) - meaning they correctly IDENTIFY pitfalls
Designed to expose implementation bugs

Critical Findings from Adversarial Testing

1. Multi-Edge Path Counting

Test Result: Without edge-based IDs, 10 wrong paths counted (59% error rate)

Graphs have parallel edges:

A -[e1:type_a]-> B
A -[e2:type_b]-> B

Requirement: Path IDs MUST use edge IDs, not (src, dst) tuples.

2. Property Column Swapping

Test Result: Confusing n1.x with n3.x produces 7 wrong matches

Requirement: Variable properties MUST be scoped: n1_x, n2_x, n3_x (not reusing x).

3. Path ID Collision After Pruning

Problem: 3-pass execution (forward→backward→forward) must preserve path IDs

Requirement: IDs assigned in first forward pass MUST NOT be reassigned after backward pruning.

4. Predicate Timing

Test Result: Can't evaluate n1.x == n3.x until n3 is bound

Pattern: (n1)-[]->(n2)-[]->(n3) WHERE n1.x == n3.x

After step 2: n3 not yet available ❌
After step 3: NOW can evaluate ✓

Requirement: Predicates MUST only evaluate when ALL referenced variables are bound.

5. Cartesian Explosion

Test Result: Without incremental filtering, 100% wasted work in some queries

Requirement: Filter at EACH step as soon as predicates can be evaluated, not at end.

6. Column Name Collisions

Problem: User's graph has column n1_x or even __gfql_path_id__

Requirement: Use reserved namespace like __gfql_var_n1_x__.

7. Type Confusion

Problem: Comparing n1.x=10 (int) with n3.x="10" (string)

Requirement: Enforce type consistency or follow pandas semantics.

Implementation Requirements Checklist

Before claiming implementation works:

[ ] Multi-edge graphs produce correct path counts
[ ] User columns don't collide with internal columns
[ ] Path IDs stable across 3-pass execution
[ ] Predicates only evaluate when variables available
[ ] Incremental filtering prevents explosion
[ ] Variable property columns clearly scoped
[ ] Type mismatches handled consistently

Files Added

docs/research/
├── gfql-path-variables.md              # Technical analysis
├── gfql-path-support-summary.md        # Executive summary
├── path_wavefront_prototype.py         # Simple demo (for concepts)
├── path_tracking_tests.py              # 9 functional tests ✓
├── path_adversarial_tests.py           # 7 adversarial tests ✓
└── CRITICAL_REQUIREMENTS.md            # Implementation requirements

Test Execution

# Run functional tests
$ python docs/research/path_tracking_tests.py
Passed: 9/9 ✓

# Run adversarial tests
$ python docs/research/path_adversarial_tests.py
All 7 pitfalls correctly identified ✓

Key Lesson

Simple prototypes lie. Real validation requires adversarial testing with:

Multi-edge graphs
Column name collisions
Type mismatches
Cartesian explosion
3-pass execution complexity

Next Steps

Review critical requirements
Decide if complexity is worth the feature
If yes: Implement guided by test suite
If no: Document limitations and close

Research Status: ✅ VALIDATED - Real tests expose real challenges

Oct 19 '25 08:10 lmeyerov

⚠️ CRITICAL: False Positive Detection Results

Added test suite specifically targeting correctness - cases where naive implementations return WRONG results.

False Positive Rates (Devastating)

Test	Naive Error Rate	Description
Dead-End Chains	48%	Partial matches that don't complete: 12/25 paths wrong
Loop Detection	33%	Self-loops when pattern says n1 != n3: 10/30 wrong
Edge Direction	50%	Undirected when should be directed: 30/60 wrong
Inequality Ops	59%	Using <= instead of <: 13/22 paths wrong
Multi-Field AND	88%	Using OR instead of AND: 23/26 paths WRONG!
Edge Order	100%	Comparing e2.w < e1.w instead of e1.w < e2.w: Disjoint wrong set
Long Chains (5-hop)	74%	Losing constraint tracking: 154/209 paths wrong

Example: Multi-Field AND Catastrophe (88% Error!)

Pattern: WHERE n1.x == n3.x AND n1.y < n3.y

Correct (both conditions): 3 paths ✓
Naive bug (using OR):      26 paths ✗
FALSE POSITIVES:           23 paths (88% wrong!)

Wrong results:

A -> C: y less but x NOT match (10 != 20) ✗
B -> B: x match but y NOT less (2 >= 2) ✗

Example: Long Chain Disaster (74% Error!)

Pattern: 5-hop with n1.x == n6.x

Total 5-hop paths:         209
Correct matches:           55 ✓
Without constraint check:  209 ✗
FALSE POSITIVES:           154 (74% wrong!)

Test Suite Summary

Total: 23 Tests

9 functional tests (basic features work)
7 adversarial tests (edge cases handled)
7 false positive tests (correctness validated)

All 23 MUST pass before claiming implementation works.

# Run all tests
python docs/research/path_tracking_tests.py        # 9/9 ✓
python docs/research/path_adversarial_tests.py      # 7/7 ✓
python docs/research/test_false_positives.py        # 7/7 ✓

Critical Implementation Requirements

Check ALL predicates - Not just some (dead-end trap: 48% error)
Respect edge direction - Forward vs reverse (50% error if undirected)
Use exact operators - < not <= (59% error with wrong op)
AND not OR - For conjunctions (88% error with OR!)
Track through chains - Maintain constraints (74% error at 5 hops)
Detect loops - When n1 != n3 (33% error accepting loops)
Edge order matters - e1.w < e2.w not reversed (100% wrong)

Bottom Line

This feature is a correctness minefield. The tests show that naive implementations will return 33-88% WRONG results depending on the mistake.

Any implementation must pass all 23 tests with zero false positives.

Oct 19 '25 08:10 lmeyerov

GFQL path support - path naming, path tracking, and path-sensitive predicates

Description

Motivation

Proposed Solution

Example Use Cases

Priority

Related

Research Phase 1 Complete ✅

Problem Clarification

Root Cause

Proposed Solution: Path-Aware Wavefront

API Design (Proposed)

Prototype Results

Critical Architecture Constraint

Research Documents

Next Steps

Open Questions

Research Update: Test-Driven Validation ✅

Test Suites Created

Critical Findings from Adversarial Testing

1. Multi-Edge Path Counting

2. Property Column Swapping

3. Path ID Collision After Pruning

4. Predicate Timing

5. Cartesian Explosion

6. Column Name Collisions

7. Type Confusion

Implementation Requirements Checklist

Files Added

Test Execution

Key Lesson

Next Steps

⚠️ CRITICAL: False Positive Detection Results

False Positive Rates (Devastating)

Example: Multi-Field AND Catastrophe (88% Error!)

Example: Long Chain Disaster (74% Error!)

Test Suite Summary

Critical Implementation Requirements

Bottom Line