GFQL path support - path naming, path tracking, and path-sensitive predicates
Description
Add native path support to GFQL for path naming, tracking, and path-sensitive predicates.
Motivation
Currently, GFQL doesn't return nested path objects like Cypher's MATCH p = ... RETURN p. Users must manually tag steps (e.g., name='p', path_id) to simulate path tracking. Native path support would make GFQL more expressive for path-centric queries.
Proposed Solution
When a query has path-sensitive subexpressions, the wavefront execution can:
- Collect targeted subpaths as extra columns during traversal
- Support path-sensitive predicates that operate on these collected paths
- Enable path naming/binding similar to other graph query languages
Example Use Cases
# Desired: Native path support
g.gfql([
n({"id": "Alice"}),
e_forward(path="p"), # Track this path
n({"id": "Bob"})
]).paths("p") # Return the paths
# Current workaround: Manual tagging
g.gfql([
n({"id": "Alice", "p": True}),
e_forward(name="p"),
n({"id": "Bob", "p": True})
])
Priority
P3 - Low priority enhancement
Related
- Feature comparison table shows this as "Partial" support currently
- Would bring GFQL closer to Cypher/GSQL path handling capabilities
Research Phase 1 Complete ✅
I've completed initial research on path variable support for cross-node predicates. Summary:
Problem Clarification
The issue is about cross-node predicates in multi-hop patterns, not path objects:
-- Want: Compare properties across different nodes in the pattern
MATCH (n1)-[e1]->(n2)-[e2]->(n3) WHERE n1.x == n3.x
Currently impossible in GFQL because wavefront execution only tracks current node IDs, not full path state.
Root Cause
GFQL's wavefront model:
- Each hop step operates on a wavefront (set of node IDs)
- Only current wavefront accessible to subsequent operations
- Previous nodes lost - no way to reference them in predicates
Proposed Solution: Path-Aware Wavefront
Extend wavefront DataFrame to track growing path expressions as columns:
# Step 1: Match n1
wavefront = [{'path_id': 0, 'n1': 'A', 'n1_x': 10}]
# Step 2: Hop to n2
wavefront = [{'path_id': 0, 'n1': 'A', 'n1_x': 10, 'e1': 1, 'n2': 'D'}]
# Step 3: Hop to n3, filter n1_x == n3_x
wavefront = [{'path_id': 0, 'n1': 'A', 'n1_x': 10, 'e1': 1, 'n2': 'D', 'e2': 2, 'n3': 'F', 'n3_x': 10}]
wavefront = wavefront.query('n1_x == n3_x') # Cross-variable predicate
API Design (Proposed)
g.gfql([
n({'type': 'person'}, var='n1'), # Declare variable
e_forward(var='e1'),
n(var='n2'),
e_forward(var='e2'),
n(var='n3', query='@n1_x == @n3_x') # Cross-variable predicate
])
Prototype Results
Created working proof-of-concept: docs/research/path_wavefront_prototype.py
$ python docs/research/path_wavefront_prototype.py
Pattern: (n1)-[e1]->(n2)-[e2]->(n3) WHERE n1.x == n3.x
Final Result: 3 paths satisfy predicate
✅ Memory overhead: -11% (more efficient than expected!)
Critical Architecture Constraint
GFQL uses a 3-pass execution model (forward→backward→forward):
- Forward: Optimistically build prefix tries
- Backward: Prune dead-end paths
- Forward: Execute with pruned search space
Path tracking must integrate with this model while preserving:
target_wave_frontpruning semantics- Path ID stability across passes
- Declarative execution guarantees
Research Documents
Branch: research/gfql-paths
- gfql-path-support-summary.md - Executive summary
- gfql-path-variables.md - Detailed technical analysis
- path_wavefront_prototype.py - Working demo
Next Steps
- Design Review - Gather feedback on proposed approach
- API Finalization - Decide on variable naming syntax
- Implementation Planning - Break into phases:
- Phase 1: PathContext class
- Phase 2: hop() integration
- Phase 3: Predicate syntax
- Phase 4: Optimization
Open Questions
- Variable naming: Explicit
var='n1'or implicit position-based? - Property selection: Auto-detect or explicit declaration?
- Predicate timing: Apply in first forward pass or only final pass?
- Return value: Path-aware Plottable or new Path object?
Status: ✅ Research complete, ready for design review
Research Update: Test-Driven Validation ✅
Per feedback, replaced toy prototype with rigorous test suites designed to break naive implementations.
Test Suites Created
1. Functional Tests (path_tracking_tests.py)
- 9 tests covering basic functionality
- All tests passing (9/9)
- Cross-node equality, path explosion, multiple predicates, etc.
2. Adversarial Tests (path_adversarial_tests.py)
- 7 tests targeting specific failure modes
- All tests passing (7/7) - meaning they correctly IDENTIFY pitfalls
- Designed to expose implementation bugs
Critical Findings from Adversarial Testing
1. Multi-Edge Path Counting
Test Result: Without edge-based IDs, 10 wrong paths counted (59% error rate)
Graphs have parallel edges:
A -[e1:type_a]-> B
A -[e2:type_b]-> B
Requirement: Path IDs MUST use edge IDs, not (src, dst) tuples.
2. Property Column Swapping
Test Result: Confusing n1.x with n3.x produces 7 wrong matches
Requirement: Variable properties MUST be scoped: n1_x, n2_x, n3_x (not reusing x).
3. Path ID Collision After Pruning
Problem: 3-pass execution (forward→backward→forward) must preserve path IDs
Requirement: IDs assigned in first forward pass MUST NOT be reassigned after backward pruning.
4. Predicate Timing
Test Result: Can't evaluate n1.x == n3.x until n3 is bound
Pattern: (n1)-[]->(n2)-[]->(n3) WHERE n1.x == n3.x
- After step 2: n3 not yet available ❌
- After step 3: NOW can evaluate ✓
Requirement: Predicates MUST only evaluate when ALL referenced variables are bound.
5. Cartesian Explosion
Test Result: Without incremental filtering, 100% wasted work in some queries
Requirement: Filter at EACH step as soon as predicates can be evaluated, not at end.
6. Column Name Collisions
Problem: User's graph has column n1_x or even __gfql_path_id__
Requirement: Use reserved namespace like __gfql_var_n1_x__.
7. Type Confusion
Problem: Comparing n1.x=10 (int) with n3.x="10" (string)
Requirement: Enforce type consistency or follow pandas semantics.
Implementation Requirements Checklist
Before claiming implementation works:
- [ ] Multi-edge graphs produce correct path counts
- [ ] User columns don't collide with internal columns
- [ ] Path IDs stable across 3-pass execution
- [ ] Predicates only evaluate when variables available
- [ ] Incremental filtering prevents explosion
- [ ] Variable property columns clearly scoped
- [ ] Type mismatches handled consistently
Files Added
docs/research/
├── gfql-path-variables.md # Technical analysis
├── gfql-path-support-summary.md # Executive summary
├── path_wavefront_prototype.py # Simple demo (for concepts)
├── path_tracking_tests.py # 9 functional tests ✓
├── path_adversarial_tests.py # 7 adversarial tests ✓
└── CRITICAL_REQUIREMENTS.md # Implementation requirements
Test Execution
# Run functional tests
$ python docs/research/path_tracking_tests.py
Passed: 9/9 ✓
# Run adversarial tests
$ python docs/research/path_adversarial_tests.py
All 7 pitfalls correctly identified ✓
Key Lesson
Simple prototypes lie. Real validation requires adversarial testing with:
- Multi-edge graphs
- Column name collisions
- Type mismatches
- Cartesian explosion
- 3-pass execution complexity
Next Steps
- Review critical requirements
- Decide if complexity is worth the feature
- If yes: Implement guided by test suite
- If no: Document limitations and close
Research Status: ✅ VALIDATED - Real tests expose real challenges
⚠️ CRITICAL: False Positive Detection Results
Added test suite specifically targeting correctness - cases where naive implementations return WRONG results.
False Positive Rates (Devastating)
| Test | Naive Error Rate | Description |
|---|---|---|
| Dead-End Chains | 48% | Partial matches that don't complete: 12/25 paths wrong |
| Loop Detection | 33% | Self-loops when pattern says n1 != n3: 10/30 wrong |
| Edge Direction | 50% | Undirected when should be directed: 30/60 wrong |
| Inequality Ops | 59% | Using <= instead of <: 13/22 paths wrong |
| Multi-Field AND | 88% | Using OR instead of AND: 23/26 paths WRONG! |
| Edge Order | 100% | Comparing e2.w < e1.w instead of e1.w < e2.w: Disjoint wrong set |
| Long Chains (5-hop) | 74% | Losing constraint tracking: 154/209 paths wrong |
Example: Multi-Field AND Catastrophe (88% Error!)
Pattern: WHERE n1.x == n3.x AND n1.y < n3.y
Correct (both conditions): 3 paths ✓
Naive bug (using OR): 26 paths ✗
FALSE POSITIVES: 23 paths (88% wrong!)
Wrong results:
A -> C: y less but x NOT match (10 != 20)✗B -> B: x match but y NOT less (2 >= 2)✗
Example: Long Chain Disaster (74% Error!)
Pattern: 5-hop with n1.x == n6.x
Total 5-hop paths: 209
Correct matches: 55 ✓
Without constraint check: 209 ✗
FALSE POSITIVES: 154 (74% wrong!)
Test Suite Summary
Total: 23 Tests
- 9 functional tests (basic features work)
- 7 adversarial tests (edge cases handled)
- 7 false positive tests (correctness validated)
All 23 MUST pass before claiming implementation works.
# Run all tests
python docs/research/path_tracking_tests.py # 9/9 ✓
python docs/research/path_adversarial_tests.py # 7/7 ✓
python docs/research/test_false_positives.py # 7/7 ✓
Critical Implementation Requirements
- Check ALL predicates - Not just some (dead-end trap: 48% error)
- Respect edge direction - Forward vs reverse (50% error if undirected)
- Use exact operators - < not <= (59% error with wrong op)
- AND not OR - For conjunctions (88% error with OR!)
- Track through chains - Maintain constraints (74% error at 5 hops)
- Detect loops - When n1 != n3 (33% error accepting loops)
- Edge order matters - e1.w < e2.w not reversed (100% wrong)
Bottom Line
This feature is a correctness minefield. The tests show that naive implementations will return 33-88% WRONG results depending on the mistake.
Any implementation must pass all 23 tests with zero false positives.