Research: GFQL Query Language Interoperability Strategy

Open lmeyerov opened this issue 2 months ago • 0 comments

Objective

Research and plan a strategy for GFQL interoperability with other graph query languages, with a focus on enabling future embedded implementations (e.g., Rust port).

Background

GFQL currently operates as a standalone graph query language embedded in Python. To maximize adoption and enable future architectural evolution (embedded Rust runtime, cross-language support), we should evaluate interoperability with established graph query standards and languages.

Query Languages to Evaluate

1. Cypher (Neo4j, Memgraph, Amazon Neptune)

  • Text-based: Parse Cypher strings and translate to GFQL AST
  • BOLT protocol: Native binary protocol support for Neo4j
  • Questions:
    • Translation fidelity: What Cypher features map cleanly to GFQL?
    • Protocol choice: Text parsing vs BOLT wire protocol?
    • Performance implications: Client-side translation vs server-side?
    • Bidirectional: GFQL → Cypher compilation for remote execution?

2. GQL (ISO/IEC 39075 Standard)

  • Status: International standard for graph query, published as ISO/IEC 39075:2024; vendor support is still emerging
  • Adoption: TigerGraph and other vendors
  • Questions:
    • Syntax overlap with Cypher?
    • Standard compliance benefits?
    • Feature gaps vs GFQL?

3. GSQL (TigerGraph)

  • Characteristics: Procedural, pattern-matching focused
  • Questions:
    • Semantic alignment with GFQL's functional composition?
    • Translation complexity?
    • Value proposition for TigerGraph users?

4. Gremlin (Apache TinkerPop, Amazon Neptune, Azure Cosmos DB)

  • Characteristics: Imperative traversal language
  • Integration: Already have Gremlin connector (graphistry.from_gremlin())
  • Questions:
    • Current connector limitations?
    • Gremlin → GFQL AST translation?
    • Bytecode support?

Research Questions

Translation Architecture

Option A: Text → GFQL AST

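# Parse external query language strings (from_cypher is the proposed API)
import graphistry

cypher_query = "MATCH (n:Person)-[:KNOWS]->(m) RETURN n, m"
gfql_ast = graphistry.from_cypher(cypher_query)
g.gfql(gfql_ast)  # g is an existing graph, e.g. built with graphistry.edges(...)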
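
To make Option A concrete, here is a minimal sketch of a text → AST translator for the single pattern used above. It is an illustration under stated assumptions, not a proposed implementation: the dict shapes are hypothetical (not the actual GFQL wire format), the names `cypher_to_ast` and `UntranslatableQuery` are made up, and the regex handles only one-hop forward patterns. Mapping a Cypher label to a `type` filter mirrors the `n({'type': 'Person'})` convention used in Option C.

```python
import re

# Hypothetical, simplified AST shapes; NOT the actual GFQL wire format.
# Supports only: MATCH (a[:Label])-[r[:TYPE]]->(b[:Label]) RETURN ...
_ONE_HOP = re.compile(
    r"""^\s*MATCH\s*
        \(\s*(?P<src>\w+)?\s*(?::(?P<src_label>\w+))?\s*\)
        \s*-\[\s*\w*\s*(?::(?P<rel>\w+))?\s*\]->\s*
        \(\s*(?P<dst>\w+)?\s*(?::(?P<dst_label>\w+))?\s*\)
        \s*RETURN\b""",
    re.IGNORECASE | re.VERBOSE,
)


class UntranslatableQuery(ValueError):
    """Raised when a query falls outside the supported subset."""


def _step(kind, label, **extra):
    # A label or relationship type becomes a {'type': ...} filter.
    return {"type": kind, "filter": {"type": label} if label else {}, **extra}


def cypher_to_ast(query: str) -> list:
    """Translate a one-hop forward MATCH into a list of AST steps."""
    m = _ONE_HOP.match(query)
    if m is None:
        raise UntranslatableQuery(f"unsupported Cypher pattern: {query!r}")
    return [
        _step("Node", m["src_label"]),
        _step("Edge", m["rel"], direction="forward"),
        _step("Node", m["dst_label"]),
    ]
```

A real translator would need a proper grammar rather than a regex; the sketch only shows the shape of the boundary (string in, JSON-friendly steps out, explicit failure for anything unsupported).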

Option B: Wire Protocol Native

# Use native binary protocols (e.g., BOLT for Cypher)
g.bolt_query("MATCH (n:Person)-[:KNOWS]->(m) RETURN n, m")

Option C: Bidirectional Compilation

# GFQL → External language for remote execution (to_cypher is the proposed API)
import graphistry
from graphistry import n, e_forward

gfql_query = [n({'type': 'Person'}), e_forward({'type': 'KNOWS'}), n()]
cypher_string = graphistry.to_cypher(gfql_query)
# Execute on remote Neo4j server
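
The outbound direction in Option C can be sketched the same way. The example below compiles a list of hypothetical step dicts (an assumed shape, not the real GFQL AST) into a Cypher string. The function name `ast_to_cypher`, the generated variable names `n0`, `n1`, ..., and the restriction to `type` filters are all assumptions made for illustration.

```python
# Hypothetical step shape: {"type": "Node" | "Edge", "filter": {...}, "direction": ...}
_ARROWS = {
    "forward": ("-", "->"),
    "reverse": ("<-", "-"),
    "undirected": ("-", "-"),
}


def ast_to_cypher(steps: list) -> str:
    """Compile node/edge steps into a single MATCH ... RETURN string."""
    parts, names = [], []
    for step in steps:
        filt = step.get("filter", {})
        unsupported = set(filt) - {"type"}
        if unsupported:
            # Only the 'type' filter is mapped in this sketch.
            raise ValueError(f"cannot compile filters: {sorted(unsupported)}")
        suffix = f":{filt['type']}" if "type" in filt else ""
        if step["type"] == "Node":
            name = f"n{len(names)}"
            names.append(name)
            parts.append(f"({name}{suffix})")
        elif step["type"] == "Edge":
            left, right = _ARROWS[step["direction"]]
            parts.append(f"{left}[{suffix}]{right}")
        else:
            raise ValueError(f"unknown step type: {step['type']!r}")
    return f"MATCH {''.join(parts)} RETURN {', '.join(names)}"
```

Note that the sketch interpolates labels directly into the query text. A real compiler would have to validate or escape identifiers and pass property values as query parameters.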

Embedded Runtime Considerations

Key Question: How do we design interop to support future embedded GFQL runtime (Rust, WebAssembly, etc.)?

Constraints:

  • Translation layer should be thin: avoid heavy Python dependencies
  • AST representation should be serializable (already JSON-capable)
  • Wire protocols should be language-agnostic
  • Parser/compiler infrastructure should be portable
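
As one way to picture the serializable-AST constraint, the sketch below round-trips a small query through JSON using only the standard library. The `Step` dataclass, the `version` field, and the function names are hypothetical and chosen for illustration; they are not the existing GFQL JSON format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional

SCHEMA_VERSION = 1  # assumed versioning scheme, for illustration only


@dataclass
class Step:
    kind: str                            # "node" or "edge"
    filter: dict = field(default_factory=dict)
    direction: Optional[str] = None      # only meaningful for edges


def dumps_ast(steps: list) -> str:
    """Serialize steps to a language-neutral JSON document."""
    doc = {"version": SCHEMA_VERSION, "steps": [asdict(s) for s in steps]}
    return json.dumps(doc, sort_keys=True)


def loads_ast(payload: str) -> list:
    """Parse a JSON document back into Step objects, checking the version."""
    doc = json.loads(payload)
    if doc.get("version") != SCHEMA_VERSION:
        raise ValueError(f"unsupported AST version: {doc.get('version')!r}")
    return [Step(**raw) for raw in doc["steps"]]
```

Because the payload is plain JSON with an explicit version, a non-Python runtime could consume it without sharing any code with the Python side.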

Potential Architecture:

┌─────────────────────────────────────────────────┐
│         Application Layer (Python/JS/etc)       │
├─────────────────────────────────────────────────┤
│      Query Language Parsers (Cypher, etc)       │
│              ↓ (generates)                      │
│         GFQL AST (JSON-serializable)            │
├─────────────────────────────────────────────────┤
│      GFQL Runtime (Rust/WASM - embedded)        │
│   - AST execution                               │
│   - DataFrame operations                        │
│   - Optimization                                │
├─────────────────────────────────────────────────┤
│         DataFrame Backends                      │
│   pandas | polars | arrow | duckdb              │
└─────────────────────────────────────────────────┘

Specific Design Questions

  1. Parser Strategy:

    • Build parsers in Python (existing ecosystem) or Rust (performance, portability)?
    • Use existing parser libraries (e.g., pyparsing in Python, pest in Rust)?
    • What's the maintenance burden for multiple language grammars?
  2. Semantic Mapping:

    • Which features don't translate cleanly?
    • How to handle semantic mismatches (e.g., Cypher's OPTIONAL MATCH vs GFQL)?
    • Error reporting for untranslatable queries?
  3. Performance:

    • Client-side translation overhead?
    • Should we support remote execution (push query to server)?
    • Caching/memoization of translated queries?
  4. Standard Compliance:

    • Should GFQL target ISO GQL compliance?
    • Cypher has the openCypher specification; should we align with it?
    • Trade-offs of standard conformance vs GFQL's unique features?
  5. Rust Port Priorities:

    • Core AST execution first, or parsers first?
    • Which DataFrame backend for embedded Rust? (Arrow, Polars?)
    • WebAssembly target for browser-based GFQL?

Deliverables

Phase 1: Research (2-3 weeks)

  • [ ] Survey existing translation tools (e.g., openCypher parsers)
  • [ ] Document semantic mapping tables for each language
  • [ ] Identify feature gaps and untranslatable patterns
  • [ ] Benchmark translation performance overhead
  • [ ] Prototype: Simple Cypher → GFQL translator for common patterns
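
For the benchmarking item, a small harness along these lines may be enough to get first numbers. It is a sketch using the standard-library `timeit` module; `measure_overhead` and its parameters are hypothetical names, and `translate` stands for whichever translator prototype is under test.

```python
import timeit
from typing import Callable, Sequence


def measure_overhead(
    translate: Callable[[str], object],
    queries: Sequence[str],
    repeat: int = 5,
    number: int = 1000,
) -> float:
    """Return the best-of-`repeat` mean time, in seconds, per translate call."""

    def run_all() -> None:
        for query in queries:
            translate(query)

    # timeit.repeat returns one total per repetition; the minimum is the
    # least noisy estimate of the achievable cost.
    best_total = min(timeit.repeat(run_all, repeat=repeat, number=number))
    return best_total / (number * len(queries))
```

Comparing this per-call figure against the execution time of the resulting query would indicate whether translation overhead is worth optimizing or caching at all.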

Phase 2: Architecture Design (1-2 weeks)

  • [ ] Define translation layer architecture
  • [ ] Design AST schema extensions (if needed)
  • [ ] Plan for embedded runtime (Rust port strategy)
  • [ ] Identify parser library candidates (Python + Rust)
  • [ ] Define interop API surface

Phase 3: Prototype (3-4 weeks)

  • [ ] Implement basic Cypher → GFQL translator
  • [ ] Test with real-world Cypher queries
  • [ ] Document translation fidelity
  • [ ] Evaluate performance
  • [ ] Gather user feedback

Phase 4: Embedded Runtime Exploration (Future)

  • [ ] Rust AST execution prototype
  • [ ] DataFrame backend selection (Polars?)
  • [ ] WASM compilation feasibility
  • [ ] Performance benchmarks vs Python

Success Criteria

  • Clear understanding of translation fidelity for each language
  • Documented semantic mapping tables
  • Proof-of-concept translator for at least one language (Cypher)
  • Architecture design that supports future Rust port
  • Performance benchmarks showing acceptable overhead

Related Issues

  • #722 - GFQL path support (Cypher has native path syntax)
  • #755 - Mark mode (related to traversal semantics)
  • #700 - Auto-generate JSON Schema for GFQL wire protocol
  • #651 - GFQL remote predicates fail with 'id' column
  • #696 - Multi-label node matching predicates

Open Questions

  1. Should we prioritize inbound translation (Cypher → GFQL) or outbound (GFQL → Cypher for remote execution)?
  2. Is BOLT protocol support worth the complexity vs text-based Cypher parsing?
  3. Should we target GQL standard compliance as a strategic goal?
  4. What's the timeline for Rust port? Should we design for it now or incrementally?
  5. Should embedded runtime use existing DataFrame libraries (Polars) or custom IR?

Next Steps

  1. Create feature comparison matrix (GFQL vs Cypher/GQL/GSQL/Gremlin)
  2. Document semantic equivalence mappings
  3. Survey existing parser tools
  4. Prototype Cypher → GFQL translator for common patterns
  5. Define roadmap based on findings

Priority: P2 - Strategic direction for GFQL evolution
Estimated Effort: 6-10 weeks for full research + prototype (Phases 1-3 as scoped above sum to 6-9 weeks)

lmeyerov, Oct 19 '25 07:10