pygraphistry icon indicating copy to clipboard operation
pygraphistry copied to clipboard

Review validation design consistency across GFQL operations

Open lmeyerov opened this issue 2 months ago • 1 comments

Review validation design consistency across GFQL operations

Summary

There's an inconsistency in how validation is controlled across different GFQL execution paths:

  1. AST construction has validate=True parameter (checks structure + internal column patterns)
  2. Direct method calls (g.get_degrees(), g.filter_nodes_by_dict()) ALWAYS validate with no opt-out
  3. GFQL execution (.gfql()) has no validation control parameter

Current Behavior

AST Objects - Controllable Validation ✅

# Can disable validation
n({'__gfql_internal__': 1}, validate=False)  # No error

# Default validation enabled
n({'__gfql_internal__': 1})  # ValueError

Direct Method Calls - Always Validate ❌

# No way to disable validation
g.get_degrees(col='__gfql_degree__')  # Always raises ValueError
g.filter_nodes_by_dict({'__gfql_col__': 1})  # Always raises ValueError

GFQL Call Operations - Always Validate ❌

# Validation happens inside the call, not controllable
g.gfql([
    call('get_degrees', {'col': '__gfql_degree__'})  # Always raises ValueError
])

Design Question

Should validation be controllable consistently across all execution paths?

Option A: Add validate parameter everywhere

# Direct calls
g.get_degrees(col='degree', validate=True)  # default
g.filter_nodes_by_dict(filter_dict, validate=True)  # default

# GFQL execution - inherit from top level
g.gfql([n(), call('get_degrees')], validate=True)  # default, flows down

# AST construction - already exists
n({'col': 1}, validate=True)  # already works

Pros:

  • Consistent API across all execution paths
  • Users can opt out for advanced use cases
  • Top-level gfql() validation flows down to all operations

Cons:

  • Breaking API change (new parameters)
  • More complex - validation state must flow through execution context
  • call() operations shouldn't allow individual validate override (should inherit from gfql)

Option B: Keep current behavior (always validate at runtime)

# AST construction - controllable (parse-time validation)
n({'col': 1}, validate=False)  # Can disable

# Runtime operations - always validate (runtime validation)  
g.get_degrees()  # Always validates
g.filter_nodes_by_dict(...)  # Always validates

Pros:

  • Simpler - no parameter threading needed
  • Clear separation: parse-time vs runtime validation
  • No breaking changes

Cons:

  • Inconsistent API (validate parameter only on AST objects)
  • No way for advanced users to bypass runtime validation
  • gfql() can't control validation of operations it calls

Option C: Document as two validation layers

# Layer 1: Parse-time validation (AST structure + patterns)
ast_obj = n({'__gfql_col__': 1}, validate=False)  # controllable

# Layer 2: Runtime validation (schema checks, always on)
g.get_degrees()  # always validates
g.filter_nodes_by_dict(...)  # always validates
g.gfql([ast_obj])  # re-validates at runtime regardless of AST validate flag

Pros:

  • No breaking changes
  • Clear separation of concerns
  • Simpler implementation

Cons:

  • Still inconsistent
  • Can't disable runtime validation

Affected Code

Files with validation:

  • graphistry/compute/ComputeMixin.py - get_degrees(), get_indegrees(), get_outdegrees(), get_topological_levels()
  • graphistry/compute/filter_by_dict.py - filter_nodes_by_dict(), filter_edges_by_dict()
  • graphistry/compute/ast.py - AST _validate_fields() methods

Validation functions:

  • graphistry/compute/gfql/identifiers.py:
    • validate_column_name() - for output parameters
    • validate_column_references() - for filter dict keys

Related Context

  • PR #788 - Added internal column validation for __gfql_*__ pattern
  • These validations prevent filtering on temporary internal columns
  • Internal columns won't work with gfql_remote() (serialization issue)

Questions

  1. Should validation be consistently controllable across all execution paths?
  2. If yes, should call() operations inherit validation from parent .gfql() call?
  3. Should we distinguish between parse-time validation (AST structure) and runtime validation (schema checks)?
  4. Is there a use case for disabling runtime validation?

Recommendation

Probably Option C (document two layers) unless there's a strong use case for disabling runtime validation. The current behavior is safe and prevents confusing errors downstream.

If we go with Option A, the design should be:

  • gfql(query, validate=True) - controls validation for entire execution
  • Validation flag flows through ExecutionContext
  • Individual operations inherit from context (no per-operation override)
  • call() operations respect inherited validation (can't override)

cc: @lmeyerov

lmeyerov avatar Oct 14 '25 04:10 lmeyerov

basically we should allow both:

  • checked runtime execution
  • query-construction time validation, allowing fail-fast program validation + fast runtime

lmeyerov avatar Oct 14 '25 04:10 lmeyerov