Feature: Unified GFQL API with modality:runtime engine specification and policy hooks
PyGraphistry Feature Request: Unified GFQL API with Modality:Runtime Engine Specification
Summary
Unify .gfql() and .gfql_remote() into a single .gfql() API using compound engine parameter syntax (engine="<modality>:<runtime>"), with policy hooks for custom execution strategies.
Motivation
Current API Limitations
Two separate APIs for similar operations:
# Local execution
g = g.gfql([call('hypergraph', {...})])
# Remote execution
g = g.gfql_remote([call('hypergraph', {...})], engine='auto')
Problems:
- API fragmentation: Users must choose between .gfql() and .gfql_remote() upfront
- No hybrid strategies: Can't easily switch between local/remote based on data size
- Testing complexity: Hard to test both paths without code duplication
- Limited composability: Can't express "use local if small, remote if large"
Real-World Use Case
In GraphistryGPT, we want to:
- Run small hypergraphs locally (< 0.5s, avoid network overhead)
- Run large hypergraphs remotely (GPU acceleration)
- Let users override with an explicit engine parameter for testing/debugging
Current workaround requires:
- Duplicate code paths for local vs remote
- Manual threshold logic: should_use_local_hypergraph() (sketched below)
- Separate test mocking strategies for each path
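For context, the current workaround looks roughly like the sketch below. This is a simplified, illustrative reconstruction: only should_use_local_hypergraph() and the rows × entity_cols² < 100,000 threshold come from our runner.py; the wrapper around it is hypothetical.
# Simplified reconstruction of today's duplicated routing (illustrative only)
def should_use_local_hypergraph(df, entity_cols) -> bool:
    # Complexity heuristic: rows x entity_cols^2 under 100,000 stays local
    return len(df) * (len(entity_cols) ** 2) < 100_000

def run_hypergraph(g, df, entity_cols):
    if should_use_local_hypergraph(df, entity_cols):
        # Local path: client-side execution
        return g.gfql([call('hypergraph', {'entity_types': entity_cols})])
    # Remote path: separate API, separate kwargs, separate test mocks
    return g.gfql_remote([call('hypergraph', {'entity_types': entity_cols})], engine='auto')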
Proposed Solution
1. Unified API with Compound Engine Syntax
# Unified .gfql() method with compound engine specification:
engine="<modality>:<runtime>"
# Where:
# - modality: 'auto' | 'local' | 'remote'
# - runtime: 'auto' | 'pandas' | 'cudf'
2. Engine Value Semantics
# Modality resolution (where to execute):
engine='auto' # Policy decides local vs remote, then auto-selects runtime
engine='local' # Client-side execution, auto-select runtime
engine='remote' # Server-side execution, auto-select runtime
# Explicit modality + runtime:
engine='local:pandas' # Client-side pandas
engine='local:cudf' # Client-side cudf (requires local GPU)
engine='remote:pandas' # Server-side pandas
engine='remote:cudf' # Server-side cudf (requires server GPU)
# Mixed specificity:
engine='auto:pandas' # Policy decides location, force pandas runtime
engine='local:auto' # Force client-side, auto-select runtime
# Backwards compatibility (shorthand for remote execution):
engine='pandas' # Equivalent to 'remote:pandas'
engine='cudf' # Equivalent to 'remote:cudf'
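To make the semantics above concrete, every accepted spelling normalizes to a (modality, runtime) pair before any policy runs. The assertions below restate the table using the _parse_engine helper sketched in section 4; g is any Plottable instance.
# Normalization examples (assumes the _parse_engine sketch from section 4)
assert g._parse_engine('auto') == ('auto', 'auto')
assert g._parse_engine('local') == ('local', 'auto')
assert g._parse_engine('remote:cudf') == ('remote', 'cudf')
assert g._parse_engine('auto:pandas') == ('auto', 'pandas')
assert g._parse_engine('pandas') == ('remote', 'pandas')  # backwards-compat shorthand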
3. Policy Hooks for Custom Execution Strategy
from graphistry import register_gfql_policy
import pandas as pd

def my_execution_policy(
    operation: ASTCall,
    df: pd.DataFrame,
    current_engine: str
) -> tuple[str, str]:
    """Custom policy for modality and runtime selection.
    Args:
        operation: GFQL operation to execute
        df: Input DataFrame
        current_engine: User-specified engine (may be 'auto')
    Returns:
        (modality, runtime) tuple where:
        - modality: 'local' or 'remote'
        - runtime: 'pandas', 'cudf', or 'auto' (defer to server)
    """
    # Parse current_engine
    if ':' in current_engine:
        modality, runtime = current_engine.split(':', 1)
    else:
        modality, runtime = current_engine, 'auto'

    # Apply custom logic for 'auto' modality
    if modality == 'auto':
        # Example: Use complexity threshold for hypergraph
        if operation.function == 'hypergraph':
            entity_cols = operation.params.get('entity_types', df.columns)
            complexity = len(df) * (len(entity_cols) ** 2)
            modality = 'local' if complexity < 100_000 else 'remote'
        else:
            modality = 'remote'  # Default to remote for other operations

    # Apply custom logic for 'auto' runtime
    if runtime == 'auto':
        if modality == 'local':
            runtime = 'cudf' if has_local_gpu() else 'pandas'
        else:
            runtime = 'auto'  # Let server decide

    return (modality, runtime)
# Register policy globally
register_gfql_policy(my_execution_policy)
# Or per-plottable instance
g = g.with_gfql_policy(my_execution_policy)
# Use unified API
g = g.gfql([call('hypergraph', {...})], engine='auto')
# Policy decides: small graph -> 'local:pandas', large graph -> 'remote:cudf'
4. Implementation Sketch
import warnings
from typing import Callable, List, Optional

class Plottable:
    _gfql_policy: Optional[Callable] = None

    def gfql(
        self,
        operations: List[ASTCall],
        engine: str = 'auto',
        persist: bool = True,
        **kwargs
    ) -> 'Plottable':
        """Unified GFQL execution with automatic local/remote routing.
        Args:
            operations: GFQL operations to execute
            engine: Compound engine spec 'modality:runtime' or simple 'runtime';
                default 'auto' uses the registered policy to decide
            persist: For remote execution, persist result on server
            **kwargs: Additional arguments passed to execution backend
        Returns:
            Plottable with operation results
        """
        # Parse engine specification
        modality, runtime = self._parse_engine(engine)

        # Apply policy if registered
        if self._gfql_policy:
            modality, runtime = self._gfql_policy(
                operations[0] if operations else None,
                self._nodes if self._nodes is not None else self._edges,
                f"{modality}:{runtime}"
            )

        # Route to appropriate execution backend
        if modality == 'local':
            return self._gfql_local(operations, runtime=runtime, **kwargs)
        elif modality == 'remote':
            return self._gfql_remote_impl(operations, engine=runtime, persist=persist, **kwargs)
        else:
            raise ValueError(f"Invalid modality: {modality}")

    def _parse_engine(self, engine: str) -> tuple[str, str]:
        """Parse compound engine spec into (modality, runtime)."""
        if ':' in engine:
            modality, runtime = engine.split(':', 1)
        elif engine in ['pandas', 'cudf']:
            # Backwards compatibility: plain runtime means remote
            modality, runtime = 'remote', engine
        else:
            # 'auto', 'local', 'remote'
            modality, runtime = engine, 'auto'
        return (modality, runtime)

    def gfql_remote(self, *args, **kwargs):
        """Deprecated: Use .gfql() with engine='remote:...' instead."""
        warnings.warn(
            ".gfql_remote() is deprecated. Use .gfql(engine='remote:auto') instead.",
            DeprecationWarning,
            stacklevel=2
        )
        return self._gfql_remote_impl(*args, **kwargs)
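The registration entry points referenced above (register_gfql_policy and with_gfql_policy) are not sketched yet. One minimal shape, assuming a module-level default that instances can override, could be the following; gfql() would then consult _effective_gfql_policy rather than _gfql_policy directly.
# Minimal sketch of the registration entry points named in this proposal (not an existing API)
import copy
from typing import Callable, Optional

_global_gfql_policy: Optional[Callable] = None

def register_gfql_policy(policy: Callable) -> None:
    """Set the process-wide default policy consulted when engine='auto'."""
    global _global_gfql_policy
    _global_gfql_policy = policy

# Methods added to the Plottable sketch above:
def with_gfql_policy(self, policy: Callable) -> 'Plottable':
    """Return a copy of this Plottable that uses `policy` instead of the global default."""
    res = copy.copy(self)
    res._gfql_policy = policy
    return res

def _effective_gfql_policy(self) -> Optional[Callable]:
    """Instance policy wins; otherwise fall back to the global registration."""
    return self._gfql_policy or _global_gfql_policy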
Benefits
For PyGraphistry Users
- Simpler API: Single .gfql() method for all execution modes
- Flexible execution: Easy to switch between local/remote/hybrid strategies
- Better testing: Explicit engine='local:pandas' for deterministic tests
- Performance optimization: Automatic routing based on data characteristics
- Backwards compatible: Existing engine='pandas' still works
For PyGraphistry Maintainers
- Cleaner API surface: Deprecate .gfql_remote(), unify on .gfql()
- Extensible architecture: Policy hooks allow advanced users to customize
- Future-proof: Easy to add new modalities (e.g., 'distributed', 'spark')
- Better user experience: Users don't need to understand local vs remote upfront
For GraphistryGPT (Our Use Case)
- Eliminate code duplication: Single execution path with policy
- Testability: Force engine='local:pandas' in unit tests, engine='remote:pandas' in integration tests
- User control: Power users can override with {'engine': 'local:cudf'} in JSON params
- Performance: Automatic hybrid execution based on complexity threshold
Migration Path
Phase 1: Add Unified API (Non-Breaking)
- Implement .gfql() with compound engine syntax
- Keep .gfql_remote() as-is (no deprecation yet)
- Add policy hook support
Phase 2: Soft Deprecation
- Add deprecation warning to .gfql_remote()
- Update documentation to recommend .gfql()
- Provide migration examples (see the before/after example below)
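The migration is a mechanical rewrite per call site; for instance (engine='cudf' on .gfql_remote() maps to 'remote:cudf' under the compound syntax):
# Before: separate remote API
g = g.gfql_remote([call('hypergraph', {...})], engine='cudf')

# After: unified API with compound engine spec
g = g.gfql([call('hypergraph', {...})], engine='remote:cudf')

# After, letting the registered policy pick modality and runtime:
g = g.gfql([call('hypergraph', {...})], engine='auto')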
Phase 3: Hard Deprecation (Major Version)
- Remove .gfql_remote() or make it an alias
- Fully migrate to unified API
Examples
Basic Usage
# Auto mode (uses default policy)
g = graphistry.nodes(df).gfql([call('hypergraph', {...})], engine='auto')
# Explicit local execution
g = graphistry.nodes(df).gfql([call('umap', {...})], engine='local:pandas')
# Explicit remote execution with GPU
g = graphistry.nodes(df).gfql([call('umap', {...})], engine='remote:cudf')
# Backwards compatible
g = graphistry.nodes(df).gfql([call('umap', {...})], engine='cudf') # remote:cudf
With Custom Policy
def hybrid_policy(operation, df, current_engine):
    """Use local for small data, remote for large data."""
    modality, runtime = parse_engine(current_engine)
    if modality == 'auto':
        # Complexity-based threshold
        if len(df) < 10_000 and len(df.columns) < 10:
            modality = 'local'
        else:
            modality = 'remote'
    if runtime == 'auto':
        runtime = 'cudf' if has_gpu() else 'pandas'
    return (modality, runtime)
graphistry.register_gfql_policy(hybrid_policy)
# Now 'auto' uses our policy
g = graphistry.nodes(large_df).gfql([call('hypergraph', {...})], engine='auto')
# -> Automatically uses 'remote:cudf'
g = graphistry.nodes(small_df).gfql([call('hypergraph', {...})], engine='auto')
# -> Automatically uses 'local:pandas'
Testing
def test_hypergraph_local():
    """Test local execution path explicitly."""
    g = graphistry.nodes(df).gfql(
        [call('hypergraph', {'entity_types': ['a', 'b']})],
        engine='local:pandas'  # No mocking needed!
    )
    assert g._nodes is not None
    assert 'nodeID' in g._nodes.columns

def test_hypergraph_remote():
    """Test remote execution path explicitly."""
    with mock.patch('graphistry.client'):
        g = graphistry.nodes(df).gfql(
            [call('hypergraph', {'entity_types': ['a', 'b']})],
            engine='remote:pandas'  # Explicit remote
        )
Open Questions
- Policy composition: Should multiple policies be chainable? (one possible shape is sketched below)
- Async execution: Should engine='remote' support async/await?
- Fallback behavior: If local execution fails, should it auto-retry remotely?
- Observability: Should there be logging/metrics for policy decisions?
- Server-side policies: Should server also have policies for pandas vs cudf selection?
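On the policy-composition question, one purely illustrative shape is a first-match chain: each policy may return None to defer to the next one. This assumes we relax the signature to allow None returns; hypergraph_policy and default_policy below are hypothetical names.
# Illustrative only: one way policy chaining could work if we choose to support it
def chain_policies(*policies):
    def chained(operation, df, current_engine):
        for policy in policies:
            decision = policy(operation, df, current_engine)
            if decision is not None:      # first policy to decide wins
                return decision
        return ('remote', 'auto')         # fall back to server-side defaults
    return chained

register_gfql_policy(chain_policies(hypergraph_policy, default_policy))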
Related Issues
- (Link to any existing PyGraphistry issues about local vs remote execution)
- (Link to any existing issues about engine selection)
Alternatives Considered
Alternative 1: Separate modality and engine parameters
g.gfql([...], modality='auto', engine='auto')
Rejected: two parameters are more verbose than the compound syntax
Alternative 2: String enum without compound syntax
g.gfql([...], execution='local_pandas')
Rejected: Loses composability (can't mix 'auto' modality with 'pandas' runtime)
Alternative 3: Keep .gfql() and .gfql_remote() separate
Rejected: Doesn't solve testing/hybrid execution problems
References
- GraphistryGPT implementation: https://github.com/graphistry/graphistrygpt/pull/2063
- Complexity threshold: rows × entity_cols² < 100,000
- Performance analysis: C06_PERFORMANCE_ANALYSIS_CORRECTED.md
- Hybrid execution in runner.py: should_use_local_hypergraph()
Author: GraphistryGPT team (via Claude Code)
Date: 2025-10-19
Priority: Enhancement (improves API ergonomics and enables advanced use cases)
As part of the engine abstract -> engine concrete resolution, we can track modality. The open question is whether to split out and track a resolved concrete engine modality, or to keep it implicit in the compound string and let dynamic calls carve it out as needed (see the sketch below).
We don't add runtimes quickly, but we do expect a few more.
I don't think we'll be doing connection strings here, though it is starting to look like that.
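If we do split out and track the resolved concrete engine, one lightweight shape (illustrative only; names not settled) is a small record attached to the result so callers and logs can see what actually ran, while the compound string remains the user-facing spelling.
# Illustrative only: explicit tracking of the resolved engine, vs. keeping it
# implicit in the compound string and re-deriving it on demand
from dataclasses import dataclass

@dataclass(frozen=True)
class ResolvedEngine:
    modality: str   # 'local' | 'remote'
    runtime: str    # 'pandas' | 'cudf' | 'auto' (server decides)

    def as_compound(self) -> str:
        return f"{self.modality}:{self.runtime}"

# e.g. stored on the result for observability:
# g._last_resolved_engine = ResolvedEngine('remote', 'cudf')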