grand-cypher
Feature request: grammar / `WHERE` namespace extensions
Use case: Cypher supports a robust library for datetime analysis, among other extensions. While it is unrealistic to expect their integration here, it would be nice to allow Python equivalents during queries. I envision something like this:
import pandas as pd

GrandCypher(g, namespace={"datetime": pd.to_datetime, ...}).run("""
    MATCH (a) --> (b)
    WHERE a.date < datetime("2024-01-01")
    RETURN a.name
""")
Would love to hear your thoughts.
I was able to achieve something similar with this grammar modification:
%import python.expr_stmt -> python_expr
// Replaces current `where_clause` definition
where_clause: "where"i python_expr
And this implementation:
from lark import Lark, Tree
from lark.reconstruct import Reconstructor
import networkx as nx

# Add global items here such as "to_datetime": pd.to_datetime
WHERE_EXPRESSION_GLOBALS = {}

...

class _CypherNamespace(dict):
    """
    dot.notation access to dictionary attributes, useful for enabling cypher filtering
    syntax like `m.born < to_datetime("1990")`
    """

    def __getattr__(self, attr):
        out = self.get(attr)
        if isinstance(out, dict):
            return type(self)(out)
        return out

    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__

...

_CypherGrammar = Lark(..., maybe_placeholders=False)
reconstructor = Reconstructor(_CypherGrammar)

...

# Update these CypherTransformer methods
def where_clause(self, where_clause: list[Tree]):
    # Turn the parsed python_expr subtree back into source text for later evaluation
    self.where_string = reconstructor.reconstruct(where_clause[0])

def _new_where_condition(cname_value_map: dict, target_graph: nx.DiGraph, _):
    if not self.where_string:
        return True, []
    eval_locals = {
        cname: _CypherNamespace(target_graph.nodes[value])
        for cname, value in cname_value_map.items()
    }
    # Evaluate the reconstructed WHERE string with user-supplied globals and per-match locals
    result = eval(self.where_string, WHERE_EXPRESSION_GLOBALS, eval_locals)
    return result, [result]
Currently, it assumes a hard-coded dict of globals (`WHERE_EXPRESSION_GLOBALS`) that the user can update with their own values.
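To illustrate, usage would look something like this (a sketch that assumes the patched grammar/transformer above, not released grand-cypher; the graph `g`, the node attribute, and the registered name are placeholders):

```python
import pandas as pd

# Make the callable visible inside reconstructed WHERE expressions:
WHERE_EXPRESSION_GLOBALS.update({"to_datetime": pd.to_datetime})

results = GrandCypher(g).run("""
MATCH (m)
WHERE m.born < to_datetime("1990")
RETURN m.name
""")
```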
It was a fun intro to lark 🙂
@ntjess just wanted to pop in and tell you this looks so awesome! I need to take a closer look here (and in #59), but I'm unfortunately in the midst of my PhD dissertation proposal process and it's using up all my cycles for the next week or so. But I love what you did here! I'm brainstorming about how we can integrate this into the official codebase while keeping back-compat and vuln surface-area low!!
> back-compat
One option is to add `python_expr` at the end instead of replacing the current `where` matches. In cases that match Python expressions, the same behavior will result. In cases (like `contains`) that are not valid Python, Lark should resolve in favor of the legacy behavior.
> vuln surface-area low
I've discovered `pd.eval`, which can help with this. It limits Python "control codes" like dot-access and module imports, but of course there will always be vulnerabilities associated with eval'ing Python code...
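To illustrate what I mean, a rough sketch (behavior and exact error types vary across pandas versions, so treat this as illustrative only):

```python
import pandas as pd

# Plain comparison/arithmetic expressions evaluate against names passed via local_dict:
print(pd.eval("born < cutoff", engine="python", local_dict={"born": 1990, "cutoff": 2000}))

# Imports and dunder names are rejected rather than executed:
try:
    pd.eval("__import__('os').system('echo pwned')", engine="python")
except Exception as exc:  # the exact exception type differs by pandas version
    print("rejected:", type(exc).__name__)
```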
I wonder if a pattern similar to what the Python `sqlite3` module does is one to consider. Check out this `create_function` documentation.
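For reference, the sqlite3 pattern looks roughly like this (a minimal sketch, not grand-cypher code): the user registers plain Python callables by name, and only those names become callable from query text.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, born TEXT)")
conn.execute("INSERT INTO people VALUES ('Ada', '1815-12-10'), ('Grace', '1906-12-09')")

# Expose a Python callable to SQL under an explicit name and arity:
conn.create_function("birth_year", 1, lambda iso: date.fromisoformat(iso).year)

print(conn.execute("SELECT name FROM people WHERE birth_year(born) < 1900").fetchall())
```

The appeal is that nothing outside the explicitly registered callables is reachable from the query string.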
I'd be concerned from a security standpoint about introducing evals into the query syntax. Seems like a lot could go wrong there.
I'll also just drop here that if we support create_function / namespace-style additions, we should totally make sure that there's a clear error message that disambiguates things like (see the sketch after this list):
- this is not recognized syntax
- this is a stdlib function but we haven't implemented it yet
- this is not a recognized function (maybe you forgot a namespace/create_function?)
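A hypothetical sketch of how those three cases could surface (none of these exception names exist in grand-cypher today; they are purely illustrative):

```python
class CypherSyntaxError(Exception):
    """The query text isn't recognized by the grammar at all."""

class UnsupportedCypherFeature(Exception):
    """Valid Cypher (e.g. a stdlib function) that grand-cypher hasn't implemented yet."""

class UnknownScopeFunction(Exception):
    """A function call whose name was never registered (namespace / create_function)."""

def resolve_scope_function(name: str, scope_functions: dict):
    # Hypothetical helper: turn a missing registration into an actionable message.
    try:
        return scope_functions[name]
    except KeyError:
        raise UnknownScopeFunction(
            f"'{name}' is not a registered function; did you forget to pass it "
            "via scope_functions / create_function?"
        ) from None
```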
I prefer that we have better controls over the functions than eval'ing any `python_expr`.
If I understand correctly, we are going to have custom namespace functions, which can be used in the WHERE clause, especially on the "value" side. I think the parser itself can support a `namespace_functions` setting in the WHERE clause.
Agreed! Possibly the best way to do this is at the hints level; for more fine-grained Python interweaving into queries it might make sense to work with grandiso directly!
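For anyone curious, the hints-level idea would look something like the following, if I'm remembering grandiso's `find_motifs` signature correctly (treat the `hints` usage as an assumption to verify against the grandiso docs):

```python
import networkx as nx
from grandiso import find_motifs

motif = nx.DiGraph([("a", "b")])
host = nx.DiGraph([(1, 2), (2, 3)])

# Do the "Python interweaving" outside the query: pre-filter candidate assignments
# with arbitrary Python, then seed the motif search with them as partial mappings.
hints = [{"a": n} for n in host.nodes if n % 2 == 1]
print(find_motifs(motif, host, hints=hints))
```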
Hi @j6k4m8, I'm not sure how to link this issue with an MR, so I'm going to paste the MR here.
Basically, I added support for a `scope_functions` setting in the GrandCypher initialization.
def test_nested_functions(self, graph_type):
    host = graph_type()
    host.add_node(1, d=date(2025, 10, 31))
    host.add_node(2, d=date(2025, 11, 30))
    host.add_node(3, d=date(2025, 12, 31))

    qry = """
    MATCH (A)
    WHERE A.d < date(int("2025"), add(5, 7), 1)
    RETURN ID(A)
    """

    res = GrandCypher(host, scope_functions={"date": date, "int": int, "add": lambda a, b: a + b}).run(qry)
    assert res["ID(A)"] == [1, 2]
https://github.com/aplbrain/grand-cypher/pull/80