Enhancements to user interface when using QL with row MultiIndex
In #76, I added a new multi_index_mode parameter to GraphFrame.filter and the query language to allow us to apply queries to GraphFrames that have a row MultiIndex. Since then, however, there has been a lot of confusion about the parameter, what it does, and how to use it, especially in Thicket.
This PR improves the naming, simplifies the default use, and enhances the functionality of this feature. More specifically, this PR does four things:
- Renames multi_index_mode to predicate_row_aggregator, which more clearly indicates that the argument is used to aggregate per-row outputs from predicates
- Expands the values accepted by predicate_row_aggregator
- Adds a new mechanism that allows the query classes (i.e., Query, ObjectQuery, StringQuery) to define a default aggregator
- Moves the logic for applying aggregators to QueryEngine, which allows us to bypass all of this if we don't have a row MultiIndex
With this PR, the predicate_row_aggregator argument now accepts the following:
- None: tells Hatchet to use the default aggregator for the type of query
- "off": tells Hatchet to not use any aggregators (note: this will result in errors if there is a row MultiIndex)
- "all": applies an aggregator that returns true if and only if the predicate returned true for all rows associated with a node
- "any": applies an aggregator that returns true if the predicate returned true for any row associated with a node
- Callable that takes a pandas.Series of booleans as input and returns a boolean as output: applies the user-provided function as an aggregator
When using predicate_row_aggregator=None, the aggregators used will be:
- "off" if using a base syntax query (corresponds to the Query class)
- "all" if using an object or string dialect query (corresponds to the ObjectQuery and StringQuery classes)
- the default aggregators for each subquery if using a compound query
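The accepted values and the per-query-type defaults described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: resolve_aggregator and its default parameter are hypothetical names, and the builtins all and any stand in for whatever QueryEngine actually applies. In Hatchet the aggregator would receive a pandas.Series of booleans; plain lists are used here to keep the sketch dependency-free.

```python
def resolve_aggregator(predicate_row_aggregator, default="all"):
    """Hypothetical helper: map an argument value to an aggregator.

    `default` plays the role of the query class's default aggregator
    ("off" for base-syntax queries, "all" for object/string dialect).
    Returns None when aggregation is turned off.
    """
    value = predicate_row_aggregator
    if value is None:
        # Fall back to the default for this type of query.
        value = default
    if callable(value):
        # User-provided function: booleans in, single boolean out.
        return value
    if value == "off":
        return None
    if value == "all":
        return all
    if value == "any":
        return any
    raise ValueError(f"Unsupported predicate_row_aggregator: {value!r}")


# Per-row predicate results for one node (three rows/profiles).
rows = [True, True, False]

all_result = resolve_aggregator("all")(rows)
any_result = resolve_aggregator("any")(rows)

# A custom aggregator: the node matches if most of its rows match.
majority = lambda results: sum(results) > len(results) / 2
majority_result = resolve_aggregator(majority)(rows)
```

With a base-syntax query, passing None would resolve to "off" in this sketch (resolve_aggregator(None, default="off") returns None), which mirrors why base-syntax predicates have to reduce to a single boolean themselves.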
To clarify, we need multi_index_mode/predicate_row_aggregator because the graph-algorithm part of the query language needs predicates to provide a single boolean for each node. When we do not have a row MultiIndex (i.e., the standard case for Hatchet), this requirement is always satisfied. However, when we do have a row MultiIndex (i.e., the standard case for Thicket), this requirement is never satisfied because the DataFrame has multiple rows per node. As a result, predicates return a pandas.Series of booleans when we have a row MultiIndex. The multi_index_mode/predicate_row_aggregator argument provides a mechanism to aggregate that Series of booleans into a single boolean.
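To make the problem concrete, here is a small pandas-only sketch of what happens with a row MultiIndex. The DataFrame, its index level names ("node", "profile"), and the "time" column are made up for illustration; they are not taken from Hatchet or Thicket.

```python
import pandas as pd

# Toy stand-in for a Thicket-style DataFrame: two profiles (rows) per node.
index = pd.MultiIndex.from_tuples(
    [("foo", 0), ("foo", 1), ("bar", 0), ("bar", 1)],
    names=["node", "profile"],
)
df = pd.DataFrame({"time": [5.0, 12.0, 11.0, 20.0]}, index=index)

# A predicate evaluated per row yields several booleans for each node...
per_row = df["time"] > 10.0

# ...so they have to be reduced to one boolean per node.
matches_all = per_row.groupby(level="node").all()  # like "all"
matches_any = per_row.groupby(level="node").any()  # like "any"
```

Here node "foo" has one row below the threshold and one above it, so it matches under "any" but not under "all", while "bar" matches under both.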
The following examples show where this aggregation argument is relevant. First, a base-syntax query that matches nodes with name "my_node"; aggregation does not need to be specified because the predicate calls .all() itself:
query = th.query.Query().match(
    "*",
    lambda row: row["name"].apply(
        lambda tn: tn == "my_node"
    ).all()
)
tkq = tk.query(query)
The equivalent string-dialect query, where specifying the aggregation is necessary:
query = """
MATCH ("*")->(n) WHERE n."name"="my_node"
"""
filt = tk.query(query, predicate_row_aggregator="all")
Matching a single node with name "my_node":
query = th.query.Query().match(
    1,
    lambda row: row["name"].apply(
        lambda tn: tn == "my_node"
    ).all()
)
tkq = tk.query(query)
or
query = th.query.Query().match(
    ".",
    lambda row: row["name"].apply(
        lambda tn: tn == "my_node"
    ).all()
)
tkq = tk.query(query)
@slabasan I'm removing this from the upcoming release. The enhancements added by this PR complicate the process of building queries from the string dialect because I need to know whether or not the DataFrame has a row MultiIndex. Given the other work I have to do, getting this done in time for the release is likely not feasible.