Enhancements to user interface when using QL with row MultiIndex

Open ilumsden opened this issue 1 year ago • 4 comments

In #76, I added a new multi_index_mode parameter to GraphFrame.filter and the query language to allow us to apply queries to GraphFrames that have a row MultiIndex. However, since then, there's been a lot of confusion about the parameter, what it does, and how to use it, especially in Thicket.

This PR improves the naming, simplifies the default use, and enhances the functionality of this feature. More specifically, this PR does 4 things:

  1. Renames multi_index_mode to predicate_row_aggregator, which more clearly indicates that the argument is used to aggregate per-row outputs from predicates
  2. Expands the acceptable values for predicate_row_aggregator
  3. Adds a new mechanism that allows the query classes (i.e., Query, ObjectQuery, StringQuery) to define a default aggregator
  4. Moves logic for applying aggregators to QueryEngine, which allows us to bypass all of this if we don't have a row MultiIndex

With this PR, the predicate_row_aggregator argument now accepts the following:

  • None: tells Hatchet to use the default aggregator for the type of query
  • "off": tells Hatchet to not use any aggregators (note: this will result in errors if there is a row MultiIndex)
  • "all": applies an aggregator that returns true if and only if the predicate returned true for all rows associated with a node
  • "any": applies an aggregator that returns true if the predicate returned true for any row associated with a node
  • Callable that takes a pandas.Series of booleans as input and returns a boolean as output: applies the user-provided function as an aggregator
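As a rough illustration of the accepted values listed above, the sketch below maps each one to a plain Python callable. It is not Hatchet's actual implementation: resolve_aggregator, the Aggregator alias, and the default parameter are hypothetical names, and a list of booleans stands in for the pandas.Series that a predicate would return.

```python
from typing import Callable, Optional, Sequence, Union

# An aggregator reduces the per-row booleans of one node to a single boolean.
Aggregator = Callable[[Sequence[bool]], bool]


def resolve_aggregator(
    spec: Union[None, str, Aggregator],
    default: str = "all",
) -> Optional[Aggregator]:
    """Hypothetical helper: map an argument value to a callable (None means "off")."""
    if spec is None:
        # Assumption: the caller passes in the default for its query type.
        spec = default
    if spec == "off":
        return None  # no aggregation; predicates must already return one boolean
    if spec == "all":
        return all
    if spec == "any":
        return any
    if callable(spec):
        return spec  # user-provided aggregator
    raise ValueError(f"unsupported aggregator: {spec!r}")


# Per-row predicate results for one node (a list stands in for the boolean Series).
rows = [True, False, True]

print(resolve_aggregator("all")(rows))  # False
print(resolve_aggregator("any")(rows))  # True


# A user-provided callable: true when more than half of the rows matched.
def majority(values: Sequence[bool]) -> bool:
    return sum(values) > len(values) / 2


print(resolve_aggregator(majority)(rows))  # True
```

The custom-callable case is what makes policies between "all" and "any" possible, such as the majority vote shown here.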

When using predicate_row_aggregator=None, the aggregators used will be:

  • "off" if using a base syntax query (corresponds to the Query class)
  • "all" if using an object or string dialect query (corresponds to the ObjectQuery and StringQuery classes)
  • the default aggregators for each subquery if using a compound query
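The default rules above can be summarized as a small lookup. This is only a sketch under stated assumptions: the "base", "object", and "string" labels, the DEFAULT_AGGREGATOR table, and the default_aggregators function are hypothetical stand-ins, and a compound query is modeled simply as a list of subquery kinds.

```python
from typing import List, Union

# Hypothetical labels for the three query kinds and their default aggregators.
DEFAULT_AGGREGATOR = {
    "base": "off",    # corresponds to the Query class
    "object": "all",  # corresponds to the ObjectQuery class
    "string": "all",  # corresponds to the StringQuery class
}


def default_aggregators(query_kind: Union[str, List[str]]) -> List[str]:
    """Return the default aggregator of each (sub)query.

    A single kind yields a one-element list; a compound query (modeled here
    as a list of kinds) yields one default per subquery.
    """
    kinds = [query_kind] if isinstance(query_kind, str) else query_kind
    return [DEFAULT_AGGREGATOR[kind] for kind in kinds]


print(default_aggregators("base"))              # ['off']
print(default_aggregators(["base", "string"]))  # ['off', 'all']
```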

ilumsden avatar Nov 08 '24 03:11 ilumsden

To clarify, we need multi_index_mode/predicate_row_aggregator because the graph-algorithm part of the query language requires each predicate to provide a single boolean per node. When we do not have a row MultiIndex (i.e., the standard case for Hatchet), this requirement is always satisfied. However, when we do have a row MultiIndex (i.e., the standard case for Thicket), this requirement is never satisfied because the DataFrame has multiple rows per node. As a result, predicates return a pandas.Series of booleans when we have a row MultiIndex. The multi_index_mode/predicate_row_aggregator argument provides a mechanism to aggregate that Series of booleans into a single boolean.
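The situation described above can be sketched in plain Python, without pandas. The data, the "time" column, and the profile labels are made up for illustration; (node, profile) tuple keys stand in for a row MultiIndex, so each node owns several rows.

```python
from collections import defaultdict

# (node, profile) keys mimic a row MultiIndex: two rows per node.
rows = {
    ("my_node", "profile_a"): {"name": "my_node", "time": 1.0},
    ("my_node", "profile_b"): {"name": "my_node", "time": 3.0},
    ("other", "profile_a"): {"name": "other", "time": 2.5},
    ("other", "profile_b"): {"name": "other", "time": 4.0},
}


def predicate(row: dict) -> bool:
    """Hypothetical per-row predicate."""
    return row["time"] > 2.0


# Applying the predicate row by row yields several booleans per node.
per_node = defaultdict(list)
for (node, _profile), row in rows.items():
    per_node[node].append(predicate(row))

print(dict(per_node))  # {'my_node': [False, True], 'other': [True, True]}

# The graph algorithm needs exactly one boolean per node, so aggregate.
matches_all = {node: all(flags) for node, flags in per_node.items()}
matches_any = {node: any(flags) for node, flags in per_node.items()}

print(matches_all)  # {'my_node': False, 'other': True}
print(matches_any)  # {'my_node': True, 'other': True}
```

Without the aggregation step, the matcher would receive a list of booleans for each node and could not decide whether the node matches.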

ilumsden avatar Nov 13 '24 18:11 ilumsden

Here is an example of where this aggregation argument is relevant. The following base-syntax query matches nodes with name "my_node"; aggregation does not need to be specified because the predicate calls .all() itself:

query = th.query.Query().match(
    "*",
    lambda row: row["name"].apply(
        lambda tn: tn == "my_node"
    ).all()
)
tkq = tk.query(query)

Equivalent string syntax query where specifying aggregation is necessary

query = """
MATCH ("*")->(n) WHERE n."name"="my_node"
"""
filt = tk.query(query, predicate_row_aggregator="all")

michaelmckinsey1 avatar Nov 13 '24 23:11 michaelmckinsey1

Matching a single node with name my_node

query = th.query.Query().match(
    1,
    lambda row: row["name"].apply(
        lambda tn: tn == "my_node"
    ).all()
)
tkq = tk.query(query)

or

query = th.query.Query().match(
    ".",
    lambda row: row["name"].apply(
        lambda tn: tn == "my_node"
    ).all()
)
tkq = tk.query(query)

michaelmckinsey1 avatar Nov 15 '24 17:11 michaelmckinsey1

@slabasan I'm removing this from the upcoming release. The enhancements added by this PR complicate the process of building queries from the string dialect because I need to know whether or not the DataFrame has a multi-index. Given the other work I have to do, trying to get this done in time for the release is likely not feasible.

ilumsden avatar Mar 14 '25 19:03 ilumsden