datafusion Make `CommonSubexprEliminate` faster by avoiding the use of strings

Is your feature request related to a problem or challenge?

Part of https://github.com/apache/datafusion/issues/5637

One of the optimizer passes is "common subexpression elimination" that removes redundant computation

However, as @peter-toth noted on https://github.com/apache/datafusion/pull/10396 and the CSE code says

https://github.com/apache/datafusion/blob/d58bae487329b7a7078429f083bffc611f42c8c7/datafusion/optimizer/src/common_subexpr_eliminate.rs#L108-L119

The way it tracks common subexpressions is with string manipulation is is non ideal for several reasons (including the cost of creating those strings)

Describe the solution you'd like

Revisit the identifiers as using these string identifiers as the keys of ExprStats was not the best choice. Please note this is how CSE has been working since the feature was added initially.

Describe alternatives you've considered

No response

Additional context

No response

May 08 '24 17:05 alamb

I'm happy to take this.

May 08 '24 17:05 peter-toth

Are there any potential issues with simply using the existing Hash implementation of Expr to create HashSets?

Serveral other optimization passes use string names as keys for expressions in data structures. I am wondering if any of these could also be refactored to simply use HashSet<Expr> or HashSet<&Expr>

synthetic group by expressions for aggregates: https://github.com/apache/datafusion/blob/accce9732e26723cab2ffc521edbf5a3fe7460b3/datafusion/expr/src/logical_plan/builder.rs#L1246-L1270

functional dependencies heavily uses display_name to represent group by exprs: https://github.com/apache/datafusion/blob/main/datafusion/common/src/functional_dependencies.rs

decorrelate: https://github.com/apache/datafusion/blob/accce9732e26723cab2ffc521edbf5a3fe7460b3/datafusion/optimizer/src/decorrelate.rs#L65

push down filter for aggregates: https://github.com/apache/datafusion/blob/accce9732e26723cab2ffc521edbf5a3fe7460b3/datafusion/optimizer/src/push_down_filter.rs#L788-L837

single distinct to group by: https://github.com/apache/datafusion/blob/accce9732e26723cab2ffc521edbf5a3fe7460b3/datafusion/optimizer/src/single_distinct_to_groupby.rs#L69-L96 https://github.com/apache/datafusion/blob/accce9732e26723cab2ffc521edbf5a3fe7460b3/datafusion/optimizer/src/single_distinct_to_groupby.rs#L185

May 10 '24 22:05 erratic-pattern

Are there any potential issues with simply using the existing Hash implementation of Expr to create HashSets?

Serveral other optimization passes use string names as keys for expressions in data structures. I am wondering if any of these could also be refactored to simply use HashSet<Expr> or HashSet<&Expr>

Thanks for these references @erratic-pattern.

Background and general thoughts:

I'm only familiar with CSE code and in its case unfortunately non-unique stringified expression were used as keys of the map that stores the occurrance counts. This bug was introduced in https://github.com/apache/datafusion/pull/9871 and reverted in https://github.com/apache/datafusion/pull/10396. The issue with these colliding string keys are explained here in details: https://github.com/apache/datafusion/pull/10333#discussion_r1595242837.

Some thougths about CSE:

After https://github.com/apache/datafusion/pull/10396 we still use stringified expressions as keys (Identifier), but the strings we use encode whole expression subtrees. This is far from optimal and this ticket / my work in progress change would like to help with that.

In case of CSE we could use Expr as keys of the ExprStats map, but then we would need to clone Exprs when we fill up the ExprStats map during the first traversal. This would be particulary costly in CSE because we need to store not only the counts for all top level expressions, but the counts of all their descendant subexpressions.

We could also use &Expr as keys (and so we didn't need to clone the expressions), but there is a problem here. The current TreeNode::apply() / TreeNode::visit() APIs aren't capable to fill up such a HashMap<&Expr, ...> map. This is because of restricted TreeNode reference lifetimes used in closures / TreeNodeVisitor methods. I.e. this currently doesn't work:

let e = sum((col("a") * (lit(1) - col("b"))) * (lit(1) + col("c")));
let mut m = HashMap::new();
e.apply(|e| {
    *m.entry(e).or_insert(0) += 1;
    Ok(TreeNodeRecursion::Continue)
});

println!("m: {:#?}", m);

This issue can be solved by adding new TreeNode APIs or fixing the current ones. I have a WIP commit here: https://github.com/peter-toth/arrow-datafusion/commit/e8447996462ae4710b573c7088bc6d8b1e586cfb that adds TreeNode::apply_ref() / TreeNode::visit_ref(). Using apply_ref() in the above example would make it work, but I haven't opened a PR yet as there are a few things to consider: a. We don't really want to add any more new APIs (especially if their puspose is similar to existing ones). b. We can't change the lifetimes of references in the current apply() / visit() easily. This is because some TreeNode implementations are not compatible with that. (E.g. DynTreeNode doesn't have a method to get references to its children, LogicalPlan creates temprorary objects in its map_subqueries(), ...).

Despite the fact that my WIP commit adds new APIs, I would prefer and lean towards option b.. But since I'm only aware of this ticket that requires this change to the APIs, I haven't opened the PR yet.

Now there is another thing to consider if we want use &Expr as keys of ExprStats. The current CSE algorithm, that was added by the original author of CSE in DataFusion (and not myself), is very clever and does the following: In the first traversal it:

Creates a mapping for each top level expression (this is called IdArray) that stores the preorder visit index of a node to an Identifier (of a subexpression tree).
And also creates a map (this is called ExprStats) that contains the Identifier -> count stats gathered for all top level expressions and their subexpressions.

This is very nice, because the second, rewriting traversal can use the preorder visit index again to look up the identifier first and then the count from the ExprStats map. Providing that an identifier is small, this can be much faster then using &Expr as keys because:

Computing hash() of an &Expr (instead of using preorder index) in the second traversal is costly if the expression is deep and contains lots of indirections (Boxes).
When we generate the identifiers in the first traversal we can use the traversal's bottom-up phase to build up identifiers from the current node and the identifiers of the node's children very effectively.

In my work in progress change for this issue I would like to finalize the:

TreeNode API changes required (maybe open a separate PR for it)
and replce the current String based identifier to a (u64, &Expr) like tuple/struct. The first item contains a precomputed hash of the identifier. (As I mentioned, we can use the bottom-up phase of the first traversal to compute that effectively since this logic is already implmeneted in the CSE algorithm.) The overriden hash() of the struct should return this precomputed hash. The second item is a &Expr that can be used in the struct's eq() implementation in case of hash collision.

Back to the original question of using `HashSet<Expr, ...>` or `HashSet<&Expr, ...>`:

I think both are accepable but CSE is special as the maps need to store all the descendant subexpressions as well and the impemented CSE algorithm seems to offer a way to implement a better identifier than just a simple &Expr. I don't know the other referenced usecases but if collision of string names can happen there then we should definitely fix it.

May 11 '24 10:05 peter-toth

Thanks for the detailed write up @peter-toth . Though I did mention HashSet<Expr> specifically, my suggestion more generally goes along the lines of using the Hash implementation in some way to produce the identifiers. After looking at the code a bit more, I do see the cloning/lifetime issues with using Expr or &Expr as keys directly. I also did not consider the cost of re-computing hashes. I do think in that case it does make sense to pre-compute the hash instead.

I like the idea of generalizing the (u64, &Expr) struct into something reuseable across optimizations, as it seems to be a common pattern where we need to:

produce some unique identifier for an expression that can be stored in a data structure
use that identifier to generate aliases for newly generated expressions, or create a new Column/Field somewhere with that expression as a name. this can be done thanks to the &Expr in the struct which would allow us to call display_name
do so in a way that doesn't conflict with ownership/borrowing semantics. we might still run into borrowing issues because of the &Expr reference, but it's hard to say without trying to adapt this solution to other optimizers. Rc or Arc is a potential option as well. The struct could potentially be generic over Borrow to support any of these.
avoid recomputing the hash/key on every insert/lookup operation

Anyway, I don't want to over-abstract just yet, so for now just build something that works for CSE and then we can take it and see if it can be applied to any of the other optimizations.

I am curious if overriding hash() in this way will conflict with the Hash Eq property in some unforseen way. I think as long as we're constructing it such that the &Expr is always a reference to the Expr that produced the hash, it should be fine.

May 11 '24 12:05 erratic-pattern

I like the idea of generalizing the (u64, &Expr) struct into something reuseable across optimizations.

Honestly, I don't know those referenced usecases, but I feel (u64, &Expr) (and any Identifier in general) makes sense only for CSE (2 traversals, we can build up a preorder visit cache of Identifiers in the first traversal and second traversal is top-down) and not sure the others have the same characteristics... If that's not the case then it doesn't make sense to use Identifiers instead of Expr/&Exprs.

Anyways, I will try to open the PR with it next week and then feel free to generalize the idea for other usecases if it makes sense.

May 11 '24 13:05 peter-toth

I've opened a draft PR: https://github.com/apache/datafusion/pull/10473 and will try to wrap it up in the following days.

May 12 '24 18:05 peter-toth

I have a WIP commit here: https://github.com/peter-toth/arrow-datafusion/commit/e8447996462ae4710b573c7088bc6d8b1e586cfb that adds TreeNode::apply_ref() / TreeNode::visit_ref(). Using apply_ref() in the above example would make it work, but I haven't opened a PR yet as there are a few things to consider: ... But since I'm only aware of this ticket that requires this change to the APIs, I haven't opened the PR yet.

Here is one example API that I would love to implement with such a tree-node api: https://github.com/apache/datafusion/issues/10505

I also ran into an example when trying to find embedded Subquerys in an Expr in https://github.com/apache/datafusion/blob/424757f365ea9eab7b242f5b604ef804baecd666/datafusion/optimizer/src/scalar_subquery_to_join.rs#L54-L68

May 14 '24 13:05 alamb

Here is one example API that I would love to implement with such a tree-node api: https://github.com/apache/datafusion/issues/10505

Thanks for sharing this @alamb. It's good to know that there are other possible usecases for this new API. https://github.com/apache/datafusion/pull/10473 seems to pass all tests now. I will extract the first commit of it into a separate PR today or tomorrow to add the new TreeNode API.

May 14 '24 13:05 peter-toth

FWIW I've also seen the high cost of expression string formatting (using Display/to_string()) in a good number of profiles.

I think there's nothing wrong about having a "display" infrastructure, but it shouldn't be used eagerly. As others pointed out, using a hash or any form of numeric ID would probably be better in many places.

May 15 '24 09:05 crepererum

FWIW I've also seen the high cost of expression string formatting (using Display/to_string()) in a good number of profiles.

I think there's nothing wrong about having a "display" infrastructure, but it shouldn't be used eagerly. As others pointed out, using a hash or any form of numeric ID would probably be better in many places.

100% -- btw https://github.com/apache/datafusion/pull/10454 from @erratic-pattern made this code faster (fewer allocations) though it would be better still as you point out to not use display as much.

I will say from personal experience working on postgres / postgres derived systems (which does use a numeric id to identify columns), using strings is much easier to debug when problems occur. I do think we can reduce it significantly however

May 15 '24 17:05 alamb

I think there's nothing wrong about having a "display" infrastructure, but it shouldn't be used eagerly. As others pointed out, using a hash or any form of numeric ID would probably be better in many places.

@crepererum I am working on moving away from string allocations in a number of the optimization rules and switching to Hash based implementations.

Most of these use the Expr::display_name method which - maybe confusingly - doesn't actually use Display but instead uses an internal create_name function. It is similar to the Display implementation but has some differences. For instance, Cast expressions are ignored and column references are rendered with a different syntax.

I would be interested in seeing the profile data you mentioned, especially those that use Display and to_string as that might indicate that something else could be improved beyond the changes I am currently working on.

Also since we're no longer talking about strictly CommonSubexprEliminate at this point, it might be a good idea to track this as a separate issue and link this discussion.

May 16 '24 20:05 erratic-pattern

I can see if I get organize you some profiles next week :slightly_smiling_face:

May 17 '24 09:05 crepererum

datafusion datafusion copied to clipboard

Make `CommonSubexprEliminate` faster by avoiding the use of strings

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Background and general thoughts:

Some thougths about CSE:

Back to the original question of using HashSet<Expr, ...> or HashSet<&Expr, ...>:

datafusion
datafusion copied to clipboard

Back to the original question of using `HashSet<Expr, ...>` or `HashSet<&Expr, ...>`: