datafusion Stop copying `Expr`s and LogicalPlans so much during Common Subexpression Elimination

Is your feature request related to a problem or challenge?

The common subexpression elimination pass copies many Exprs around. You can see this performance impact this has by looking at the screenshot from https://github.com/apache/arrow-datafusion/issues/9637#issue-2189931564

While we will fix the copying plan problem in https://github.com/apache/arrow-datafusion/issues/9637 I think there is more work to be done in the common sub expression code itself, which copies a significant number of Exprs and Strings around

Describe the solution you'd like

Figure out how to avoid cloneing Exprs in the https://github.com/apache/arrow-datafusion/blob/main/datafusion/optimizer/src/common_subexpr_eliminate.rs

The Exprs themselves
Avoid creating Strings for Identifier

We should see a significant improvement in the sql_planner benchmarks:

cargo bench --bench sql_planner

Describe alternatives you've considered

No response

Additional context

I noticed this while reviewing https://github.com/apache/arrow-datafusion/pull/9871

Mar 30 '24 12:03 alamb

I'm happy to take this issue, but I just noticed that there is a draft PR already: https://github.com/apache/arrow-datafusion/pull/10067. @alamb would you like to finish that PR?

Apr 18 '24 10:04 peter-toth

I'm happy to take this issue, but I just noticed that there is a draft PR already: #10067. @alamb would you like to finish that PR?

I am on vacation this week (well, partial vacation as it is school break week) so I likely won't have time to work on #10067 until next week. If you have time to do so that would be great. Otherwise I will pick it up the following week

CSE is one of the passes that shows up in the planning benchmarks when I last profiled it, so I think this particular change will be very impactful to planning performance

Apr 19 '24 11:04 alamb

(Thank you @peter-toth 🙏 -- it is much fun working with you)

Apr 19 '24 11:04 alamb

Ok, got it. I'm happy to look into it, but not sure I can do it before Monday. Enjoy your partial vacation! 😄

Apr 19 '24 11:04 peter-toth

I've started working on this, but it will surely take some time...

Apr 22 '24 17:04 peter-toth

@alamb, just a quick update that I'm still working on this. Trying out different ideas... Will try to open a PR this week or next.

May 01 '24 17:05 peter-toth

Hey I am interested in helping with this. Maybe @peter-toth and I can divide our efforts here? Let me know what you've worked on so far, and I can figure out how to help.

One thing I see in particular that's not directly cloning the LogicalPlans and Exprs, but may be putting pressure on the global allocator, is the Identifiers in the ExprSet which are represented as Strings produced by Displaying the Exprs.

Assuming that the new zero-copy implementation will continue using the ExprSet, maybe I could look into efficiently hashing subexpressions to produce numeric identifiers that are easier to copy.

May 06 '24 18:05 erratic-pattern

Actually it looks like ExprSet needs to be completely redesigned because it relies on cloning the Exprs.

I've started a branch for my work. Here is a basic commit that simply swaps the existing code over to using the rewrite API.

May 06 '24 19:05 erratic-pattern

I've just opened a PR https://github.com/apache/datafusion/pull/10396. I wanted to continue @alamb's https://github.com/apache/datafusion/pull/10067 but then I ran into a few different issues with the rule so I think we should address those first.

2 notes:

https://github.com/apache/datafusion/pull/10067 is orthogonal to my PR it would be great if someone could continue that effort as unfortunately I don't have much time to work on DataFusion lately.
the issue of String identifiers are explained in my PR. I have a solution for the issue, but https://github.com/apache/datafusion/pull/10396 is already huge and the implementation will be a bit tricky...

May 06 '24 19:05 peter-toth

https://github.com/apache/datafusion/pull/10067 is orthogonal to my PR it would be great if someone could continue that effort as unfortunately I don't have much time to work on DataFusion lately.

I am happy to do this. Thank you @peter-toth

May 07 '24 19:05 alamb

@peter-toth

the issue of String identifiers are explained in my PR. I have a solution for the issue, but https://github.com/apache/datafusion/pull/10396 is already huge and the implementation will be a bit tricky...

~I am thinking a Hash implementation for Expr would fix this problem, and is something I plan to work on soon. Would be interested to know what approach you think would work best.~

~Having a Hash impl would also allow us to refactor other optimization rules which make heavy use of display_name to generate HashMap keys for Exprs~

UPDATE: It looks like Expr already derives Hash. Is there a reason we're not using that instead of string keys?

May 10 '24 21:05 erratic-pattern

UPDATE: It looks like Expr already derives Hash. Is there a reason we're not using that instead of string keys?

I believe CSE may predate the Hash impl for Expr

May 11 '24 11:05 alamb

UPDATE: It looks like Expr already derives Hash. Is there a reason we're not using that instead of string keys?

I forgot to comment on this thread that my detailed answer is on the other: https://github.com/apache/datafusion/issues/10426#issuecomment-2105664520

May 11 '24 11:05 peter-toth

BTW I am purposely procrastinating on this issue until https://github.com/apache/datafusion/issues/10426 / https://github.com/apache/datafusion/pull/10473 are done to avoid conflicts

Though I may do some preliminary work in smaller PRs

May 14 '24 13:05 alamb

@alamb, I think you shouldn't wait for my https://github.com/apache/datafusion/pull/10473, as there is still some preliminary work required I need to finish (https://github.com/apache/datafusion/pull/10543) and I'm still not sure about a few details of https://github.com/apache/datafusion/pull/10473. Feel free to proceed with refactoring the rule in your https://github.com/apache/datafusion/pull/10067 to use rewrite() and I will rebase my PR on top of yours and resolve the possible conflicts.

May 17 '24 10:05 peter-toth

Thanks @peter-toth

May 17 '24 12:05 alamb

datafusion datafusion copied to clipboard

Stop copying `Expr`s and LogicalPlans so much during Common Subexpression Elimination

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

datafusion
datafusion copied to clipboard