datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Stop copying `Expr`s and LogicalPlans so much during Common Subexpression Elimination

Open alamb opened this issue 1 year ago • 16 comments

Is your feature request related to a problem or challenge?

The common subexpression elimination pass copies many Exprs around. You can see this performance impact this has by looking at the screenshot from https://github.com/apache/arrow-datafusion/issues/9637#issue-2189931564

While we will fix the copying plan problem in https://github.com/apache/arrow-datafusion/issues/9637 I think there is more work to be done in the common sub expression code itself, which copies a significant number of Exprs and Strings around

312892336-1c8214cc-09ec-41b2-aa5d-f52a5dfa4226

Describe the solution you'd like

Figure out how to avoid cloneing Exprs in the https://github.com/apache/arrow-datafusion/blob/main/datafusion/optimizer/src/common_subexpr_eliminate.rs

  1. The Exprs themselves
  2. Avoid creating Strings for Identifier

We should see a significant improvement in the sql_planner benchmarks:

cargo bench --bench sql_planner

Describe alternatives you've considered

No response

Additional context

I noticed this while reviewing https://github.com/apache/arrow-datafusion/pull/9871

alamb avatar Mar 30 '24 12:03 alamb

I'm happy to take this issue, but I just noticed that there is a draft PR already: https://github.com/apache/arrow-datafusion/pull/10067. @alamb would you like to finish that PR?

peter-toth avatar Apr 18 '24 10:04 peter-toth

I'm happy to take this issue, but I just noticed that there is a draft PR already: #10067. @alamb would you like to finish that PR?

I am on vacation this week (well, partial vacation as it is school break week) so I likely won't have time to work on #10067 until next week. If you have time to do so that would be great. Otherwise I will pick it up the following week

CSE is one of the passes that shows up in the planning benchmarks when I last profiled it, so I think this particular change will be very impactful to planning performance

alamb avatar Apr 19 '24 11:04 alamb

(Thank you @peter-toth 🙏 -- it is much fun working with you)

alamb avatar Apr 19 '24 11:04 alamb

Ok, got it. I'm happy to look into it, but not sure I can do it before Monday. Enjoy your partial vacation! 😄

peter-toth avatar Apr 19 '24 11:04 peter-toth

I've started working on this, but it will surely take some time...

peter-toth avatar Apr 22 '24 17:04 peter-toth

@alamb, just a quick update that I'm still working on this. Trying out different ideas... Will try to open a PR this week or next.

peter-toth avatar May 01 '24 17:05 peter-toth

Hey I am interested in helping with this. Maybe @peter-toth and I can divide our efforts here? Let me know what you've worked on so far, and I can figure out how to help.

One thing I see in particular that's not directly cloning the LogicalPlans and Exprs, but may be putting pressure on the global allocator, is the Identifiers in the ExprSet which are represented as Strings produced by Displaying the Exprs.

Assuming that the new zero-copy implementation will continue using the ExprSet, maybe I could look into efficiently hashing subexpressions to produce numeric identifiers that are easier to copy.

erratic-pattern avatar May 06 '24 18:05 erratic-pattern

Actually it looks like ExprSet needs to be completely redesigned because it relies on cloning the Exprs.

I've started a branch for my work. Here is a basic commit that simply swaps the existing code over to using the rewrite API.

erratic-pattern avatar May 06 '24 19:05 erratic-pattern

I've just opened a PR https://github.com/apache/datafusion/pull/10396. I wanted to continue @alamb's https://github.com/apache/datafusion/pull/10067 but then I ran into a few different issues with the rule so I think we should address those first.

2 notes:

  • https://github.com/apache/datafusion/pull/10067 is orthogonal to my PR it would be great if someone could continue that effort as unfortunately I don't have much time to work on DataFusion lately.
  • the issue of String identifiers are explained in my PR. I have a solution for the issue, but https://github.com/apache/datafusion/pull/10396 is already huge and the implementation will be a bit tricky...

peter-toth avatar May 06 '24 19:05 peter-toth

https://github.com/apache/datafusion/pull/10067 is orthogonal to my PR it would be great if someone could continue that effort as unfortunately I don't have much time to work on DataFusion lately.

I am happy to do this. Thank you @peter-toth

alamb avatar May 07 '24 19:05 alamb

@peter-toth

the issue of String identifiers are explained in my PR. I have a solution for the issue, but https://github.com/apache/datafusion/pull/10396 is already huge and the implementation will be a bit tricky...

~I am thinking a Hash implementation for Expr would fix this problem, and is something I plan to work on soon. Would be interested to know what approach you think would work best.~

~Having a Hash impl would also allow us to refactor other optimization rules which make heavy use of display_name to generate HashMap keys for Exprs~

UPDATE: It looks like Expr already derives Hash. Is there a reason we're not using that instead of string keys?

erratic-pattern avatar May 10 '24 21:05 erratic-pattern

UPDATE: It looks like Expr already derives Hash. Is there a reason we're not using that instead of string keys?

I believe CSE may predate the Hash impl for Expr

alamb avatar May 11 '24 11:05 alamb

UPDATE: It looks like Expr already derives Hash. Is there a reason we're not using that instead of string keys?

I forgot to comment on this thread that my detailed answer is on the other: https://github.com/apache/datafusion/issues/10426#issuecomment-2105664520

peter-toth avatar May 11 '24 11:05 peter-toth

BTW I am purposely procrastinating on this issue until https://github.com/apache/datafusion/issues/10426 / https://github.com/apache/datafusion/pull/10473 are done to avoid conflicts

Though I may do some preliminary work in smaller PRs

alamb avatar May 14 '24 13:05 alamb

@alamb, I think you shouldn't wait for my https://github.com/apache/datafusion/pull/10473, as there is still some preliminary work required I need to finish (https://github.com/apache/datafusion/pull/10543) and I'm still not sure about a few details of https://github.com/apache/datafusion/pull/10473. Feel free to proceed with refactoring the rule in your https://github.com/apache/datafusion/pull/10067 to use rewrite() and I will rebase my PR on top of yours and resolve the possible conflicts.

peter-toth avatar May 17 '24 10:05 peter-toth

Thanks @peter-toth

alamb avatar May 17 '24 12:05 alamb