datafusion
datafusion copied to clipboard
Stop copying `Expr`s and LogicalPlans so much during Common Subexpression Elimination
Is your feature request related to a problem or challenge?
The common subexpression elimination pass copies many Exprs around. You can see this performance impact this has by looking at the screenshot from https://github.com/apache/arrow-datafusion/issues/9637#issue-2189931564
While we will fix the copying plan problem in https://github.com/apache/arrow-datafusion/issues/9637 I think there is more work to be done in the common sub expression code itself, which copies a significant number of Exprs and Strings around
Describe the solution you'd like
Figure out how to avoid cloneing Exprs in the https://github.com/apache/arrow-datafusion/blob/main/datafusion/optimizer/src/common_subexpr_eliminate.rs
- The
Exprs themselves - Avoid creating Strings for
Identifier
We should see a significant improvement in the sql_planner benchmarks:
cargo bench --bench sql_planner
Describe alternatives you've considered
No response
Additional context
I noticed this while reviewing https://github.com/apache/arrow-datafusion/pull/9871
I'm happy to take this issue, but I just noticed that there is a draft PR already: https://github.com/apache/arrow-datafusion/pull/10067. @alamb would you like to finish that PR?
I'm happy to take this issue, but I just noticed that there is a draft PR already: #10067. @alamb would you like to finish that PR?
I am on vacation this week (well, partial vacation as it is school break week) so I likely won't have time to work on #10067 until next week. If you have time to do so that would be great. Otherwise I will pick it up the following week
CSE is one of the passes that shows up in the planning benchmarks when I last profiled it, so I think this particular change will be very impactful to planning performance
(Thank you @peter-toth 🙏 -- it is much fun working with you)
Ok, got it. I'm happy to look into it, but not sure I can do it before Monday. Enjoy your partial vacation! 😄
I've started working on this, but it will surely take some time...
@alamb, just a quick update that I'm still working on this. Trying out different ideas... Will try to open a PR this week or next.
Hey I am interested in helping with this. Maybe @peter-toth and I can divide our efforts here? Let me know what you've worked on so far, and I can figure out how to help.
One thing I see in particular that's not directly cloning the LogicalPlans and Exprs, but may be putting pressure on the global allocator, is the Identifiers in the ExprSet which are represented as Strings produced by Displaying the Exprs.
Assuming that the new zero-copy implementation will continue using the ExprSet, maybe I could look into efficiently hashing subexpressions to produce numeric identifiers that are easier to copy.
Actually it looks like ExprSet needs to be completely redesigned because it relies on cloning the Exprs.
I've started a branch for my work. Here is a basic commit that simply swaps the existing code over to using the rewrite API.
I've just opened a PR https://github.com/apache/datafusion/pull/10396. I wanted to continue @alamb's https://github.com/apache/datafusion/pull/10067 but then I ran into a few different issues with the rule so I think we should address those first.
2 notes:
- https://github.com/apache/datafusion/pull/10067 is orthogonal to my PR it would be great if someone could continue that effort as unfortunately I don't have much time to work on DataFusion lately.
- the issue of
Stringidentifiers are explained in my PR. I have a solution for the issue, but https://github.com/apache/datafusion/pull/10396 is already huge and the implementation will be a bit tricky...
https://github.com/apache/datafusion/pull/10067 is orthogonal to my PR it would be great if someone could continue that effort as unfortunately I don't have much time to work on DataFusion lately.
I am happy to do this. Thank you @peter-toth
@peter-toth
the issue of String identifiers are explained in my PR. I have a solution for the issue, but https://github.com/apache/datafusion/pull/10396 is already huge and the implementation will be a bit tricky...
~I am thinking a Hash implementation for Expr would fix this problem, and is something I plan to work on soon. Would be interested to know what approach you think would work best.~
~Having a Hash impl would also allow us to refactor other optimization rules which make heavy use of display_name to generate HashMap keys for Exprs~
UPDATE: It looks like Expr already derives Hash. Is there a reason we're not using that instead of string keys?
UPDATE: It looks like Expr already derives Hash. Is there a reason we're not using that instead of string keys?
I believe CSE may predate the Hash impl for Expr
UPDATE: It looks like Expr already derives Hash. Is there a reason we're not using that instead of string keys?
I forgot to comment on this thread that my detailed answer is on the other: https://github.com/apache/datafusion/issues/10426#issuecomment-2105664520
BTW I am purposely procrastinating on this issue until https://github.com/apache/datafusion/issues/10426 / https://github.com/apache/datafusion/pull/10473 are done to avoid conflicts
Though I may do some preliminary work in smaller PRs
@alamb, I think you shouldn't wait for my https://github.com/apache/datafusion/pull/10473, as there is still some preliminary work required I need to finish (https://github.com/apache/datafusion/pull/10543) and I'm still not sure about a few details of https://github.com/apache/datafusion/pull/10473.
Feel free to proceed with refactoring the rule in your https://github.com/apache/datafusion/pull/10067 to use rewrite() and I will rebase my PR on top of yours and resolve the possible conflicts.
Thanks @peter-toth