spark
spark copied to clipboard
[SPARK-37019][SQL] Add codegen support to array higher-order functions
What changes were proposed in this pull request?
This PR adds codegen support to array based higher order functions except ArraySort. This is my first time playing around with codegen, so definitely looking for any feedback.
A few notes:
- Disabled subexpression elimination for lambda functions (this already was the case because it was CodegenFallback). I plan to explore supprting subexpression elimination inside lambda functions later on, as it will require special handling.
- I set the AtomicReference for all lambda values as well in case a child expression reverts to interpreted evaluation for any reason (CodegenFallback or otherwise)
Why are the changes needed?
To improve performance of array higher-order function operations, letting the children be codegen'd and participate in WholeStageCodegen
Does this PR introduce any user-facing change?
No, only performance improvements.
How was this patch tested?
Existing unit tests, let me know if there's other codegen-specific unit tests I should add.
Can one of the admins verify this patch?
@viirya @HyukjinKwon @cloud-fan any thoughts or know who might have thoughts?
@Kimahriman just out of curiosity, how much did the performance improve?
It's hard to say because when I tested this out on my production jobs (actually still actively using it), I had several other changes too. I'm not sure if there are any benchmarks involving HOFs? Though it's highly dependent on what the lambda function is, and honestly that's one of the main benefits, the lambda functions themselves can be codegen'd instead of eval'd.
I also have a larger goal to support subexpression elimination inside lambda functions, because that's where I've found our biggest problem is. https://github.com/apache/spark/pull/34727 is also part of that goal.
There seems to be a lot of repetition. Wish it could be avoided somehow but can't help though (beside nit-picking).
Thanks for the review! I tried to get as much common code in the parent classes as I could, can take another pass to see if anything jumps out for deduping
More nitting 😉
Yeah I struggle with the right formatting for multiline things in scala, tried to update all the suggestions, thanks for the tips!