spark [SPARK-37019][SQL] Add codegen support to array higher-order functions

[SPARK-37019][SQL] Add codegen support to array higher-order functions

Open Kimahriman opened this issue 3 years ago • 7 comments

What changes were proposed in this pull request?

This PR adds codegen support to array based higher order functions except ArraySort. This is my first time playing around with codegen, so definitely looking for any feedback.

A few notes:

Disabled subexpression elimination for lambda functions (this already was the case because it was CodegenFallback). I plan to explore supprting subexpression elimination inside lambda functions later on, as it will require special handling.
I set the AtomicReference for all lambda values as well in case a child expression reverts to interpreted evaluation for any reason (CodegenFallback or otherwise)

Why are the changes needed?

To improve performance of array higher-order function operations, letting the children be codegen'd and participate in WholeStageCodegen

Does this PR introduce any user-facing change?

No, only performance improvements.

How was this patch tested?

Existing unit tests, let me know if there's other codegen-specific unit tests I should add.

Nov 11 '21 13:11 Kimahriman

Can one of the admins verify this patch?

Nov 11 '21 14:11 AmplabJenkins

@viirya @HyukjinKwon @cloud-fan any thoughts or know who might have thoughts?

Jan 02 '22 14:01 Kimahriman

@Kimahriman just out of curiosity, how much did the performance improve?

Mar 15 '22 05:03 Tagar

It's hard to say because when I tested this out on my production jobs (actually still actively using it), I had several other changes too. I'm not sure if there are any benchmarks involving HOFs? Though it's highly dependent on what the lambda function is, and honestly that's one of the main benefits, the lambda functions themselves can be codegen'd instead of eval'd.

I also have a larger goal to support subexpression elimination inside lambda functions, because that's where I've found our biggest problem is. https://github.com/apache/spark/pull/34727 is also part of that goal.

Mar 15 '22 11:03 Kimahriman

There seems to be a lot of repetition. Wish it could be avoided somehow but can't help though (beside nit-picking).

Thanks for the review! I tried to get as much common code in the parent classes as I could, can take another pass to see if anything jumps out for deduping

Apr 04 '23 20:04 Kimahriman

More nitting 😉

Yeah I struggle with the right formatting for multiline things in scala, tried to update all the suggestions, thanks for the tips!

Jun 23 '23 00:06 Kimahriman

spark spark copied to clipboard

[SPARK-37019][SQL] Add codegen support to array higher-order functions

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

spark
spark copied to clipboard