spark icon indicating copy to clipboard operation
spark copied to clipboard

[SPARK-37019][SQL] Add codegen support to array higher-order functions

Open Kimahriman opened this issue 3 years ago • 7 comments

What changes were proposed in this pull request?

This PR adds codegen support to array based higher order functions except ArraySort. This is my first time playing around with codegen, so definitely looking for any feedback.

A few notes:

  • Disabled subexpression elimination for lambda functions (this already was the case because it was CodegenFallback). I plan to explore supprting subexpression elimination inside lambda functions later on, as it will require special handling.
  • I set the AtomicReference for all lambda values as well in case a child expression reverts to interpreted evaluation for any reason (CodegenFallback or otherwise)

Why are the changes needed?

To improve performance of array higher-order function operations, letting the children be codegen'd and participate in WholeStageCodegen

Does this PR introduce any user-facing change?

No, only performance improvements.

How was this patch tested?

Existing unit tests, let me know if there's other codegen-specific unit tests I should add.

Kimahriman avatar Nov 11 '21 13:11 Kimahriman

Can one of the admins verify this patch?

AmplabJenkins avatar Nov 11 '21 14:11 AmplabJenkins

@viirya @HyukjinKwon @cloud-fan any thoughts or know who might have thoughts?

Kimahriman avatar Jan 02 '22 14:01 Kimahriman

@Kimahriman just out of curiosity, how much did the performance improve?

Tagar avatar Mar 15 '22 05:03 Tagar

It's hard to say because when I tested this out on my production jobs (actually still actively using it), I had several other changes too. I'm not sure if there are any benchmarks involving HOFs? Though it's highly dependent on what the lambda function is, and honestly that's one of the main benefits, the lambda functions themselves can be codegen'd instead of eval'd.

I also have a larger goal to support subexpression elimination inside lambda functions, because that's where I've found our biggest problem is. https://github.com/apache/spark/pull/34727 is also part of that goal.

Kimahriman avatar Mar 15 '22 11:03 Kimahriman

There seems to be a lot of repetition. Wish it could be avoided somehow but can't help though (beside nit-picking).

Thanks for the review! I tried to get as much common code in the parent classes as I could, can take another pass to see if anything jumps out for deduping

Kimahriman avatar Apr 04 '23 20:04 Kimahriman

More nitting 😉

Yeah I struggle with the right formatting for multiline things in scala, tried to update all the suggestions, thanks for the tips!

Kimahriman avatar Jun 23 '23 00:06 Kimahriman