[SPARK-39925][SQL] Add array_sort(column, comparator) overload to DataFrame operations

Open brandondahler opened this issue 3 years ago • 1 comments

What changes were proposed in this pull request?

Adding a new array_sort overload to org.apache.spark.sql.functions that matches the new overload defined in SPARK-29020 and added via #25728.

Why are the changes needed?

Adds access to the new overload for users of the DataFrame API so that they don't need to use the expr escape hatch.

Does this PR introduce any user-facing change?

Yes. Users can now optionally pass a comparator function to array_sort, which makes it possible to sort in descending order and to sort items that aren't naturally orderable.

Example:

Old:

df.selectExpr("array_sort(a, (x, y) -> cardinality(x) - cardinality(y))");

Added:

df.select(array_sort(col("a"), (x, y) => size(x) - size(y)));
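As a further illustration of the descending-sort case mentioned above, here is a minimal sketch. It assumes a Spark build that includes this change, and it assumes the comparator follows the same contract as the SQL version (a negative integer, zero, or a positive integer depending on whether the first element sorts before, equal to, or after the second). The object name, the column name `a`, and the sample data are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_sort, col, when}

// Hypothetical driver object, for illustration only.
object ArraySortDescendingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("array-sort-comparator-sketch")
      .getOrCreate()
    import spark.implicits._

    // One row holding an array<int> column named "a" (sample data).
    val df = Seq(Seq(3, 1, 2)).toDF("a")

    // Comparator for descending order: the larger element sorts first.
    // Returning -1 / 1 / 0 explicitly avoids the overflow that a plain
    // subtraction (y - x) could hit on extreme integer values.
    val sortedDesc = df.select(
      array_sort(
        col("a"),
        (x, y) => when(x > y, -1).when(x < y, 1).otherwise(0)
      ).as("a_desc")
    )

    sortedDesc.show()
    spark.stop()
  }
}
```

The comparator is built from Column expressions rather than Scala values, so it is evaluated by Spark like the SQL lambda in the "Old" form, not as a Scala closure over the row data.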

How was this patch tested?

Unit tests updated to validate that the overload matches the expression's behavior.

brandondahler avatar Aug 01 '22 12:08 brandondahler

Can one of the admins verify this patch?

AmplabJenkins avatar Aug 01 '22 20:08 AmplabJenkins

LGTM. Are you also interested in adding this in SparkR and PySpark? We can do that in a separate PR.

I do think they should be added (I checked that they aren't already there), but I don't personally have availability to do so at this time.

brandondahler avatar Aug 18 '22 00:08 brandondahler

Oops, it slipped through my fingers. Mind retriggering https://github.com/brandondahler/spark/runs/7585897593?

HyukjinKwon avatar Aug 18 '22 01:08 HyukjinKwon

cc @zero323, @itholic, @zhengruifeng FYI (since we need to add PySpark and SparkR ones)

HyukjinKwon avatar Aug 18 '22 01:08 HyukjinKwon

Clicked "re-run all jobs" on that linked run; let me know if there was something else you meant for me to do.

brandondahler avatar Aug 18 '22 02:08 brandondahler

Since pyspark/sql/tests/test_functions.py checks the parity between PySpark and SQL, I think we may need to add array_sort to expected_missing_in_py.

otherwise, LGTM

zhengruifeng avatar Aug 18 '22 02:08 zhengruifeng

It seems like it has to be re-synced with upstream, to address black failures.

zero323 avatar Aug 18 '22 18:08 zero323

Rebased on the latest master changes.

brandondahler avatar Aug 20 '22 23:08 brandondahler

Merged to master.

HyukjinKwon avatar Aug 21 '22 09:08 HyukjinKwon