spark [SPARK-39925][SQL] Add array_sort(column, comparator) overload to DataFrame operations

Open brandondahler opened this issue 3 years ago • 1 comments

What changes were proposed in this pull request?

Adding a new array_sort overload to org.apache.spark.sql.functions that matches the new overload defined in SPARK-29020 and added via #25728.

Why are the changes needed?

Adds access to the new overload for users of the DataFrame API so that they don't need to use the expr escape hatch.

Does this PR introduce any user-facing change?

Yes. Users can now optionally pass a comparator function to array_sort, which makes it possible to sort in descending order and to sort items that aren't naturally orderable.

Example:

Old:

df.selectExpr("array_sort(a, (x, y) -> cardinality(x) - cardinality(y))");

Added:

df.select(array_sort(col("a"), (x, y) => size(x) - size(y)));
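To illustrate the descending-sort case mentioned above, here is a minimal sketch, not taken from the patch itself. It assumes a Spark build that includes this overload and a local SparkSession; the object name `ArraySortDescending`, the helper `sortedDescending`, and the sample data are made up for illustration. The comparator follows the usual convention of returning a negative, zero, or positive integer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_sort, col, when}

// Hypothetical example object; not part of the patch.
object ArraySortDescending {

  // Sorts a single sample array in descending order using the
  // array_sort(column, comparator) overload.
  def sortedDescending(spark: SparkSession): Seq[Int] = {
    import spark.implicits._

    // Sample data (assumed): one row holding an array<int> column "a".
    val df = Seq(Seq(2, 3, 1)).toDF("a")

    // Comparator returns -1 when x should come first, 1 when y should,
    // and 0 when they are equal. Putting the larger value first
    // produces a descending order.
    val sorted = df.select(
      array_sort(
        col("a"),
        (x, y) => when(x > y, -1).when(x < y, 1).otherwise(0)
      ).as("sorted")
    )

    sorted.head().getSeq[Int](0)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("array-sort-comparator")
      .getOrCreate()
    try {
      // For the sample input above, the elements come back as 3, 2, 1.
      println(sortedDescending(spark).mkString(", "))
    } finally {
      spark.stop()
    }
  }
}
```

The same comparator shape works for values without a natural ordering, as long as the lambda reduces each pair to an integer result, as the `size(x) - size(y)` example does.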

How was this patch tested?

Unit tests updated to validate that the overload matches the expression's behavior.

Aug 01 '22 12:08 brandondahler

Can one of the admins verify this patch?

Aug 01 '22 20:08 AmplabJenkins

LGTM. Are you also interested in adding this to SparkR and PySpark? We can do that in a separate PR.

I do think they should be added (I checked that they aren't already there), but I don't personally have availability to do so at this time.

Aug 18 '22 00:08 brandondahler

Oops, it slipped through my fingers. Mind retriggering https://github.com/brandondahler/spark/runs/7585897593?

Aug 18 '22 01:08 HyukjinKwon

cc @zero323, @itholic, @zhengruifeng FYI (since we need to add PySpark and SparkR ones)

Aug 18 '22 01:08 HyukjinKwon

Clicked "Re-run all jobs" on that linked run. Let me know if there was something else you meant for me to do.

Aug 18 '22 02:08 brandondahler

Since pyspark/sql/tests/test_functions.py checks the parity between PySpark and SQL, I think we may need to add array_sort to expected_missing_in_py.

otherwise, LGTM

Aug 18 '22 02:08 zhengruifeng

It seems like it has to be re-synced with upstream, to address black failures.

Aug 18 '22 18:08 zero323

Rebased on the latest master changes.

Aug 20 '22 23:08 brandondahler

Merged to master.

Aug 21 '22 09:08 HyukjinKwon
