[SPARK-39925][SQL] Add array_sort(column, comparator) overload to DataFrame operations
What changes were proposed in this pull request?
Adding a new array_sort overload to org.apache.spark.sql.functions that matches the new overload defined in SPARK-29020 and added via #25728.
Why are the changes needed?
Adds access to the new overload for users of the DataFrame API so that they don't need to use the expr escape hatch.
Does this PR introduce any user-facing change?
Yes. Users can now optionally provide a comparator function to array_sort, which enables sorting in descending order as well as sorting items that are not naturally orderable.
Example:
Old:
df.selectExpr("array_sort(a, (x, y) -> cardinality(x) - cardinality(y))");
Added:
df.select(array_sort(col("a"), (x, y) => size(x) - size(y)));
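The comparator semantics above can be illustrated in plain Python without a Spark session (this is a sketch of the behavior, not Spark code): the comparator returns a negative, zero, or positive value, exactly like the `cardinality(x) - cardinality(y)` expression in the example, and the array is sorted accordingly. `array_sort_with_comparator` is a hypothetical helper name used only for this illustration.

```python
from functools import cmp_to_key

def array_sort_with_comparator(arr, comparator):
    # Mirrors array_sort(a, (x, y) -> cardinality(x) - cardinality(y)):
    # comparator(x, y) < 0 places x before y, > 0 places y before x.
    return sorted(arr, key=cmp_to_key(comparator))

data = [[1, 2, 3], [9], [4, 5]]
# Sort sub-arrays by length, ascending (the example's size-based comparison).
result = array_sort_with_comparator(data, lambda x, y: len(x) - len(y))
# result == [[9], [4, 5], [1, 2, 3]]
```

Passing `lambda x, y: len(y) - len(x)` instead would yield a descending sort, which is the kind of ordering the plain single-argument array_sort cannot express.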
How was this patch tested?
Unit tests updated to validate that the overload matches the expression's behavior.
Can one of the admins verify this patch?
LGTM. are you also interested in adding this in SparkR and PySpark? We can do that in a separate PR.
I do think they should be added (I checked that they aren't already there), but I don't personally have availability to do so at this time.
Oops, it slipped through my fingers. Mind retriggering https://github.com/brandondahler/spark/runs/7585897593?
cc @zero323, @itholic, @zhengruifeng FYI (since we need to add PySpark and SparkR ones)
Clicked re-run all jobs on that linked run, let me know if there was something else you meant for me to do
pyspark/sql/tests/test_functions.py checks parity between PySpark and SQL functions, so I think we may need to add array_sort to expected_missing_in_py
otherwise, LGTM
It seems like it has to be re-synced with upstream to address the black failures.
Rebased on latest master changes.
Merged to master.