[SPARK-48482][PYTHON][FOLLOWUP] dropDuplicates and dropDuplicatesWithinWatermark should accept named parameter
What changes were proposed in this pull request?
https://github.com/apache/spark/commit/560c08332b35941260169124b4f522bdc82b84d8 unintentionally broke the dropDuplicates(subset=["col"]) call style. This patch fixes that scenario.
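To illustrate the kind of dispatch such an API needs, here is a minimal pure-Python sketch (not Spark's actual implementation; the helper name normalize_subset is hypothetical) of normalizing the three call styles a dropDuplicates-like method must accept: varargs of column names, a single list of names, and the named subset= parameter:

```python
from typing import List, Optional, Sequence, Union

def normalize_subset(
    *cols: Union[str, Sequence[str]],
    subset: Optional[Sequence[str]] = None,
) -> Optional[List[str]]:
    """Normalize dropDuplicates-style column arguments into a plain list
    of column names, or None meaning "consider all columns".

    Accepted call styles (mirroring the PySpark API this PR fixes):
        normalize_subset("c1", "c2")
        normalize_subset(["c1", "c2"])
        normalize_subset(subset=["c1", "c2"])
    """
    if subset is not None:
        if cols:
            raise TypeError("Cannot pass both positional columns and subset=")
        cols = (subset,)
    if not cols:
        return None
    # A single non-string argument is treated as the full list of columns.
    if len(cols) == 1 and not isinstance(cols[0], str):
        return list(cols[0])
    # Otherwise every positional argument must be a column name string.
    if not all(isinstance(c, str) for c in cols):
        raise TypeError("Columns must be strings or a single list of strings")
    return list(cols)
```

The bug fixed here corresponds to the subset= branch being skipped, so the keyword form silently behaved like a call with no columns at all.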
Why are the changes needed?
Bug fix
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit test
Was this patch authored or co-authored using generative AI tooling?
No
cc @HyukjinKwon @allisonwang-db @itholic PTAL!
My general comment here is that we should be opinionated and only have one way to perform certain operations. After this change, users now have two identical ways to drop duplicates:
dropDuplicates("c1", "c2")
dropDuplicates(["c1", "c2"])
Which one should users choose?
P.S. Pandas / Pandas on Spark uses the subset=["c1", "c2"] pattern (see: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html and https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.drop_duplicates.html). It could be confusing to make the PySpark API different.