spark [SPARK-48482][PYTHON][FOLLOWUP] dropDuplicates and dropDuplicatesWithinWatermark should accept named parameter

What changes were proposed in this pull request?

https://github.com/apache/spark/commit/560c08332b35941260169124b4f522bdc82b84d8 unintentionally broke dropDuplicates(subset=["col"]), so passing the columns through the named parameter no longer worked. This PR fixes that scenario.
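For illustration only, the call shapes involved can be sketched in plain Python. This is not the code in this patch: normalize_subset is a hypothetical helper, and the choice to reject mixing positional columns with subset= is an assumption made for the sketch.

```python
from typing import List, Optional, Union


def normalize_subset(
    *cols: Union[str, List[str]],
    subset: Optional[List[str]] = None,
) -> Optional[List[str]]:
    """Hypothetical helper: resolve the columns to deduplicate on.

    Returns None when no columns are given (meaning "use all columns").
    """
    if subset is not None:
        # Named form: dropDuplicates(subset=["c1", "c2"])
        if cols:
            # Assumption for this sketch: mixing both styles is an error.
            raise TypeError("pass columns positionally or via subset=, not both")
        if not isinstance(subset, (list, tuple)):
            raise TypeError("subset must be a list or tuple of column names")
        return list(subset)
    if not cols:
        return None
    # Single list argument: dropDuplicates(["c1", "c2"])
    if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
        return list(cols[0])
    # Varargs of names: dropDuplicates("c1", "c2")
    if all(isinstance(c, str) for c in cols):
        return list(cols)
    raise TypeError("column names must be strings")
```

With this shape, normalize_subset(subset=["col"]), normalize_subset("col") and normalize_subset(["col"]) all resolve to the same column list, which is the behavior the named-parameter call is expected to share with the positional forms.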

Why are the changes needed?

Bug fix

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test

Was this patch authored or co-authored using generative AI tooling?

No

Aug 21 '24 22:08 WweiL

cc @HyukjinKwon @allisonwang-db @itholic PTAL!

Aug 21 '24 22:08 WweiL

My general comment here is that we should be opinionated and only have one way to perform certain operations. After this change, users now have two identical ways to drop duplicates:

dropDuplicates("c1", "c2")
dropDuplicates(["c1", "c2"])

Which one should users choose?

P.S. pandas and pandas on Spark use the subset=["c1", "c2"] pattern (see: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html and https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.drop_duplicates.html). It could be confusing to make the PySpark API different.

Aug 26 '24 20:08 allisonwang-db