[SPARK-48482][PYTHON][FOLLOWUP] dropDuplicates and dropDuplicatesWithinWatermark should accept named parameter

Open WweiL opened this issue 1 year ago • 2 comments

What changes were proposed in this pull request?

https://github.com/apache/spark/commit/560c08332b35941260169124b4f522bdc82b84d8 unintentionally broke dropDuplicates(subset=["col"]); this PR fixes that scenario.
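
For illustration only, here is a minimal sketch of how a method might accept positional column names, a single list, and the `subset=` keyword at the same time. It is not the actual PySpark code: `normalize_subset` is a hypothetical helper, and its rules (for example, rejecting a mix of positional and keyword columns) are assumptions. The general Python point is that a signature declaring only `*subset` has no keyword parameter named `subset`, so a call like `f(subset=["col"])` raises `TypeError`.

```python
from typing import List, Optional, Union


def normalize_subset(
    *args: Union[str, List[str]], subset: Optional[List[str]] = None
) -> Optional[List[str]]:
    """Hypothetical helper: resolve the dedup columns from varargs or `subset=`."""
    if args and subset is not None:
        # Assumption: mixing both styles is treated as an error.
        raise TypeError("pass columns positionally or via subset=, not both")
    if subset is not None:
        if isinstance(subset, str):
            raise TypeError("subset must be a list of column names")
        return list(subset)
    if not args:
        return None  # no subset given: deduplicate on all columns
    if len(args) == 1 and isinstance(args[0], (list, tuple)):
        return list(args[0])  # dropDuplicates(["c1", "c2"]) style
    if not all(isinstance(c, str) for c in args):
        raise TypeError("column names must be strings")
    return list(args)  # dropDuplicates("c1", "c2") style


# All three call styles resolve to the same list of columns.
print(normalize_subset("c1", "c2"))
print(normalize_subset(["c1", "c2"]))
print(normalize_subset(subset=["c1", "c2"]))
```

Keeping `subset` as a keyword-only parameter next to `*args` is one way to preserve the older named-argument call while still allowing varargs.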

Why are the changes needed?

Bug fix

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test

Was this patch authored or co-authored using generative AI tooling?

No

WweiL, Aug 21 '24 22:08

cc @HyukjinKwon @allisonwang-db @itholic PTAL!

WweiL, Aug 21 '24 22:08

My general comment here is that we should be opinionated and only have one way to perform certain operations. After this change, users now have two identical ways to drop duplicates:

  1. dropDuplicates("c1", "c2")
  2. dropDuplicates(["c1", "c2"])

Which one should users choose?

P.S. Pandas / Pandas on Spark uses the subset=["c1", "c2"] pattern (see: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html and https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.drop_duplicates.html). It could be confusing to make the PySpark API different.

allisonwang-db, Aug 26 '24 20:08