[SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0

Open itholic opened this issue 1 year ago • 1 comments

What changes were proposed in this pull request?

This PR proposes to upgrade Pandas to 2.2.0.

See What's new in 2.2.0 (January 19, 2024)

Why are the changes needed?

Pandas 2.2.0 is released, and we should support the latest Pandas.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

The existing CI should pass.

Was this patch authored or co-authored using generative AI tooling?

No.

itholic avatar Jan 25 '24 08:01 itholic

Yeah, Pandas 2.2.0 fixes many bugs, and that brings a couple of behavior changes 😢

Let me fix them. Thanks for confirming!

itholic avatar Feb 14 '24 01:02 itholic

I believe this PR now addresses all of the Pandas 2.2.0 behavior changes. cc @HyukjinKwon @dongjoon-hyun FYI

itholic avatar Feb 20 '24 06:02 itholic

  • Is the change of python/pyspark/pandas/resample.py safe?

It breaks the previous behavior, so if we plan to ship another minor release (Spark 3.6.0), this should not be included in it.

  • What happens when users decide to use an old Pandas (<= 2.2.0)?

Using deprecated aliases (Y, M, H, T, S) wouldn't work.
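
For context, Pandas 2.2.0 deprecates these aliases in favor of new spellings (YE, ME, h, min, s). A minimal sketch of that mapping in plain Python; the helper name `to_new_alias` is hypothetical and is not part of PySpark or Pandas:

```python
import re

# Deprecated frequency aliases and the spellings Pandas 2.2.0 prefers.
DEPRECATED_TO_NEW = {"Y": "YE", "M": "ME", "H": "h", "T": "min", "S": "s"}

# An optional integer multiple followed by the alias letters, e.g. "3M".
_RULE = re.compile(r"(\d*)([A-Za-z]+)")


def to_new_alias(rule: str) -> str:
    """Hypothetical helper: rewrite a rule such as "3M" as "3ME".

    Rules that do not use one of the five deprecated aliases
    (for example "MS" or "ME") are returned unchanged.
    """
    match = _RULE.fullmatch(rule)
    if match is None:
        return rule
    count, unit = match.groups()
    return count + DEPRECATED_TO_NEW.get(unit, unit)


print(to_new_alias("3M"))  # 3ME
print(to_new_alias("T"))   # min
print(to_new_alias("MS"))  # MS
```

Only a whole-alias match is rewritten, so "MS" (month start) is not mistaken for "M" followed by "S".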

itholic avatar Feb 20 '24 06:02 itholic

We should not bring any breaking change. Let me address them.

Thanks, @dongjoon-hyun for double checking.

itholic avatar Feb 20 '24 06:02 itholic

Oh, wait.

I just remembered that we simply follow the Pandas behavior and list the breaking changes separately in the release notes.

- In Spark 4.0, it is recommended to use Pandas version 2.0.0 or above with PySpark for optimal compatibility.
- In Spark 4.0, the minimum supported version for Pandas has been raised from 1.0.5 to 1.4.4 in PySpark.
...
- In Spark 4.0, when applying astype to a decimal type object, the existing missing value is changed to True instead of False from Pandas API on Spark.
- In Spark 4.0, pyspark.testing.assertPandasOnSparkEqual has been removed from Pandas API on Spark, use pyspark.pandas.testing.assert_frame_equal instead.

So maybe we should add a release note instead of reverting the breaking changes here? @dongjoon-hyun @HyukjinKwon

itholic avatar Feb 20 '24 07:02 itholic

Just updated resample to work in old Pandas as well.

I think we can just mark it as deprecated for now to avoid breaking existing pipelines. (Also updated the release note.)
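
A rough sketch of the deprecate-rather-than-break idea described here: keep accepting the old alias, emit a warning, and only translate it when the installed Pandas understands the new spelling. This is an illustration under assumptions, not the code merged in this PR; `resolve_rule` and its `pandas_version` tuple parameter are hypothetical.

```python
import re
import warnings

# Deprecated aliases and their Pandas 2.2.0 replacements.
_DEPRECATED = {"Y": "YE", "M": "ME", "H": "h", "T": "min", "S": "s"}
_RULE = re.compile(r"(\d*)([A-Za-z]+)")


def resolve_rule(rule, pandas_version):
    """Hypothetical shim: return the rule string to hand to Pandas.

    `pandas_version` is a tuple such as (2, 2, 0). A deprecated alias
    always triggers a FutureWarning. It is rewritten to the new spelling
    only on Pandas >= 2.2; on older versions it is passed through as-is.
    """
    match = _RULE.fullmatch(rule)
    if match is None:
        return rule
    count, unit = match.groups()
    if unit not in _DEPRECATED:
        return rule
    new_rule = count + _DEPRECATED[unit]
    warnings.warn(
        f"'{unit}' is deprecated, use '{_DEPRECATED[unit]}' instead.",
        FutureWarning,
        stacklevel=2,
    )
    return new_rule if pandas_version >= (2, 2) else rule


print(resolve_rule("3M", (2, 2, 0)))  # 3ME (with a FutureWarning)
print(resolve_rule("3M", (1, 5, 3)))  # 3M (with a FutureWarning)
```

Warning instead of raising keeps existing pipelines running on either Pandas version while still telling users to migrate.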

itholic avatar Feb 20 '24 07:02 itholic

Merged to master.

Thank you again, @itholic and @HyukjinKwon.

dongjoon-hyun avatar Feb 20 '24 15:02 dongjoon-hyun

Great work @itholic Thank you :)

bjornjorgensen avatar Feb 20 '24 20:02 bjornjorgensen

Thank you so much all for the review!

itholic avatar Feb 21 '24 00:02 itholic