spark
spark copied to clipboard
[SPARK-40512][PS][INFRA] Upgrade pandas to 1.5.0
What changes were proposed in this pull request?
This PR proposes to upgrade pandas version to 1.5.0 since the new pandas version is released.
Please refer to What's new in 1.5.0 for more detail.
Why are the changes needed?
Pandas API on Spark should follow the latest pandas.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
The existing tests should pass
what about also changing dev/requirements.txt
what about also changing
dev/requirements.txt
Maybe we don't need to set an upper bound for pandas in dev/requirements.txt because pip install -r dev/requirements.txt always install the latest pandas when the version is not specified??
I think the upper bound is there to avoid pulling in later pandas versions that may not be tested/supported yet? Indeed, it looks like tests don't quite pass with 1.5.0 yet
As Sean mentioned, it seems to fail. Do we need to fix the failures or to set the upperbound (< 1.5)?
cc @HyukjinKwon , too.
Yeah, I think we should make the test pass since pandas-on-Spark should follow the behavior of latest pandas.
Let me take a look. Thanks!
Generally speaking almost each pandas upgrade will cause PS CI to fail, for the stability of CI, our strategy was using <= to pin specific version (it's equal to == in our case), and upgrade to specific version manually.
@itholic FYI, these two testcase are failed due to:
- python/pyspark/pandas/tests/indexes/test_category.py.test_append: https://github.com/pandas-dev/pandas/commit/c7b470c3e13f99ce990e23b2a311d3a2c633499c
- python/pyspark/pandas/tests/indexes/test_base.py.test_to_frame: https://github.com/pandas-dev/pandas/commit/7dbfe9f95a31aa01dc8288350e785f04a2abf6f0
not having a more deep look, you could do more invistigation and fix or fix test, just like SPARK-38819.
what about also changing
dev/requirements.txtMaybe we don't need to set an upper bound for pandas in
dev/requirements.txtbecausepip install -r dev/requirements.txtalways install the latest pandas when the version is not specified??
I was hit by the conflicts several times caused by the version differences between CI and dev/requirements.txt, but in the mean time, it keep us aware of the dependency updates, so fine to let it alone here.
Thanks for the comments!
Let me investigate the test failure and make an umbrella ticket if there are many failures.
If there is few failures, let me handle them in this PR at once.
Test passed.
Thanks for the review, @HyukjinKwon and @zhengruifeng !
Just addressed the comments
CI passed!
Merged into master, thank you @itholic for doing this!
Thank you all!