spark icon indicating copy to clipboard operation
spark copied to clipboard

[SPARK-40512][PS][INFRA] Upgrade pandas to 1.5.0

Open itholic opened this issue 3 years ago • 3 comments

What changes were proposed in this pull request?

This PR proposes to upgrade pandas version to 1.5.0 since the new pandas version is released.

Please refer to What's new in 1.5.0 for more detail.

Why are the changes needed?

Pandas API on Spark should follow the latest pandas.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

The existing tests should pass

itholic avatar Sep 21 '22 06:09 itholic

what about also changing dev/requirements.txt

zhengruifeng avatar Sep 21 '22 06:09 zhengruifeng

what about also changing dev/requirements.txt

Maybe we don't need to set an upper bound for pandas in dev/requirements.txt because pip install -r dev/requirements.txt always install the latest pandas when the version is not specified??

itholic avatar Sep 21 '22 06:09 itholic

I think the upper bound is there to avoid pulling in later pandas versions that may not be tested/supported yet? Indeed, it looks like tests don't quite pass with 1.5.0 yet

srowen avatar Sep 21 '22 13:09 srowen

As Sean mentioned, it seems to fail. Do we need to fix the failures or to set the upperbound (< 1.5)?

cc @HyukjinKwon , too.

dongjoon-hyun avatar Sep 21 '22 16:09 dongjoon-hyun

Yeah, I think we should make the test pass since pandas-on-Spark should follow the behavior of latest pandas.

Let me take a look. Thanks!

itholic avatar Sep 22 '22 00:09 itholic

Generally speaking almost each pandas upgrade will cause PS CI to fail, for the stability of CI, our strategy was using <= to pin specific version (it's equal to == in our case), and upgrade to specific version manually.

@itholic FYI, these two testcase are failed due to:

  • python/pyspark/pandas/tests/indexes/test_category.py.test_append: https://github.com/pandas-dev/pandas/commit/c7b470c3e13f99ce990e23b2a311d3a2c633499c
  • python/pyspark/pandas/tests/indexes/test_base.py.test_to_frame: https://github.com/pandas-dev/pandas/commit/7dbfe9f95a31aa01dc8288350e785f04a2abf6f0

not having a more deep look, you could do more invistigation and fix or fix test, just like SPARK-38819.

Yikun avatar Sep 22 '22 01:09 Yikun

what about also changing dev/requirements.txt

Maybe we don't need to set an upper bound for pandas in dev/requirements.txt because pip install -r dev/requirements.txt always install the latest pandas when the version is not specified??

I was hit by the conflicts several times caused by the version differences between CI and dev/requirements.txt, but in the mean time, it keep us aware of the dependency updates, so fine to let it alone here.

zhengruifeng avatar Sep 22 '22 01:09 zhengruifeng

Thanks for the comments!

Let me investigate the test failure and make an umbrella ticket if there are many failures.

If there is few failures, let me handle them in this PR at once.

itholic avatar Sep 22 '22 01:09 itholic

Test passed.

itholic avatar Oct 26 '22 11:10 itholic

Thanks for the review, @HyukjinKwon and @zhengruifeng !

Just addressed the comments

itholic avatar Oct 27 '22 05:10 itholic

CI passed!

itholic avatar Oct 28 '22 01:10 itholic

Merged into master, thank you @itholic for doing this!

zhengruifeng avatar Oct 28 '22 03:10 zhengruifeng

Thank you all!

dongjoon-hyun avatar Oct 28 '22 05:10 dongjoon-hyun