spark icon indicating copy to clipboard operation
spark copied to clipboard

[SPARK-38946][PYTHON][PS] Generates a new dataframe instead of operating inplace in setitem

Open Yikun opened this issue 3 years ago • 18 comments

What changes were proposed in this pull request?

Generates a new dataframe instead of operating inplace in setitem

Why are the changes needed?

Make CI passed in with pandas 1.4.3

Since pandas 1.4.0 https://github.com/pandas-dev/pandas/commit/03dd698bc1e84c35aba8b51bdd45c472860b9ec3 , dataframe.setitem should always make a copy and never write into the existing array.

Does this PR introduce any user-facing change?

No

How was this patch tested?

CI test with current pandas (1.3.x) and latest pandas 1.4.2, 1.4.3

Yikun avatar Apr 26 '22 07:04 Yikun

This is still a WIP PR now, only make ci passed with panda 1.4.x.

~- If we want to follow pandas behavior, we might want to revert https://github.com/databricks/koalas/pull/1592~ ~- If we only want to still keep this, we might only need this test fix PR.~

Yikun avatar Apr 26 '22 07:04 Yikun

I haven't got enough info to see why we added https://github.com/databricks/koalas/pull/1592 before (performance or follow pandas old behavior).

cc @HyukjinKwon @ueshin Maybe you could give some input or suggestion in here.

Yikun avatar Apr 26 '22 07:04 Yikun

I think this needs @ueshin's look

HyukjinKwon avatar Apr 26 '22 10:04 HyukjinKwon

I haven't got enough info to see why we added https://github.com/databricks/koalas/pull/1592 before (performance or follow pandas old behavior).

That's to follow the pandas behavior at that time and seems like it's still necessary.

The examples in the description of https://github.com/databricks/koalas/pull/1592 are still valid. E.g.,:

>>> pd.__version__
'1.4.1'
>>> pdf = pd.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6], "y": [np.nan, 2, 3, 4, np.nan, 6]})
>>> pser = pdf.x
>>> pser.fillna(0, inplace=True)
>>> pser
0    0.0
1    2.0
2    3.0
3    4.0
4    0.0
5    6.0
Name: x, dtype: float64
>>> pdf
     x    y
0  0.0  NaN
1  2.0  2.0
2  3.0  3.0
3  4.0  4.0
4  0.0  NaN
5  6.0  6.0

If we revert the change, it won't work anymore.

Have you tried to revert the changes except for the tests and run the whole tests? I guess some tests fail, especially some in test_series.py.

If there are behavior changes in pandas, we should fix our behavior.

ueshin avatar Apr 26 '22 22:04 ueshin

@ueshin OK, thanks for your info, I will take an another look soon.

Yikun avatar Apr 27 '22 13:04 Yikun

Just confirming, any update on this ?

itholic avatar May 24 '22 03:05 itholic

@itholic We might want to do some special process when calling _update_internal_frame for pandas 1.4.x. Will update today.

Yikun avatar May 24 '22 04:05 Yikun

Test with Panda 1.4.2: https://github.com/Yikun/spark/runs/6574570761?check_suite_focus=true#step:8:20

All UT passed, but doctest failed due to other unrelated failed.

This PR is ready for review.

@HyukjinKwon @xinrong-databricks @itholic

Yikun avatar May 24 '22 13:05 Yikun

The renaming is so much better, thanks Yikun! LGTM.

xinrong-meng avatar May 27 '22 16:05 xinrong-meng

Let me defer to @ueshin

HyukjinKwon avatar May 30 '22 02:05 HyukjinKwon

The condition to generate a new dataframe seems a bit more complex?

@ueshin Yes, it is really yes from your example, the setitem has behavior influence on several functions and change the final or part of behaviors for these functions (but pandas not mention this behavior change). Anyway, I will raise a issue on Pandas comunity to get the detail attitude for these beahavior changes.

Yikun avatar Jun 01 '22 03:06 Yikun

https://github.com/pandas-dev/pandas/issues/47188

Yikun avatar Jun 01 '22 08:06 Yikun

According to https://github.com/pandas-dev/pandas/issues/47188 and https://github.com/pandas-dev/pandas/issues/47449 , we should only address set_item in here, so just skip test for oldest 1.4.x pandas.

Yikun avatar Jun 23 '22 12:06 Yikun

@ueshin ready for review, it would be good if you could find some time. : )

Yikun avatar Jun 24 '22 15:06 Yikun

Let me do a brief conclusion to help review:

  • Change setitem to make a copyhttps://github.com/pandas-dev/pandas/commit/03dd698bc1e84c35aba8b51bdd45c472860b9ec3 dataframe.setitem make a copy and never write into the existing array, so we also follow this.
  • Skip update/fillna for 1.4.0-1.4.2: https://github.com/pandas-dev/pandas/issues/47188 this issue fix the update and fillna behaviors, I confirm it had been resolved by pandas 1.4.3, so we skip the validate copy check.
  • Skip eval for 1.4.0-1.4.3: https://github.com/pandas-dev/pandas/issues/47449 eval still have a regression, so we skip it in 1.4.0~1.4.3.

I also test 1.4.2 CI with this PR: https://github.com/Yikun/spark/pull/113, all tests passed, so we can upgrade pandas to 1.4.2 in CI after this PR merged.

So, it's ready for review!

Yikun avatar Jul 08 '22 12:07 Yikun

I will leave it to @ueshin

HyukjinKwon avatar Jul 11 '22 02:07 HyukjinKwon

just a rebase because we update numpy version in infra, and add a line in migration guide,

Yikun avatar Jul 14 '22 01:07 Yikun

All test passed and

Test passed with Pandas 1.4.2: https://github.com/Yikun/spark/pull/148

Test passed with Pandas 1.4.3: https://github.com/Yikun/spark/pull/147

@ueshin Would you mind take a look? Thanks!

Yikun avatar Aug 09 '22 22:08 Yikun

@ueshin Thanks for review!

Yikun avatar Aug 11 '22 01:08 Yikun

@HyukjinKwon Mind to take another look?

Yikun avatar Aug 11 '22 12:08 Yikun

Just a rebase

Yikun avatar Aug 17 '22 10:08 Yikun

Merged to master.

HyukjinKwon avatar Aug 18 '22 03:08 HyukjinKwon