spark [SPARK-38946][PYTHON][PS] Generates a new dataframe instead of operating inplace in setitem

What changes were proposed in this pull request?

Generates a new dataframe instead of operating inplace in setitem

Why are the changes needed?

Make CI passed in with pandas 1.4.3

Since pandas 1.4.0 https://github.com/pandas-dev/pandas/commit/03dd698bc1e84c35aba8b51bdd45c472860b9ec3 , dataframe.setitem should always make a copy and never write into the existing array.

Does this PR introduce any user-facing change?

No

How was this patch tested?

CI test with current pandas (1.3.x) and latest pandas 1.4.2, 1.4.3

Apr 26 '22 07:04 Yikun

This is still a WIP PR now, only make ci passed with panda 1.4.x.

~- If we want to follow pandas behavior, we might want to revert https://github.com/databricks/koalas/pull/1592~ ~- If we only want to still keep this, we might only need this test fix PR.~

Apr 26 '22 07:04 Yikun

I haven't got enough info to see why we added https://github.com/databricks/koalas/pull/1592 before (performance or follow pandas old behavior).

cc @HyukjinKwon @ueshin Maybe you could give some input or suggestion in here.

Apr 26 '22 07:04 Yikun

I think this needs @ueshin's look

Apr 26 '22 10:04 HyukjinKwon

I haven't got enough info to see why we added https://github.com/databricks/koalas/pull/1592 before (performance or follow pandas old behavior).

That's to follow the pandas behavior at that time and seems like it's still necessary.

The examples in the description of https://github.com/databricks/koalas/pull/1592 are still valid. E.g.,:

>>> pd.__version__
'1.4.1'
>>> pdf = pd.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6], "y": [np.nan, 2, 3, 4, np.nan, 6]})
>>> pser = pdf.x
>>> pser.fillna(0, inplace=True)
>>> pser
0    0.0
1    2.0
2    3.0
3    4.0
4    0.0
5    6.0
Name: x, dtype: float64
>>> pdf
     x    y
0  0.0  NaN
1  2.0  2.0
2  3.0  3.0
3  4.0  4.0
4  0.0  NaN
5  6.0  6.0

If we revert the change, it won't work anymore.

Have you tried to revert the changes except for the tests and run the whole tests? I guess some tests fail, especially some in test_series.py.

If there are behavior changes in pandas, we should fix our behavior.

Apr 26 '22 22:04 ueshin

@ueshin OK, thanks for your info, I will take an another look soon.

Apr 27 '22 13:04 Yikun

Just confirming, any update on this ?

May 24 '22 03:05 itholic

@itholic We might want to do some special process when calling _update_internal_frame for pandas 1.4.x. Will update today.

May 24 '22 04:05 Yikun

Test with Panda 1.4.2: https://github.com/Yikun/spark/runs/6574570761?check_suite_focus=true#step:8:20

All UT passed, but doctest failed due to other unrelated failed.

This PR is ready for review.

@HyukjinKwon @xinrong-databricks @itholic

May 24 '22 13:05 Yikun

The renaming is so much better, thanks Yikun! LGTM.

May 27 '22 16:05 xinrong-meng

Let me defer to @ueshin

May 30 '22 02:05 HyukjinKwon

The condition to generate a new dataframe seems a bit more complex?

@ueshin Yes, it is really yes from your example, the setitem has behavior influence on several functions and change the final or part of behaviors for these functions (but pandas not mention this behavior change). Anyway, I will raise a issue on Pandas comunity to get the detail attitude for these beahavior changes.

Jun 01 '22 03:06 Yikun

https://github.com/pandas-dev/pandas/issues/47188

Jun 01 '22 08:06 Yikun

According to https://github.com/pandas-dev/pandas/issues/47188 and https://github.com/pandas-dev/pandas/issues/47449 , we should only address set_item in here, so just skip test for oldest 1.4.x pandas.

Jun 23 '22 12:06 Yikun

@ueshin ready for review, it would be good if you could find some time. : )

Jun 24 '22 15:06 Yikun

Let me do a brief conclusion to help review:

Change setitem to make a copyhttps://github.com/pandas-dev/pandas/commit/03dd698bc1e84c35aba8b51bdd45c472860b9ec3 dataframe.setitem make a copy and never write into the existing array, so we also follow this.
Skip update/fillna for 1.4.0-1.4.2: https://github.com/pandas-dev/pandas/issues/47188 this issue fix the update and fillna behaviors, I confirm it had been resolved by pandas 1.4.3, so we skip the validate copy check.
Skip eval for 1.4.0-1.4.3: https://github.com/pandas-dev/pandas/issues/47449 eval still have a regression, so we skip it in 1.4.0~1.4.3.

I also test 1.4.2 CI with this PR: https://github.com/Yikun/spark/pull/113, all tests passed, so we can upgrade pandas to 1.4.2 in CI after this PR merged.

So, it's ready for review!

Jul 08 '22 12:07 Yikun

I will leave it to @ueshin

Jul 11 '22 02:07 HyukjinKwon

just a rebase because we update numpy version in infra, and add a line in migration guide,

Jul 14 '22 01:07 Yikun

All test passed and

Test passed with Pandas 1.4.2: https://github.com/Yikun/spark/pull/148

Test passed with Pandas 1.4.3: https://github.com/Yikun/spark/pull/147

@ueshin Would you mind take a look? Thanks!

Aug 09 '22 22:08 Yikun

@ueshin Thanks for review!

Aug 11 '22 01:08 Yikun

@HyukjinKwon Mind to take another look?

Aug 11 '22 12:08 Yikun

Just a rebase

Aug 17 '22 10:08 Yikun

Merged to master.

Aug 18 '22 03:08 HyukjinKwon

spark spark copied to clipboard

[SPARK-38946][PYTHON][PS] Generates a new dataframe instead of operating inplace in setitem

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

spark
spark copied to clipboard