spark
spark copied to clipboard
[SPARK-38946][PYTHON][PS] Generates a new dataframe instead of operating inplace in setitem
What changes were proposed in this pull request?
Generates a new dataframe instead of operating inplace in setitem
Why are the changes needed?
Make CI passed in with pandas 1.4.3
Since pandas 1.4.0 https://github.com/pandas-dev/pandas/commit/03dd698bc1e84c35aba8b51bdd45c472860b9ec3 , dataframe.setitem should always make a copy and never write into the existing array.
Does this PR introduce any user-facing change?
No
How was this patch tested?
CI test with current pandas (1.3.x) and latest pandas 1.4.2, 1.4.3
This is still a WIP PR now, only make ci passed with panda 1.4.x.
~- If we want to follow pandas behavior, we might want to revert https://github.com/databricks/koalas/pull/1592~ ~- If we only want to still keep this, we might only need this test fix PR.~
I haven't got enough info to see why we added https://github.com/databricks/koalas/pull/1592 before (performance or follow pandas old behavior).
cc @HyukjinKwon @ueshin Maybe you could give some input or suggestion in here.
I think this needs @ueshin's look
I haven't got enough info to see why we added https://github.com/databricks/koalas/pull/1592 before (performance or follow pandas old behavior).
That's to follow the pandas behavior at that time and seems like it's still necessary.
The examples in the description of https://github.com/databricks/koalas/pull/1592 are still valid. E.g.,:
>>> pd.__version__
'1.4.1'
>>> pdf = pd.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6], "y": [np.nan, 2, 3, 4, np.nan, 6]})
>>> pser = pdf.x
>>> pser.fillna(0, inplace=True)
>>> pser
0 0.0
1 2.0
2 3.0
3 4.0
4 0.0
5 6.0
Name: x, dtype: float64
>>> pdf
x y
0 0.0 NaN
1 2.0 2.0
2 3.0 3.0
3 4.0 4.0
4 0.0 NaN
5 6.0 6.0
If we revert the change, it won't work anymore.
Have you tried to revert the changes except for the tests and run the whole tests?
I guess some tests fail, especially some in test_series.py.
If there are behavior changes in pandas, we should fix our behavior.
@ueshin OK, thanks for your info, I will take an another look soon.
Just confirming, any update on this ?
@itholic We might want to do some special process when calling _update_internal_frame for pandas 1.4.x. Will update today.
Test with Panda 1.4.2: https://github.com/Yikun/spark/runs/6574570761?check_suite_focus=true#step:8:20
All UT passed, but doctest failed due to other unrelated failed.
This PR is ready for review.
@HyukjinKwon @xinrong-databricks @itholic
The renaming is so much better, thanks Yikun! LGTM.
Let me defer to @ueshin
The condition to generate a new dataframe seems a bit more complex?
@ueshin Yes, it is really yes from your example, the setitem has behavior influence on several functions and change the final or part of behaviors for these functions (but pandas not mention this behavior change). Anyway, I will raise a issue on Pandas comunity to get the detail attitude for these beahavior changes.
https://github.com/pandas-dev/pandas/issues/47188
According to https://github.com/pandas-dev/pandas/issues/47188 and https://github.com/pandas-dev/pandas/issues/47449 , we should only address set_item in here, so just skip test for oldest 1.4.x pandas.
@ueshin ready for review, it would be good if you could find some time. : )
Let me do a brief conclusion to help review:
- Change
setitemto make a copyhttps://github.com/pandas-dev/pandas/commit/03dd698bc1e84c35aba8b51bdd45c472860b9ec3 dataframe.setitem make a copy and never write into the existing array, so we also follow this. - Skip update/fillna for 1.4.0-1.4.2: https://github.com/pandas-dev/pandas/issues/47188 this issue fix the
updateandfillnabehaviors, I confirm it had been resolved by pandas 1.4.3, so we skip the validate copy check. - Skip eval for 1.4.0-1.4.3: https://github.com/pandas-dev/pandas/issues/47449
evalstill have a regression, so we skip it in 1.4.0~1.4.3.
I also test 1.4.2 CI with this PR: https://github.com/Yikun/spark/pull/113, all tests passed, so we can upgrade pandas to 1.4.2 in CI after this PR merged.
So, it's ready for review!
I will leave it to @ueshin
just a rebase because we update numpy version in infra, and add a line in migration guide,
All test passed and
Test passed with Pandas 1.4.2: https://github.com/Yikun/spark/pull/148
Test passed with Pandas 1.4.3: https://github.com/Yikun/spark/pull/147
@ueshin Would you mind take a look? Thanks!
@ueshin Thanks for review!
@HyukjinKwon Mind to take another look?
Just a rebase
Merged to master.