[SPARK-39824][PYTHON][PS] Introduce index where and putmask func in pyspark
What changes were proposed in this pull request?
Add more pyspark pandas Index functions that behave like their pandas counterparts.
Why are the changes needed?
Add where and putmask, two very similar functions, to pyspark pandas.
Does this PR introduce any user-facing change?
No
How was this patch tested?
>>> idx = ps.Index(['car', 'bike', 'train', 'tractor'])
>>> idx
Index(['car', 'bike', 'train', 'tractor'], dtype='object')
>>> idx.where(idx.isin(['car', 'train']), 'other')
Index(['car', 'other', 'train', 'other'], dtype='object')
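For reference, here is a minimal sketch of how where and putmask relate, written against plain pandas rather than pandas-on-Spark (the `ps.Index.putmask` proposed in this PR is assumed to mirror it, which the thread below has not yet verified). `where` keeps the entries where the condition is True, while `putmask` replaces the entries where the mask is True, so the two calls below should agree when the masks are complements of each other.

```python
import pandas as pd

idx = pd.Index(["car", "bike", "train", "tractor"])
keep = idx.isin(["car", "train"])  # boolean NumPy array, True for entries to keep

# where: keep entries where the condition is True, replace the rest with 'other'
via_where = idx.where(keep, "other")

# putmask: replace entries where the mask is True, keep the rest
via_putmask = idx.putmask(~keep, "other")

print(via_where.tolist())
print(via_putmask.tolist())
```

Both methods return a new Index and leave `idx` unchanged.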
cc @zhengruifeng @xinrong-meng @itholic FYI
Oh btw, we can use [PS] tag for the PR title related to pandas-on-Spark changes.
Can we also add [PS] to the title ? (and your other open PRs such as https://github.com/apache/spark/pull/37044, https://github.com/apache/spark/pull/37232, https://github.com/apache/spark/pull/37234 as well :-))
Can one of the admins verify this patch?
OK. Sorry, I'm not familiar with the Spark PR contribution protocol.
new methods should also be listed in python/docs/source/reference/pyspark.pandas/indexing.rst
this failure needs to be fixed:
======================================================================
ERROR [0.436s]: test_missing (pyspark.pandas.tests.indexes.test_base.IndexesTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/__w/spark/spark/python/pyspark/pandas/tests/indexes/test_base.py", line 508, in test_missing
getattr(psdf.set_index(["a", "b"]).index, name)()
TypeError: putmask() missing 2 required positional arguments: 'mask' and 'value'
----------------------------------------------------------------------
As for the linter failure, you can just run dev/reformat-python
also cc @Yikun
Oh, yes. I forgot. Thanks
Hi Zheng, thank you very much. I think I need some help with MultiIndex putmask, though. I couldn't find any use case for MultiIndex putmask, and the behavior differs between pandas and numpy, so I still haven't verified that the putmask func I introduced works correctly. Could you please show some use cases to help with this?
BTW, the test failure is because I didn't remove 'putmask' from the MultiIndex missing list. ;-)
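To illustrate why a stale entry in the missing list would produce that TypeError, here is a rough, self-contained sketch. It is not the actual pyspark code: the classes, the `_missing` set, and `NotImplementedInPS` are hypothetical stand-ins, and the assumption is that missing methods are served by a fallback lookup that only runs when normal attribute lookup fails.

```python
class NotImplementedInPS(Exception):
    """Hypothetical stand-in for the 'method not implemented' error."""


class Index:
    _missing = set()  # hypothetical list of not-yet-implemented method names

    def putmask(self, mask, value):
        # Newly implemented method; inherited by MultiIndex below.
        return "putmask called"

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name in self._missing:
            def stub(*args, **kwargs):
                raise NotImplementedInPS(name)
            return stub
        raise AttributeError(name)


class MultiIndex(Index):
    _missing = {"putmask", "where"}  # 'putmask' left in by mistake


def check_missing(index):
    """Call every 'missing' name with no arguments, as the test loop does."""
    results = {}
    for name in sorted(index._missing):
        try:
            getattr(index, name)()
        except NotImplementedInPS:
            results[name] = "ok: reported as missing"
        except TypeError as exc:
            # The real, inherited method was found and called without arguments.
            results[name] = f"TypeError: {exc}"
    return results


results = check_missing(MultiIndex())
for name, outcome in results.items():
    print(name, "->", outcome)
```

In this sketch, `where` is still reported as missing, while `putmask` resolves to the inherited method and fails on the empty argument list, which matches the shape of the error in the traceback above.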
@bzhaoopenstack
I didn't find any usecase about MultiIndex putmask
does pandas support MultiIndex putmask?
the behaviors between pandas and numpy are different too.
do you mean the MultiIndex?
From the release doc and the API doc, yes: pandas claims that MultiIndex supports putmask (but not where), but there is no example. So it's hard for me to verify the functionality, as I have never managed to call MultiIndex.putmask successfully in pandas.
Yeah, from my exploration, numpy.putmask behaves differently from pandas MultiIndex.putmask: calls that succeed in numpy fail in pandas.
I guess we don't need to care about the difference between pandas and numpy. We'd better make the behavior of PS (Pandas API on Spark) the same as pandas.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!