
[SPARK-39824][PYTHON][PS] Introduce index where and putmask func in pyspark

Open bzhaoopenstack opened this issue 3 years ago • 12 comments

What changes were proposed in this pull request?

Add more pyspark pandas Index functions that are similar to their pandas counterparts.

Why are the changes needed?

Add where and putmask, two closely related functions, to pyspark pandas Index.

Does this PR introduce any user-facing change?

No

How was this patch tested?

        >>> idx = ps.Index(['car', 'bike', 'train', 'tractor'])
        >>> idx
        Index(['car', 'bike', 'train', 'tractor'], dtype='object')
        >>> idx.where(idx.isin(['car', 'train']), 'other')
        Index(['car', 'other', 'train', 'other'], dtype='object')
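The relationship between the two functions can be sketched in plain Python. This is only an illustration of the pandas semantics (where keeps values where the condition is True, putmask replaces values where the mask is True), not the implementation proposed in this PR; the helper functions below are hypothetical and operate on lists rather than on an Index.

```python
from typing import Any, List, Sequence


def where(values: Sequence[Any], cond: Sequence[bool], other: Any) -> List[Any]:
    """Keep each value where cond is True; use `other` where cond is False."""
    if len(values) != len(cond):
        raise ValueError("cond must have the same length as values")
    return [v if keep else other for v, keep in zip(values, cond)]


def putmask(values: Sequence[Any], mask: Sequence[bool], value: Any) -> List[Any]:
    """Replace each value where mask is True; keep it where mask is False."""
    if len(values) != len(mask):
        raise ValueError("mask must have the same length as values")
    return [value if hit else v for v, hit in zip(values, mask)]


idx = ["car", "bike", "train", "tractor"]
keep = [v in ("car", "train") for v in idx]

# where() with a condition matches putmask() with the inverted condition.
kept = where(idx, keep, "other")
masked = putmask(idx, [not k for k in keep], "other")
```

Under this reading, a putmask call is equivalent to a where call with the condition inverted, which is why the two are proposed together.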

bzhaoopenstack avatar Jul 20 '22 07:07 bzhaoopenstack

cc @zhengruifeng @xinrong-meng @itholic FYI

HyukjinKwon avatar Jul 20 '22 08:07 HyukjinKwon

Oh btw, we can use [PS] tag for the PR title related to pandas-on-Spark changes.

Can we also add [PS] to the title? (and to your other open PRs such as https://github.com/apache/spark/pull/37044, https://github.com/apache/spark/pull/37232, https://github.com/apache/spark/pull/37234 as well :-))

itholic avatar Jul 20 '22 09:07 itholic

Can one of the admins verify this patch?

AmplabJenkins avatar Jul 20 '22 23:07 AmplabJenkins

> Oh btw, we can use [PS] tag for the PR title related to pandas-on-Spark changes.
>
> Can we also add [PS] to the title? (and to your other open PRs such as #37044, #37232, #37234 as well :-))

OK. Sorry, I'm not familiar with the Spark PR contribution protocol.

bzhaoopenstack avatar Jul 21 '22 02:07 bzhaoopenstack

new methods should also be listed in python/docs/source/reference/pyspark.pandas/indexing.rst

zhengruifeng avatar Jul 21 '22 09:07 zhengruifeng

this failure needs to be fixed

======================================================================
ERROR [0.436s]: test_missing (pyspark.pandas.tests.indexes.test_base.IndexesTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/pandas/tests/indexes/test_base.py", line 508, in test_missing
    getattr(psdf.set_index(["a", "b"]).index, name)()
TypeError: putmask() missing 2 required positional arguments: 'mask' and 'value'
----------------------------------------------------------------------

as to the linter failure, you may just run dev/reformat-python

zhengruifeng avatar Jul 21 '22 09:07 zhengruifeng

also cc @Yikun

zhengruifeng avatar Jul 21 '22 09:07 zhengruifeng

> new methods should also be listed in python/docs/source/reference/pyspark.pandas/indexing.rst

Oh, yes. I forgot. Thanks

bzhaoopenstack avatar Jul 22 '22 08:07 bzhaoopenstack

> this failure needs to be fixed
>
> ======================================================================
> ERROR [0.436s]: test_missing (pyspark.pandas.tests.indexes.test_base.IndexesTest)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/pandas/tests/indexes/test_base.py", line 508, in test_missing
>     getattr(psdf.set_index(["a", "b"]).index, name)()
> TypeError: putmask() missing 2 required positional arguments: 'mask' and 'value'
> ----------------------------------------------------------------------
>
> as to the linter failure, you may just run dev/reformat-python

Hi Zheng, thank you very much. But I think I need some help with MultiIndex putmask. I didn't find any use case for MultiIndex putmask, and the behaviors of pandas and numpy are different too. So I still haven't verified that the putmask func I introduced works well. Could you please show some use cases to help with this?

BTW, this failure happened because I didn't remove 'putmask' from the MultiIndex missing list. ;-)
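One way such a failure can arise is sketched below. It assumes that unsupported methods are served by a `__getattr__` fallback driven by a missing list, and that the test calls every listed name with no arguments and expects a "not implemented" error. All class, exception, and method names here are hypothetical simplifications, not the actual pyspark.pandas code.

```python
class NotImplementedInSketch(Exception):
    """Hypothetical stand-in for a 'pandas API not implemented' error."""


# Hypothetical missing list: names the subclass claims not to support.
MISSING = {"putmask", "other_method"}


class Index:
    def putmask(self, mask, value):
        # Newly added real method on the base class.
        return "real implementation"


class MultiIndex(Index):
    def __getattr__(self, name):
        # Python calls __getattr__ only when normal attribute lookup fails,
        # so the inherited Index.putmask is found before this fallback runs.
        if name in MISSING:
            def unsupported(*args, **kwargs):
                raise NotImplementedInSketch(name)
            return unsupported
        raise AttributeError(name)


def check_missing(obj):
    """Call every listed name with no arguments and record what happens."""
    results = {}
    for name in sorted(MISSING):
        try:
            getattr(obj, name)()
            results[name] = "no error"
        except NotImplementedInSketch:
            results[name] = "not implemented"
        except TypeError:
            results[name] = "TypeError"
    return results


results = check_missing(MultiIndex())
```

In this sketch the stale "putmask" entry resolves to the inherited real method, so calling it without arguments raises a TypeError about missing positional arguments instead of the expected error. Removing the entry from the missing list, or overriding the method on the subclass, would avoid that.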

bzhaoopenstack avatar Jul 22 '22 08:07 bzhaoopenstack

@bzhaoopenstack

> I didn't find any use case for MultiIndex putmask

does pandas support MultiIndex putmask?

> the behaviors of pandas and numpy are different too.

do you mean the MultiIndex?

zhengruifeng avatar Jul 25 '22 07:07 zhengruifeng

> @bzhaoopenstack
>
> > I didn't find any use case for MultiIndex putmask
>
> does pandas support MultiIndex putmask?

From the release doc and API doc, yeah, it claims that MultiIndex supports putmask (but not where), but there is no example. So it's hard for me to verify the functionality, as I have never managed to call MultiIndex.putmask successfully with pandas.

> > the behaviors of pandas and numpy are different too.
>
> do you mean the MultiIndex?

Yeah, from my exploration, numpy.putmask behaves differently from pandas MultiIndex.putmask: calls that succeed in numpy fail in pandas.

bzhaoopenstack avatar Jul 25 '22 08:07 bzhaoopenstack

> Yeah, from my exploration, numpy.putmask behaves differently from pandas MultiIndex.putmask: calls that succeed in numpy fail in pandas.

I guess we don't care about the difference between pandas and numpy. We'd better make the behavior of PS (Pandas API on Spark) the same as pandas.

zhengruifeng avatar Jul 28 '22 07:07 zhengruifeng

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] avatar Nov 06 '22 00:11 github-actions[bot]