BUG: NA value doesn't match mask condition, still masked

Open shobsi opened this issue 2 years ago • 6 comments

Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

>>> import pandas as pd
>>> s = pd.Series([123456789, -987654321, 314159, pd.NA, -234892, 55555], name='int64_col', dtype=pd.Int64Dtype())
>>> s
0     123456789
1    -987654321
2        314159
3          <NA>
4       -234892
5         55555
Name: int64_col, dtype: Int64
>>> s.mask(s%2 == 1)
0       <NA>
1       <NA>
2       <NA>
3       <NA>
4    -234892
5       <NA>
Name: int64_col, dtype: Int64
>>> s.mask(s%2 == 1, -1)
0         -1
1         -1
2         -1
3         -1
4    -234892
5         -1
Name: int64_col, dtype: Int64
>>> pd.__version__
'2.0.1'
>>>

Issue Description

Series.mask masks NA entries even though the condition evaluates to <NA>, not True, for them. See the reproducible example above.

Expected Behavior

>>> import pandas as pd
>>> s = pd.Series([123456789, -987654321, 314159, pd.NA, -234892, 55555], name='int64_col', dtype=pd.Int64Dtype())
>>> s
0     123456789
1    -987654321
2        314159
3          <NA>
4       -234892
5         55555
Name: int64_col, dtype: Int64
>>> s.mask(s%2 == 1)
0       <NA>
1       <NA>
2       <NA>
3       <NA>
4    -234892
5       <NA>
Name: int64_col, dtype: Int64
>>> s.mask(s%2 == 1, -1)
0         -1
1         -1
2         -1
3       <NA>
4    -234892
5         -1
Name: int64_col, dtype: Int64

Installed Versions

pd.__version__ '2.0.1'

shobsi avatar Apr 27 '23 03:04 shobsi
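
For readers hitting this today, the report can be illustrated with a small sketch. It shows that the condition itself is <NA> at the missing row, and then applies one possible workaround: filling the condition explicitly before calling mask. The names cond, explicit_cond and masked are illustrative only, and this is a user-side workaround, not the fix discussed in this thread.

```python
import pandas as pd

s = pd.Series(
    [123456789, -987654321, 314159, pd.NA, -234892, 55555],
    name="int64_col",
    dtype=pd.Int64Dtype(),
)

# The comparison propagates the missing value: row 3 of cond is <NA>,
# so it is neither True nor False.
cond = s % 2 == 1
print(cond)

# Workaround sketch: decide explicitly that "unknown" means "do not mask",
# then pass a plain boolean condition to mask.
explicit_cond = cond.fillna(False).astype(bool)
masked = s.mask(explicit_cond, -1)
print(masked)
```

Filling with False encodes the expectation from the report: a row is replaced only when the condition is definitely True.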

Thanks for the bug report, @shobsi .

Yes, this is clearly not the intended behavior. A PR would be welcome.

topper-123 avatar May 04 '23 11:05 topper-123

Hi, I'm interested in this issue and looked into the code.

https://github.com/pandas-dev/pandas/blob/607316c9b6f5839389161441da8cff2eff5eccd0/pandas/core/generic.py#L9888-L9890

I think this is the suspicious line: it fills NA values with the value of inplace, and I don't understand why. Wouldn't this potentially produce different behavior for inplace=True and inplace=False?

As for this specific issue, I think we can call fillna on cond before passing it to _where. The problem is that cond can be any array-like, and only NDFrame has the fillna method. However, I believe that when cond is not an NDFrame, it is not allowed to contain objects that cannot be evaluated as booleans. Therefore, could we do the following?

    def where(self, cond, other=np.nan, *, inplace=False, axis=None, level=None):
        other = common.apply_if_callable(other, self)

        if isinstance(cond, NDFrame):
            cond = cond.fillna(True)

        return self._where(cond, other, inplace, axis, level)

I'm not sure whether this is a neat approach. Although it passes all test cases that call either where or mask, I'm not sure whether it could cause other problems.

Charlie-XIAO avatar May 06 '23 18:05 Charlie-XIAO
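
The effect of the proposal above can be approximated from user code. The sketch below is a standalone illustration, not pandas internals: where_keeping_na is a hypothetical helper, and the check against pd.Series and pd.DataFrame stands in for the NDFrame check in the snippet. It relies on mask(cond) behaving like where(~cond), so filling NA with True in the inverted condition means "keep the original value".

```python
import pandas as pd


def where_keeping_na(obj, cond, other):
    """Hypothetical helper: fill NA in cond with True, then call where."""
    if isinstance(cond, (pd.Series, pd.DataFrame)):
        # True means "keep the original value" in where, so rows whose
        # condition is unknown are left untouched.
        cond = cond.fillna(True).astype(bool)
    return obj.where(cond, other)


s = pd.Series([3, pd.NA, 4], dtype="Int64")
cond = s % 2 == 1  # True, <NA>, False

# Emulate s.mask(cond, -1) as a where call on the inverted condition.
# Inverting a nullable boolean keeps <NA> as <NA>.
result = where_keeping_na(s, ~cond, -1)
print(result)
```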

I think it sounds great that you want to take this! I agree that fill_value = bool(inplace) looks very strange.

_where may be called directly from internal functions, so I think doing the fillna in _where would be better than doing it in where.

topper-123 avatar May 06 '23 19:05 topper-123

Ah, true, thanks for the reminder! I'll open a PR to add those lines in _where. As for fill_value = bool(inplace), I may try to construct an example to see whether it causes inconsistencies between inplace=True and inplace=False, and open another issue if so.

Charlie-XIAO avatar May 06 '23 21:05 Charlie-XIAO

take

Charlie-XIAO avatar May 06 '23 21:05 Charlie-XIAO

@topper-123 I think I now understand what fill_value = bool(inplace) is trying to do. For instance, _setitem_frame calls _where with inplace=True, so that pd.NA propagates if one does something like the following:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.random((3, 3)), dtype=pd.Float64Dtype())
>>> df.loc[0, 0] = pd.NA
>>> df
          0         1         2
0      <NA>  0.609241  0.419094
1  0.274784  0.342904  0.026101
2  0.670259  0.218889  0.177126
>>> df[df >= 0.5] = 0
>>> df
          0         1         2
0      <NA>       0.0  0.419094
1  0.274784  0.342904  0.026101
2       0.0  0.218889  0.177126

Still, I suspect this could cause some problems.

Charlie-XIAO avatar May 07 '23 00:05 Charlie-XIAO
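
The concern about fill_value = bool(inplace) can be made concrete with a small sketch. It does not call pandas internals or pass inplace anywhere; resolve_cond is a hypothetical helper that only imitates the fill rule quoted in this thread, so the two possible fill values can be compared side by side on the same data.

```python
import pandas as pd


def resolve_cond(cond, inplace):
    """Hypothetical helper imitating fill_value = bool(inplace)."""
    return cond.fillna(bool(inplace)).astype(bool)


s = pd.Series([1.0, pd.NA, 3.0], dtype=pd.Float64Dtype())
cond = s >= 2  # False, <NA>, True

# NA filled with True: the unknown row keeps its original value (<NA>).
as_if_inplace = s.where(resolve_cond(cond, inplace=True), 0)

# NA filled with False: the unknown row is replaced by `other`.
as_if_copy = s.where(resolve_cond(cond, inplace=False), 0)

print(as_if_inplace)
print(as_if_copy)
```

The two results differ only at the row where the condition is unknown, which is the kind of inplace-dependent inconsistency raised earlier in the thread.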