BUG: NA value doesn't match mask condition, still masked
Pandas version checks
-
[X] I have checked that this issue has not already been reported.
-
[X] I have confirmed this bug exists on the latest version of pandas.
-
[X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
>>> import pandas as pd
>>> s = pd.Series([123456789, -987654321, 314159, pd.NA, -234892, 55555], name='int64_col', dtype=pd.Int64Dtype())
>>> s
0 123456789
1 -987654321
2 314159
3 <NA>
4 -234892
5 55555
Name: int64_col, dtype: Int64
>>> s.mask(s%2 == 1)
0 <NA>
1 <NA>
2 <NA>
3 <NA>
4 -234892
5 <NA>
Name: int64_col, dtype: Int64
>>> s.mask(s%2 == 1, -1)
0 -1
1 -1
2 -1
3 -1
4 -234892
5 -1
Name: int64_col, dtype: Int64
>>> pd.__version__
'2.0.1'
>>>
Issue Description
Series.mask API is masking NA which does not match the mask condition. See the repro example.
Expected Behavior
>>> import pandas as pd
>>> s = pd.Series([123456789, -987654321, 314159, pd.NA, -234892, 55555], name='int64_col', dtype=pd.Int64Dtype())
>>> s
0 123456789
1 -987654321
2 314159
3 <NA>
4 -234892
5 55555
Name: int64_col, dtype: Int64
>>> s.mask(s%2 == 1)
0 <NA>
1 <NA>
2 <NA>
3 <NA>
4 -234892
5 <NA>
Name: int64_col, dtype: Int64
>>> s.mask(s%2 == 1, -1)
0 -1
1 -1
2 -1
3 <NA>
4 -234892
5 -1
Name: int64_col, dtype: Int64
Installed Versions
pd.__version__ '2.0.1'
Thanks for the bug report, @shobsi .
Yes, this is clearly not the intended behavior. A PR would be welcome.
Hi, I'm interested in this issue and looked into the code.
https://github.com/pandas-dev/pandas/blob/607316c9b6f5839389161441da8cff2eff5eccd0/pandas/core/generic.py#L9888-L9890
I thinks here is the suspicious line: it is filling NA values with inplace, which I don't understand why. Wouldn't this result in potentially different behaviors when using inplace=True and inplace=False?
As for this specific issue, I'm thinking that we can call fillna on cond before we pass it to _where. The problem is, cond could be any array-like, and only NDFrame has the fillna method. Nevertheless, I think we are not allowed to use objects that cannot be evaluated as a boolean when using in cond when cond is not an NDFrame. Therefore, can we do the following?
def where(self, cond, other=np.nan, *, inplace=False, axis=None, level=None):
other = common.apply_if_callable(other, self)
if isinstance(cond, NDFrame):
cond = cond.fillna(True)
return self._where(cond, other, inplace, axis, level)
Not sure if this is a neat approach. Though it passed all test cases that either calls where or mask, I'm not sure if it can cause other potential problems.
I think that sound great that you want to take this! I agree that fill_value = bool(inplace) looks very strange.
_where may be called from internal functions directly, so I think doing the fillna in _where would be better than in where IMO.
Ah true, thanks for your reminder! I'll open a PR to add those lines in _where. As for fill_value = bool(inplace), I may try to construct some example to see if it can cause inconsistencies when using inplace=True and inplace=False and open another issue if so.
take
@topper-123 I think I may understand now what fill_value = bool(inplace) is trying to do. For instance, _setitem_frame is calling _where with inplace=True, so that pd.NA will propagate if one does something like the following:
>>> df = pd.DataFrame(np.random.random((3, 3)), dtype=pd.Float64Dtype())
>>> df[0][0] = pd.NA
>>> df
0 1 2
0 <NA> 0.609241 0.419094
1 0.274784 0.342904 0.026101
2 0.670259 0.218889 0.177126
>>> df[df >= 0.5] = 0
>>> df
0 1 2
0 <NA> 0.0 0.419094
1 0.274784 0.342904 0.026101
2 0.0 0.218889 0.177126
But still I doubt this will cause some problems.