BUG: Inconsistency in DataFrame.where between inplace=True and inplace=False with an NA-like value for StringArray
Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
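import numpy as np
import pandas as pd

print(pd.__version__)
df = pd.DataFrame({"A": ["1", "", "3"]}, dtype="string")

# inplace=False: inspect the backing array, or print the error if where() raises
try:
    result = df.where(df != "", np.nan)
    arr = result["A"]._values
    print(arr)
    print(type(arr[1]))
except Exception as e:
    print(e)

# inplace=True: same call, same inspection of the backing array
df.where(df != "", np.nan, inplace=True)
print(df)
arr = df["A"]._values
print(arr)
print(type(arr[1]))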
Issue Description
code sample based on #46366
1.4.1
StringArray requires a sequence of strings or pandas.NA
A
0 1
1 NaN
2 3
<StringArray>
['1', nan, '3']
Length: 3, dtype: string
<class 'float'>
1.5.0.dev0+595.gf99ec8bf80
<StringArray>
['1', <NA>, '3']
Length: 3, dtype: string
<class 'pandas._libs.missing.NAType'>
A
0 1
1 NaN
2 3
<StringArray>
['1', nan, '3']
Length: 3, dtype: string
<class 'float'>
Expected Behavior
The behavior for the inplace=False case has changed between 1.4.1 and main because #45168 allows other NA values in the StringArray constructor.
Whether this is correct for the DataFrame.where case may need discussion. Either way, the results for the inplace=True case look incorrect to me and should be consistent with the inplace=False case.
Installed Versions
.
The behavior on main has changed since this issue was opened (see https://github.com/pandas-dev/pandas/pull/47793#discussion_r925067971).
1.5.0.dev0+1176.gf7e0e68f34
<StringArray>
['1', <NA>, '3']
Length: 3, dtype: string
<class 'pandas._libs.missing.NAType'>
A
0 1
1 <NA>
2 3
<StringArray>
['1', <NA>, '3']
Length: 3, dtype: string
<class 'pandas._libs.missing.NAType'>
The underlying StringArray is now correct in the sense that the array elements are only string values or pd.NA.
I'll bisect to confirm where fixed, but assuming #47763
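For anyone wanting to check that invariant directly instead of reading the repr, here is a minimal sketch. The helper name holds_only_str_or_na is hypothetical (not a pandas API); it walks the public .array accessor and accepts only str or pd.NA elements.

```python
import numpy as np
import pandas as pd


def holds_only_str_or_na(series: pd.Series) -> bool:
    """Hypothetical helper: True if every stored element is a str or pd.NA."""
    return all(isinstance(v, str) or v is pd.NA for v in series.array)


df = pd.DataFrame({"A": ["1", "", "3"]}, dtype="string")
df.where(df != "", np.nan, inplace=True)
print(holds_only_str_or_na(df["A"]))

# An object-dtype Series can hold a float nan, so the same check fails there.
print(holds_only_str_or_na(pd.Series(["1", np.nan], dtype=object)))
```

On a build with the fix, the first check should pass for the inplace=True result; on 1.4.1 the output above suggests it would not.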
Whether this is correct for the DataFrame.where case may need discussion. Either way, the results for the inplace=True case look incorrect to me and should be consistent with the inplace=False case.
So we just need to confirm here whether DataFrame.where should treat np.nan as a missing value indicator (the current behavior on main), or whether np.nan should be considered an explicit assignment, in which case the result should be object dtype (np.nan is a float, and a StringArray cannot hold float values).
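To make the two options concrete, here is a rough sketch of what each looks like from the user's side. The second variant is not something DataFrame.where does for string dtype; it is emulated here with an explicit astype(object), which is an assumption made purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["1", "", "3"]}, dtype="string")

# Option 1: np.nan is a missing-value indicator, so the string dtype is kept
# and the masked element is stored as pd.NA.
as_missing = df.where(df != "", np.nan)
print(as_missing["A"].dtype, type(as_missing.loc[1, "A"]))

# Option 2 (emulated via an explicit cast): np.nan is an ordinary float value,
# which only an object-dtype column can hold.
obj = df.astype(object)
as_value = obj.where(obj != "", np.nan)
print(as_value["A"].dtype, type(as_value.loc[1, "A"]))
```

With option 1 the caller keeps the string dtype; with option 2 the caller has to opt in to object dtype to get a float nan back.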
I'll bisect to confirm where fixed, but assuming #47763
Can confirm. Fixed in commit: [1b1dd36016bbf0b216d654b84d62a005fa5b48a0] BUG: fix regression in Series[string] setitem setting a scalar with a mask (#47763)
Looks right on main. Could use a test (first check to see if one already exists).
Hi, I'd like to work on this issue. I've verified the bug is fixed in main (3.0.0.dev0) - both inplace=True and inplace=False now correctly return pd.NA with NAType. I'll add a regression test to prevent this from breaking again.
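A possible shape for that regression test, as a sketch only: the test name is hypothetical, and it uses the public pd.testing.assert_frame_equal, not whatever helper the pandas test suite would actually use. It assumes the behavior currently on main (np.nan treated as missing) is the one to lock in.

```python
import numpy as np
import pandas as pd


def test_where_string_na_inplace_matches_copy():
    # Hypothetical test name; mirrors the reproducer from this issue.
    df = pd.DataFrame({"A": ["1", "", "3"]}, dtype="string")
    expected = pd.DataFrame({"A": ["1", pd.NA, "3"]}, dtype="string")

    # inplace=False: the masked element should come back as pd.NA.
    result = df.where(df != "", np.nan)
    pd.testing.assert_frame_equal(result, expected)

    # inplace=True: should match the inplace=False result exactly.
    df.where(df != "", np.nan, inplace=True)
    pd.testing.assert_frame_equal(df, expected)
    assert df.loc[1, "A"] is pd.NA


test_where_string_na_inplace_matches_copy()
```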