BUG: Inconsistency in DataFrame.where between inplace and not inplace with na like value for StringArray

Open simonjayhawkins opened this issue 3 years ago • 4 comments

Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

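import numpy as np
import pandas as pd

print(pd.__version__)
df = pd.DataFrame({"A": ["1", "", "3"]}, dtype="string")

# inplace=False: replace empty strings with np.nan and inspect the backing array
try:
    result = df.where(df != "", np.nan)
    arr = result["A"]._values
    print(arr)
    print(type(arr[1]))
except Exception as e:
    print(e)

# inplace=True: same operation, but modifying df directly
df.where(df != "", np.nan, inplace=True)
print(df)
arr = df["A"]._values
print(arr)
print(type(arr[1]))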

Issue Description

Code sample based on #46366. Output on 1.4.1, followed by output on main:

1.4.1
StringArray requires a sequence of strings or pandas.NA
     A
0    1
1  NaN
2    3
<StringArray>
['1', nan, '3']
Length: 3, dtype: string
<class 'float'>
1.5.0.dev0+595.gf99ec8bf80
<StringArray>
['1', <NA>, '3']
Length: 3, dtype: string
<class 'pandas._libs.missing.NAType'>
     A
0    1
1  NaN
2    3
<StringArray>
['1', nan, '3']
Length: 3, dtype: string
<class 'float'>

Expected Behavior

The behavior for the inplace=False case has changed between 1.4.1 and main, because #45168 allows other NA values in the StringArray constructor.

Whether this is correct for the DataFrame.where case may need discussion. Either way, the results for the inplace=True case look incorrect to me and should be consistent with the inplace=False case.

Installed Versions

.

simonjayhawkins, Mar 25 '22 19:03

The behavior on main has changed since this issue was opened; see https://github.com/pandas-dev/pandas/pull/47793#discussion_r925067971

1.5.0.dev0+1176.gf7e0e68f34
<StringArray>
['1', <NA>, '3']
Length: 3, dtype: string
<class 'pandas._libs.missing.NAType'>
      A
0     1
1  <NA>
2     3
<StringArray>
['1', <NA>, '3']
Length: 3, dtype: string
<class 'pandas._libs.missing.NAType'>

The underlying StringArray is now correct in the sense that the array elements are only string values or pd.NA.

I'll bisect to confirm where this was fixed, but I'm assuming #47763.

> Whether this is correct for the DataFrame.where case may need discussion. Either way, the results for the inplace=True case look incorrect to me and should be consistent with the inplace=False case.

So we just need to confirm here whether DataFrame.where should treat np.nan as a missing value indicator (the current behavior on main), or whether np.nan should be considered an explicit assignment, in which case the result should be object dtype (np.nan is a float, and a StringArray cannot hold float values).
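To make the two options concrete, here is a minimal sketch. The first call is the reproducer from this issue; the second only emulates the "explicit assignment" alternative by casting to object first, which is an assumption made for illustration and not something pandas does for string dtype. The names as_missing, as_object and obj are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["1", "", "3"]}, dtype="string")

# Option 1 (behavior reported on main in this thread):
# np.nan is treated as a missing-value indicator, so the column keeps
# its string dtype and the replaced element is pd.NA.
as_missing = df.where(df != "", np.nan)
print(as_missing["A"].dtype, type(as_missing["A"].array[1]))

# Option 2 (hypothetical alternative, emulated here):
# np.nan is treated as an explicit float value. A StringArray cannot
# hold a float, so the column would have to be object dtype.
obj = df.astype(object)
as_object = obj.where(obj != "", np.nan)
print(as_object["A"].dtype, type(as_object["A"].iloc[1]))
```

Under option 2 every later string operation on the column would run on an object column, which is the trade-off under discussion.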

simonjayhawkins, Jul 20 '22 11:07

> I'll bisect to confirm where fixed, but assuming #47763

Can confirm. Fixed in commit: [1b1dd36016bbf0b216d654b84d62a005fa5b48a0] BUG: fix regression in Series[string] setitem setting a scalar with a mask (#47763)

simonjayhawkins, Jul 20 '22 12:07

Looks right on main. Could use a test (first check whether one already exists).

jbrockmendel, Oct 29 '25 20:10

Hi, I'd like to work on this issue. I've verified the bug is fixed in main (3.0.0.dev0) - both inplace=True and inplace=False now correctly return pd.NA with NAType. I'll add a regression test to prevent this from breaking again.
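A regression test along these lines might look like the sketch below. It is not the test that was (or will be) added to pandas; the helper name where_string_na is hypothetical, and it uses only the public `.array` accessor and `pd.testing.assert_frame_equal` instead of the private `_values` from the reproducer.

```python
import numpy as np
import pandas as pd


def where_string_na(inplace: bool) -> pd.DataFrame:
    """Run the reproducer from this issue and return the resulting frame."""
    df = pd.DataFrame({"A": ["1", "", "3"]}, dtype="string")
    mask = df != ""
    if inplace:
        # The return value is ignored on purpose; df is modified directly.
        df.where(mask, np.nan, inplace=True)
        return df
    return df.where(mask, np.nan)


expected = pd.DataFrame({"A": ["1", pd.NA, "3"]}, dtype="string")

for inplace in (False, True):
    result = where_string_na(inplace)
    # Both code paths should agree on values and dtype.
    pd.testing.assert_frame_equal(result, expected)
    # The backing array must hold pd.NA, not a float nan.
    assert result["A"].array[1] is pd.NA
```

In the pandas test suite this would more naturally be a pytest test parametrized over inplace; the plain loop here just keeps the sketch self-contained.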

zachyattack23, Dec 04 '25 03:12