
BUG: `pd.Series.where` incorrectly casts `<NA>` to float `NaN`

Open · asddfl opened this issue 1 month ago · 1 comment

Pandas version checks

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [x] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

print(pd.__version__)
t1 = pd.DataFrame(
    {
        'c0': ['aa'],
        'c1': ['bb']
    }
)

result = t1.assign(c0_t1=lambda df: df['c0'].where(t1['c0'].isin(['c1']), other=pd.NA))
print(result)
print(result['c0_t1'].apply(type))

Output on pandas 2.3.3:

2.3.3
   c0  c1 c0_t1
0  aa  bb  <NA>
0    <class 'pandas._libs.missing.NAType'>
Name: c0_t1, dtype: object

Output on pandas 3.0.0.dev0+2777.g8813fafe66:

3.0.0.dev0+2777.g8813fafe66
   c0  c1 c0_t1
0  aa  bb   NaN
0    <class 'float'>
Name: c0_t1, dtype: object

Issue Description

When using .where(..., other=pd.NA) on a string column, pandas 3.0.0.dev0+2777.g8813fafe66 incorrectly casts the output to float and replaces <NA> with NaN. This behavior is different from pandas 2.3.3.

Expected Behavior

   c0  c1 c0_t1
0  aa  bb  <NA>
0    <class 'pandas._libs.missing.NAType'>
Name: c0_t1, dtype: object

Installed Versions

INSTALLED VERSIONS
------------------
commit                : 9c8bc3e55188c8aff37207a74f1dd144980b8874
python                : 3.10.19
python-bits           : 64
OS                    : Linux
OS-release            : 6.14.0-35-generic
Version               : #35~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Oct 14 13:55:17 UTC 2
machine               : x86_64
processor             : x86_64
byteorder             : little
LC_ALL                : None
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8

pandas                : 2.3.3
numpy                 : 1.26.4
pytz                  : 2025.2
dateutil              : 2.9.0.post0
pip                   : 25.3
Cython                : None
sphinx                : None
IPython               : 8.27.0
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.14.2
blosc                 : None
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : 2025.10.0
html5lib              : None
hypothesis            : None
gcsfs                 : None
jinja2                : 3.1.6
lxml.etree            : None
matplotlib            : 3.10.7
numba                 : 0.61.2
numexpr               : None
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
psycopg2              : None
pymysql               : None
pyarrow               : 22.0.0
pyreadstat            : None
pytest                : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : 1.15.3
sqlalchemy            : 2.0.44
tables                : None
tabulate              : 0.9.0
xarray                : 2025.6.1
xlrd                  : None
xlsxwriter            : None
zstandard             : 0.25.0
tzdata                : 2025.2
qtpy                  : None
pyqt5                 : None
INSTALLED VERSIONS
------------------
commit                : 8813fafe66a361f652dc5d83e41ec11f8055725c
python                : 3.12.0
python-bits           : 64
OS                    : Linux
OS-release            : 6.14.0-35-generic
Version               : #35~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Oct 14 13:55:17 UTC 2
machine               : x86_64
processor             : x86_64
byteorder             : little
LC_ALL                : None
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8

pandas                : 3.0.0.dev0+2777.g8813fafe66
numpy                 : 1.26.4
dateutil              : 2.9.0.post0
pip                   : 25.3
Cython                : None
sphinx                : None
IPython               : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
bottleneck            : None
fastparquet           : None
fsspec                : None
html5lib              : None
hypothesis            : None
gcsfs                 : None
jinja2                : None
lxml.etree            : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
psycopg2              : None
pymysql               : None
pyarrow               : None
pyiceberg             : None
pyreadstat            : None
pytest                : None
python-calamine       : None
pytz                  : None
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
xlsxwriter            : None
zstandard             : None
qtpy                  : None
pyqt5                 : None

asddfl, Dec 03 '25 16:12

This is due to the new default string dtype in pandas 3.0, which uses NaN as its missing-value sentinel.

https://pandas.pydata.org/docs/dev/user_guide/migration-3-strings.html#the-missing-value-sentinel-is-now-always-nan

t1 = pd.DataFrame(
    {
        'c0': ['aa'],
        'c1': ['bb']
    }
)
print(t1.dtypes)
# c0    str
# c1    str
# dtype: object
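
Since the stored sentinel differs between versions, code that needs to detect the masked value may be more robust if it avoids checking the element's type. A minimal sketch, assuming only that the value should count as missing either way (the names `s`, `masked`, and `is_missing` are illustrative, not from the report):

```python
import pandas as pd

s = pd.Series(["aa"])

# No element matches, so the single value is replaced by the missing sentinel.
masked = s.where(s.isin(["c1"]), other=pd.NA)

# isna() treats both pd.NA and float NaN as missing, so this check does not
# depend on which sentinel the column's dtype happens to store.
is_missing = bool(masked.isna().iloc[0])
print(is_missing)
```

This sidesteps `apply(type)`-style checks, which are what surface the difference in the report.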

If you want <NA> as the NA-sentinel, you can opt into the nullable string dtype by passing dtype="string".

t1 = pd.DataFrame(
    {
        'c0': ['aa'],
        'c1': ['bb']
    },
    dtype="string",
)
result = t1.assign(c0_t1=lambda df: df['c0'].where(t1['c0'].isin(['c1']), other=pd.NA))
print(result)
#    c0  c1 c0_t1
# 0  aa  bb  <NA>
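
When the frame already exists and cannot be rebuilt with `dtype="string"`, converting the column with `astype("string")` should have the same effect. This is a sketch under that assumption rather than something stated in the thread; the variable names `c0` and `c0_t1` are illustrative:

```python
import pandas as pd

t1 = pd.DataFrame({"c0": ["aa"], "c1": ["bb"]})

# Convert one existing column to the nullable string dtype.
c0 = t1["c0"].astype("string")

# Mask values that are not in the list; the nullable dtype stores pd.NA.
c0_t1 = c0.where(c0.isin(["c1"]), other=pd.NA)

print(c0_t1.dtype)
print(type(c0_t1.iloc[0]))
```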

You can recover the 2.3.x behavior by setting pd.set_option("future.infer_string", False); however, this option will eventually be removed.

rhshadrach, Dec 04 '25 22:12