BUG: CONTAINS_OP run on pd.NA results in pd.NAType.__bool__ call
Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
pd.NA in [1,2,3]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "missing.pyx", line 392, in pandas._libs.missing.NAType.__bool__
TypeError: boolean value of NA is ambiguous
Issue Description
Checking whether pd.NA is in a list raises TypeError: boolean value of NA is ambiguous.
Why does performing an in operation call the __bool__ method of the pd.NAType class?
Seems a bit similar to the issue regarding incorrect implementation of some operators: https://github.com/pandas-dev/pandas/issues/49828
Expected Behavior
Checking for existence of pd.NA type in any container should correctly return either True or False
Installed Versions
INSTALLED VERSIONS
commit : bdc79c146c2e32f2cab629be240f01658cfb6cc2
python : 3.10.13.final.0
python-bits : 64
OS : Darwin
OS-release : 23.2.0
Version : Darwin Kernel Version 23.2.0: Wed Nov 15 21:55:06 PST 2023; root:xnu-10002.61.3~2/RELEASE_ARM64_T6020
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8
pandas : 2.2.1
numpy : 1.26.3
pytz : 2024.1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : None
pytest : 8.0.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.12.0
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
Thanks for the report - this is a consequence of having comparisons return pd.NA:
print(pd.NA == 1)
# <NA>
During the membership test, Python evaluates pd.NA == 1 for each element, and the result is pd.NA. Python then evaluates the truthiness of that result, which raises the TypeError as reported. As long as we are returning pd.NA on comparisons, I do not believe anything can be done here.
cc @jorisvandenbossche @phofl
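The mechanism described above can be sketched as follows. This is only an illustration, not a proposed fix: `contains_na` is a hypothetical helper name, and the `try`/`except` is there because whether `bool(pd.NA)` raises depends on the pandas version (it does on the 2.2.1 reported here).

```python
import pandas as pd

# A comparison with pd.NA propagates the missing value instead of
# returning True or False.
comparison = pd.NA == 1

# The `in` operator needs a plain bool for each element, so it calls
# bool() on the comparison result. On pandas 2.2.1 this raises TypeError.
try:
    truthiness = bool(comparison)
except TypeError:
    truthiness = None  # ambiguous: no boolean value available


def contains_na(values):
    """Return True if any element is the pd.NA singleton (hypothetical helper).

    An identity check never calls ==, so bool(pd.NA) is never evaluated.
    """
    return any(value is pd.NA for value in values)


print(contains_na([1, 2, 3]))
print(contains_na([1, pd.NA, 3]))
```

Because `pd.NA` is a singleton, the identity check is enough to detect it in a plain Python container without going through `==`.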
We intend to change this to return false (discussed in Basel), should probably get this into 3.0
take
We intend to change this to return false (discussed in Basel), should probably get this into 3.0
@phofl Would this change apply only to boolean ops, or do you anticipate changing the behavior of numerical ops like 1 + pd.NA as well?
No, it's only bool(pd.NA) that we want to change.
@20revsined this is probably not a good issue for a beginner in pandas
I don't know if my issue is related to this, please remove my comment if not!
I have a function which gives me the following output (pd df):
| timestamp | duration | trial_type | blink | message |
|---|---|---|---|---|
| 9199380 | <NA> | NaN | <NA> | RECORD_START |
| 9199345 | 392 | fixation | 0 | NaN |
| etc... |
column dtypes are:
timestamp Int64
duration Int64
trial_type object
blink Int64
message object
dtype: object
To be precise: timestamp and duration hold numerics plus nans, trial_type holds strings plus nans, blink holds numerics (0 and 1) plus nans, and message holds strings plus nans.
Now I wrote a unit test to test the output for the first row:
import numpy as np
import pandas as pd
import pytest


@pytest.mark.parametrize(
    "folder, expected",
    [
        ("emg", [9199380, pd.NA, np.nan, pd.NA, "RECORD_START"]),
        # + *other folders, removed for simplicity*
    ],
)
def test_physioevents_value(folder, expected, eyelink_test_data_dir):
    input_dir = eyelink_test_data_dir / folder
    asc_file = asc_test_files(input_dir=input_dir, suffix="*_events")[0]
    events = _load_asc_file(asc_file)
    events_after_start = _df_events_after_start(events)
    physioevents_reordered = _df_physioevents(events_after_start)
    physioevents_eye1 = _physioevents_eye1(physioevents_reordered)
    # Element-wise == against pd.NA is what triggers the TypeError below
    assert physioevents_eye1.iloc[0].tolist() == expected
And the list obviously looks like this: [9199380, <NA>, nan, <NA>, 'RECORD_START']
I get the following error when running the test:
E AssertionError: assert [9199380, <NA>...CORD_START'] == [9199380, <NA>...CORD_START']
E
E (pytest_assertion plugin: representation of details failed: missing.pyx:392: TypeError: boolean value of NA is ambiguous.
E Probably an object has a faulty repr.)
tests/test_edf2bids.py:670: AssertionError
So I guess I cannot use pd.NA to check if the value in that field is <NA>. However, I also cannot check it using "<NA>", i.e. encoding it as a string.
How can I check whether pd.NA values exist in the dataframe?
I tried changing the dtypes so that every column has the dtype 'object'. However, that's not really what I want.
While somewhat related, this:
How can I check whether pd.NA values exist in the dataframe?
is more of a usage question. Please try asking on StackOverflow first - if you don't get your question resolved in a few days, open a new issue here and link to your SO post. We do this as otherwise we fear our issue tracker would be flooded with usage questions.
Great, thank you for your reply! I already asked on SO a couple of days ago. I'll wait a bit more and then do as you asked if I don't get it resolved otherwise :-)
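For reference, one possible way to do the comparison asked about above, sketched under assumptions: `rows_equal` is a hypothetical helper (not part of pandas) that uses `pd.isna` to handle missing entries before falling back to `==`, so `bool(pd.NA)` is never evaluated. It treats `pd.NA` and `np.nan` as interchangeable, which may or may not be what a given test wants. The small DataFrame only mimics two of the columns from the comment.

```python
import numpy as np
import pandas as pd


def rows_equal(actual, expected):
    """Compare two sequences element-wise, treating missing values as equal.

    Hypothetical helper: pd.NA and np.nan are both considered "missing".
    """
    if len(actual) != len(expected):
        return False
    for a, e in zip(actual, expected):
        a_missing, e_missing = pd.isna(a), pd.isna(e)
        if a_missing or e_missing:
            # Equal only if both sides are missing.
            if not (a_missing and e_missing):
                return False
        elif a != e:
            return False
    return True


# Illustrative data shaped like the first two rows described in the comment.
df = pd.DataFrame(
    {
        "timestamp": pd.array([9199380, 9199345], dtype="Int64"),
        "duration": pd.array([pd.NA, 392], dtype="Int64"),
    }
)

# Whether any missing value exists anywhere in the frame.
has_missing = bool(df.isna().to_numpy().any())

first_row = df.iloc[0].tolist()
print(has_missing, rows_equal(first_row, [9199380, pd.NA]))
```

If the distinction between `pd.NA` and `np.nan` matters, an identity check (`value is pd.NA`) on the individual elements would be the stricter option.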