BUG: CONTAINS_OP run on pd.NA results in pd.NAType.__bool__ call

Open filip-komarzyniec opened this issue 1 year ago • 5 comments

Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

pd.NA in [1,2,3]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "missing.pyx", line 392, in pandas._libs.missing.NAType.__bool__
TypeError: boolean value of NA is ambiguous

Issue Description

Checking whether pd.NA is in a list raises TypeError: boolean value of NA is ambiguous.
Why does the in operation call the __bool__ method of the pd.NAType class?

Seems a bit similar to the issue regarding incorrect implementation of some operators: https://github.com/pandas-dev/pandas/issues/49828

Expected Behavior

Checking for existence of pd.NA type in any container should correctly return either True or False

Installed Versions

INSTALLED VERSIONS

commit : bdc79c146c2e32f2cab629be240f01658cfb6cc2
python : 3.10.13.final.0
python-bits : 64
OS : Darwin
OS-release : 23.2.0
Version : Darwin Kernel Version 23.2.0: Wed Nov 15 21:55:06 PST 2023; root:xnu-10002.61.3~2/RELEASE_ARM64_T6020
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 2.2.1
numpy : 1.26.3
pytz : 2024.1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : None
pytest : 8.0.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.12.0
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

filip-komarzyniec • Mar 25 '24 00:03

Thanks for the report - this is a consequence of having comparisons return pd.NA:

print(pd.NA == 1)
# <NA>

When Python evaluates pd.NA in [1, 2, 3], it checks "is pd.NA == 1" for each element. The result of that comparison is NA, and Python then evaluates the truthiness of this result, which gives you the TypeError as reported. As long as we are returning pd.NA on comparisons, I do not believe anything can be done here.
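The mechanism described above can be sketched without pandas. The class below is a hypothetical stand-in for pd.NA, not the pandas implementation: its comparisons return the missing value itself and its truthiness is refused. The helper name describe_membership is also an assumption made for illustration. For built-in containers such as lists, Python documents x in y as equivalent to any(x is e or x == e for e in y), so the result of each == is converted to a bool.

```python
class AmbiguousNA:
    """Hypothetical stand-in for pd.NA (not the pandas implementation)."""

    def __eq__(self, other):
        # Comparisons propagate the missing value instead of returning a bool.
        return self

    def __bool__(self):
        # Truthiness is refused, mirroring the message in the report.
        raise TypeError("boolean value of NA is ambiguous")

    def __repr__(self):
        return "<NA>"


NA = AmbiguousNA()


def describe_membership(item, container):
    """Evaluate `item in container`, returning the result or the error text."""
    try:
        return item in container
    except TypeError as exc:
        return f"TypeError: {exc}"


# Each element is compared with ==, and the NA result is then passed to bool().
print(describe_membership(NA, [1, 2, 3]))  # TypeError: boolean value of NA is ambiguous

# The identity check comes first, so a matching first element never reaches ==.
print(describe_membership(NA, [NA, 1]))  # True
```

The second call succeeds only because the identity check short-circuits before any comparison is made; if a non-matching element came first, the same TypeError would be raised.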

cc @jorisvandenbossche @phofl

rhshadrach • Mar 25 '24 01:03

We intend to change this to return false (discussed in Basel), should probably get this into 3.0

phofl • Mar 25 '24 01:03

take

20revsined • Mar 29 '24 19:03

We intend to change this to return false (discussed in Basel), should probably get this into 3.0

@phofl Would this change only apply to boolean ops, or do you anticipate changing the behavior of numerical ops like 1 + pd.NA as well?

asishm • Apr 06 '24 16:04

No, it's only bool(pd.NA) that we want to change.

@20revsined this is probably not a good issue for a beginner in pandas
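To illustrate the distinction drawn above between bool(pd.NA) and numerical ops, here is a minimal sketch. It assumes pandas is installed; the helper name truthiness is hypothetical. Because the thread proposes changing bool(pd.NA), the sketch reports whatever the installed version does rather than assuming a particular outcome.

```python
import pandas as pd

# Arithmetic and comparisons with pd.NA propagate the missing value.
sum_result = 1 + pd.NA
eq_result = pd.NA == 1


def truthiness(value):
    """Return bool(value), or the error text if truthiness is refused."""
    try:
        return bool(value)
    except TypeError as exc:
        return f"TypeError: {exc}"


print(sum_result)  # <NA>
print(eq_result)  # <NA>

# On pandas 2.2.1 (the version in the report) this is the TypeError text;
# the result may differ in versions where bool(pd.NA) has been changed.
print(truthiness(eq_result))
```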

phofl • Apr 06 '24 19:04

I don't know if my issue is related to this, please remove my comment if not!

I have a function which gives me the following output (pd df):

timestamp duration trial_type blink message
9199380 <NA> NaN <NA> RECORD_START
9199345 392 fixation 0 NaN
etc...

column dtypes are:

timestamp Int64
duration Int64
trial_type object
blink Int64
message object
dtype: object

To be precise: timestamp and duration hold numerics plus NaNs, trial_type holds strings plus NaNs, blink holds numerics (0 and 1) plus NaNs, and message holds strings plus NaNs.

Now I wrote a unit test to test the output for the first row:

import numpy as np
import pandas as pd
import pytest


@pytest.mark.parametrize(
    "folder, expected",
    [
        ("emg", [9199380, pd.NA, np.nan, pd.NA, "RECORD_START"]),
        # + other folders, removed for simplicity
    ],
)
def test_physioevents_value(folder, expected, eyelink_test_data_dir):
    input_dir = eyelink_test_data_dir / folder
    asc_file = asc_test_files(input_dir=input_dir, suffix="*_events")[0]
    events = _load_asc_file(asc_file)
    events_after_start = _df_events_after_start(events)
    physioevents_reordered = _df_physioevents(events_after_start)
    physioevents_eye1 = _physioevents_eye1(physioevents_reordered)
    assert physioevents_eye1.iloc[0].tolist() == expected

And the list obviously looks like this: [9199380, <NA>, nan, <NA>, 'RECORD_START']

I get the following error when running the test:

E AssertionError: assert [9199380, <NA>...CORD_START'] == [9199380, <NA>...CORD_START']
E
E (pytest_assertion plugin: representation of details failed: missing.pyx:392: TypeError: boolean value of NA is ambiguous.
E Probably an object has a faulty repr.)

tests/test_edf2bids.py:670: AssertionError

So I guess I cannot use pd.NA to check if the value in that field is <NA>. However, I also cannot check it using "<NA>", i.e. encoding it as a string.

How can I check whether pd.NA values exist in the dataframe?

I tried changing the dtypes so that every column has the dtype 'object'. However, that's not really what I want.
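One possible way to compare such a row without triggering bool(pd.NA) is to test for missing values explicitly with pd.isna before comparing. The sketch below is an assumption about what would suit this test, not an approach confirmed in the thread; the helper names values_match and rows_match are hypothetical. Note that it treats pd.NA and np.nan as interchangeable missing values.

```python
import numpy as np
import pandas as pd


def values_match(actual, expected):
    """Compare two scalars, treating any two missing values as equal."""
    if pd.isna(actual) or pd.isna(expected):
        # Equal only if both sides are missing; never evaluates NA == x.
        return bool(pd.isna(actual) and pd.isna(expected))
    return bool(actual == expected)


def rows_match(actual_row, expected_row):
    """Compare two lists of scalars element by element."""
    return len(actual_row) == len(expected_row) and all(
        values_match(a, e) for a, e in zip(actual_row, expected_row)
    )


expected = [9199380, pd.NA, np.nan, pd.NA, "RECORD_START"]
actual = [9199380, pd.NA, float("nan"), pd.NA, "RECORD_START"]
print(rows_match(actual, expected))  # True
```

In the test, the final assertion could then read assert rows_match(physioevents_eye1.iloc[0].tolist(), expected). If pd.NA and np.nan must be told apart, an identity check such as value is pd.NA could be added to values_match.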

julia-pfarr • Jun 19 '24 10:06

While somewhat related, this:

How can I check whether pd.NA values exist in the dataframe?

is more of a usage question. Please try asking on StackOverflow first - if you don't get your question resolved in a few days, open a new issue here and link to your SO post. We do this as otherwise we fear our issue tracker would be flooded with usage questions.

rhshadrach • Jun 19 '24 20:06

Great, thank you for your reply! I already asked on SO a couple of days ago. I'll wait a bit more and then do as you asked if I don't get it resolved otherwise :-)

julia-pfarr • Jun 20 '24 09:06