
BUG: `Index.drop_duplicates()` is inconsistent for unhashable values

Open · camriddell opened this issue 8 months ago · 8 comments

Pandas version checks

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

## example A
import pandas as pd # 2.2.3
df = pd.DataFrame([[1, 2, 3]], columns=['a', ['b', 'c'], ['b', 'c']])

print(df.columns.drop_duplicates())
# Traceback (most recent call last):
#   File "/home/cameron/.vim-excerpt", line 5, in <module>
#     print(df.columns.drop_duplicates())
#           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/indexes/base.py", line 3117, in drop_duplicates
#     if self.is_unique:
#        ^^^^^^^^^^^^^^
#   File "properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
#   File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/indexes/base.py", line 2346, in is_unique
#     return self._engine.is_unique
#            ^^^^^^^^^^^^^^^^^^^^^^
#   File "index.pyx", line 266, in pandas._libs.index.IndexEngine.is_unique.__get__
#   File "index.pyx", line 271, in pandas._libs.index.IndexEngine._do_unique_check
#   File "index.pyx", line 333, in pandas._libs.index.IndexEngine._ensure_mapping_populated
#   File "pandas/_libs/hashtable_class_helper.pxi", line 7115, in pandas._libs.hashtable.PyObjectHashTable.map_locations
# TypeError: unhashable type: 'list'


## --------
## example B
import pandas as pd # 2.2.3
df = pd.DataFrame([[1, 2, 3]], columns=['a', ['b', 'c'], ['b', 'c']])

# hasattr triggers a side effect, after which `df.columns.drop_duplicates()` works.
hasattr(df, 'hello_world')
print(df.columns.drop_duplicates())
# Index(['a', ['b', 'c']], dtype='object')

Issue Description

pandas.Index.drop_duplicates() inconsistently raises TypeError: unhashable type: 'list' when its values include an unhashable element such as a list. The error does not appear to prevent the underlying uniqueness computation from happening. In addition to the reproducible example above, the behavior can be demonstrated directly on an Index object:

If we call .drop_duplicates when the Index contains unhashable types, we observe a TypeError.

import pandas as pd

idx = pd.Index(['a', ['b', 'c'], ['b', 'c']])
idx.drop_duplicates() # TypeError: unhashable type: 'list'

But if we simply ignore the error and call .drop_duplicates() a second time, it works and removes the duplicated entries, including the unhashable ones:

import pandas as pd

idx = pd.Index(['a', ['b', 'c'], ['b', 'c']])
try:
    idx.drop_duplicates()    # TypeError: unhashable type: 'list'
except TypeError:
    pass
print(idx.drop_duplicates()) # Index(['a', ['b', 'c']], dtype='object')

This works because the underlying Index engine populates its hashtable mapping even though the original call to drop_duplicates fails. We can confirm the mapping is populated by inspecting the engine before and after the failing call:

import pandas as pd

idx = pd.Index(['a', ['b', 'c'], ['b', 'c']])
print(idx._engine.mapping)   # None
try:
    idx.drop_duplicates()    # TypeError: unhashable type: 'list'
except TypeError:
    pass
print(idx._engine.mapping)   # <pandas._libs.hashtable.PyObjectHashTable>
print(idx.drop_duplicates()) # Index(['a', ['b', 'c']], dtype='object')
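
The traceback points at Index.is_unique rather than drop_duplicates itself, so the inconsistency seems to live in the cached engine check. The sketch below is an assumption on my part (I have only directly verified the drop_duplicates path above), but on pandas 2.2.3 I would expect is_unique to show the same raise-once-then-succeed pattern:

import pandas as pd

idx = pd.Index(['a', ['b', 'c'], ['b', 'c']])
try:
    idx.is_unique            # expected: TypeError: unhashable type: 'list' (same traceback as above)
except TypeError:
    pass
print(idx.is_unique)         # expected: False, now that the engine mapping exists
print(idx.drop_duplicates()) # expected: Index(['a', ['b', 'c']], dtype='object')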

Finally, it appears that attribute checking on a pandas.DataFrame causes the PyObjectHashTable to be constructed for the column index. This is likely due to the shared code path between __getattr__ and __getitem__.

import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['a', ['b', 'c'], ['b', 'c']])
print(df.columns._engine.mapping)   # None
hasattr(df, 'hello_world')
print(df.columns._engine.mapping)   # <pandas._libs.hashtable.PyObjectHashTable>
print(df.columns.drop_duplicates()) # Index(['a', ['b', 'c']], dtype='object')
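
If the shared-code-path hypothesis is right, any engine-backed lookup on the Index should have the same side effect, no DataFrame required. A small sketch of that check (again an assumption, mirroring the hasattr example above) on pandas 2.2.3:

import pandas as pd

idx = pd.Index(['a', ['b', 'c'], ['b', 'c']])
print(idx._engine.mapping)    # None
'hello_world' in idx          # membership test goes through the engine; the TypeError appears to be swallowed by Index.__contains__
print(idx._engine.mapping)    # expected: <pandas._libs.hashtable.PyObjectHashTable>
print(idx.drop_duplicates())  # expected: Index(['a', ['b', 'c']], dtype='object')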

Expected Behavior

I expect Index.drop_duplicates() to behave the same regardless of whether an attribute has been checked beforehand. The following two snippets should produce equivalent results (whether that is raising an error or returning a deduplicated Index):

## snippet 1
import pandas as pd # 2.2.3
df = pd.DataFrame([[1, 2, 3]], columns=['a', ['b', 'c'], ['b', 'c']])

print(df.columns.drop_duplicates()) # Currently produces → TypeError

## --------
## snippet 2
import pandas as pd # 2.2.3
df = pd.DataFrame([[1, 2, 3]], columns=['a', ['b', 'c'], ['b', 'c']])

hasattr(df, 'hello_world')
print(df.columns.drop_duplicates()) # Currently produces → Index(['a', ['b', 'c']], dtype='object')

Installed Versions

INSTALLED VERSIONS

commit              : 0691c5cf90477d3503834d983f69350f250a6ff7
python              : 3.12.7
python-bits         : 64
OS                  : Linux
OS-release          : 6.6.52-1-lts
Version             : #1 SMP PREEMPT_DYNAMIC Wed, 18 Sep 2024 19:02:04 +0000
machine             : x86_64
processor           :
byteorder           : little
LC_ALL              : None
LANG                : en_US.UTF-8
LOCALE              : en_US.UTF-8

pandas                : 2.2.3
numpy                 : 2.2.2
pytz                  : 2025.1
dateutil              : 2.9.0.post0
pip                   : 25.0.1
Cython                : None
sphinx                : None
IPython               : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
blosc                 : None
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : 2025.2.0
html5lib              : None
hypothesis            : 6.125.3
gcsfs                 : None
jinja2                : 3.1.5
lxml.etree            : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
psycopg2              : None
pymysql               : None
pyarrow               : 19.0.0
pyreadstat            : None
pytest                : 8.3.4
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
xlsxwriter            : None
zstandard             : None
tzdata                : 2025.1
qtpy                  : None
pyqt5                 : None

camriddell · Feb 13 '25 17:02