
BUG: pd.merge fails with numpy.uintc on Windows

Open Noname37486 opened this issue 1 year ago • 2 comments

Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

This bug is fixed by adding np.uintc to the _factorizers mapping used by pd.merge. Could someone please upload the fix? For me, the bug goes away after adding the following entry at line 114 of core/reshape/merge.py: np.uintc: libhashtable.UInt32Factorizer.

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a':['foo','bar'],'b':np.array([1,2], dtype=np.uintc)})
df2 = pd.DataFrame({'a':['foo','baz'],'b':np.array([3,4], dtype=np.uintc)})
df3=df1.merge(df2, how = 'outer')
print(df3)

Issue Description

Traceback (most recent call last):
  File "C:\Users\noname37486\PycharmProjects\pythonProject3\main.py", line 10, in <module>
    df3=df1.merge(df2, how = 'outer')
  File "C:\Users\noname37486\PycharmProjects\pythonProject3\venv\lib\site-packages\pandas\core\frame.py", line 10832, in merge
    return merge(
  File "C:\Users\noname37486\PycharmProjects\pythonProject3\venv\lib\site-packages\pandas\core\reshape\merge.py", line 184, in merge
    return op.get_result(copy=copy)
  File "C:\Users\noname37486\PycharmProjects\pythonProject3\venv\lib\site-packages\pandas\core\reshape\merge.py", line 886, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "C:\Users\noname37486\PycharmProjects\pythonProject3\venv\lib\site-packages\pandas\core\reshape\merge.py", line 1151, in _get_join_info
    (left_indexer, right_indexer) = self._get_join_indexers()
  File "C:\Users\noname37486\PycharmProjects\pythonProject3\venv\lib\site-packages\pandas\core\reshape\merge.py", line 1125, in _get_join_indexers
    return get_join_indexers(
  File "C:\Users\noname37486\PycharmProjects\pythonProject3\venv\lib\site-packages\pandas\core\reshape\merge.py", line 1740, in get_join_indexers
    zipped = zip(*mapped)
  File "C:\Users\v\PycharmProjects\pythonProject3\venv\lib\site-packages\pandas\core\reshape\merge.py", line 1737, in <genexpr>
    _factorize_keys(left_keys[n], right_keys[n], sort=sort)
  File "C:\Users\noname37486\PycharmProjects\pythonProject3\venv\lib\site-packages\pandas\core\reshape\merge.py", line 2539, in _factorize_keys
    klass, lk, rk = _convert_arrays_and_get_rizer_klass(lk, rk)
  File "C:\Users\noname37486\PycharmProjects\pythonProject3\venv\lib\site-packages\pandas\core\reshape\merge.py", line 2616, in _convert_arrays_and_get_rizer_klass
    klass = _factorizers[lk.dtype.type]
KeyError: <class 'numpy.uintc'>
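The final frame of the traceback is a dictionary lookup keyed on the NumPy scalar type, so the KeyError suggests that on the reporter's setup np.uintc is a different class from every key in the table. The check below is a minimal sketch for inspecting this on a given platform without going through pandas. The `lookup` dict is a hypothetical stand-in with two illustrative entries, not the actual pandas `_factorizers` table.

```python
import numpy as np

# Hypothetical stand-in for a dispatch table keyed on NumPy scalar types.
# It is NOT the real pandas _factorizers mapping.
lookup = {
    np.uint32: "UInt32Factorizer",
    np.uint64: "UInt64Factorizer",
}

# The scalar type that an array created with dtype=np.uintc reports.
key = np.array([1, 2], dtype=np.uintc).dtype.type

# Whether np.uintc is literally the same class as np.uint32 depends on the platform.
same_class = key is np.uint32
found = key in lookup

print("np.uintc is np.uint32:", same_class)
print("itemsize in bytes:", np.dtype(np.uintc).itemsize)
print("found in stand-in table:", found)
```

If the first line prints False while the itemsize is 4, a table that lists only np.uint32 will miss np.uintc even though both types have the same width, which matches the failure reported here.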

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.9.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22631
machine : AMD64
processor : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : es_ES.cp1252

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 57.0.0
pip : 21.1.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

Noname37486, May 14 '24 08:05

I have opened a pull request for this issue; kindly check:

  • #58727

Tirthchoksi22, May 15 '24 07:05

xref https://github.com/pandas-dev/pandas/issues/52451#issuecomment-1720767399

simonjayhawkins, May 15 '24 09:05

From https://github.com/pandas-dev/pandas/issues/60091#issue-2609739687

This issue can be resolved by adding uintc similar to the code here for intc:

https://github.com/pandas-dev/pandas/blob/8d2ca0bf84bcf44a800ac19bdb4ed7ec88c555e2/pandas/core/reshape/merge.py#L124-L126

However, we should be checking np.dtype(np.intc).itemsize when doing so and using this to determine the right dtype to map to: if it is 4 we map to libhashtable.UInt32Factorizer and if it is 8 we map to libhashtable.UInt64Factorizer.

rhshadrach, Oct 27 '24 13:10

The resolution of this issue should also cover the np.intc case, likewise by checking itemsize.

rhshadrach, Oct 27 '24 13:10
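The itemsize-based selection suggested in the comments above could look roughly like the following sketch. `factorizer_name_for` is a hypothetical helper written for illustration and is not part of pandas. It returns the factorizer class name as a string rather than importing the private `libhashtable` module, and the signed names (`Int32Factorizer`, `Int64Factorizer`) are assumed here by analogy with the unsigned ones named in the thread.

```python
import numpy as np


def factorizer_name_for(scalar_type):
    """Hypothetical helper: choose a factorizer name from integer kind and itemsize."""
    dtype = np.dtype(scalar_type)

    # Only signed ("i") and unsigned ("u") integer dtypes are handled in this sketch.
    if dtype.kind not in ("i", "u"):
        raise TypeError(f"not an integer dtype: {dtype}")

    # itemsize is in bytes: 4 maps to the 32-bit factorizer, 8 to the 64-bit one.
    bits = dtype.itemsize * 8
    if bits not in (32, 64):
        raise ValueError(f"unsupported integer width: {bits} bits")

    prefix = "UInt" if dtype.kind == "u" else "Int"
    return f"{prefix}{bits}Factorizer"


# Resolve the C-width types by their actual size on this platform.
print("np.uintc ->", factorizer_name_for(np.uintc))
print("np.intc  ->", factorizer_name_for(np.intc))
```

Keying the choice on itemsize rather than on the class itself means the result does not depend on whether np.uintc and np.uint32 happen to be the same class on a given platform.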