BUG: `read_parquet` doesn't convert categories to `pd.CategoricalDtype` when `dtype_backend="pyarrow"`

Open nachomaiz opened this issue 1 year ago • 0 comments

Pandas version checks

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

# prepare test data
test_data = pd.DataFrame({"categorical": ["a", "b", "c", "a", "b", "c"]}, dtype="category")

# confirm correct categorical dtype
print(test_data.select_dtypes("category").shape)  # (6, 1)
print(test_data.select_dtypes(pd.CategoricalDtype).shape)  # (6, 1) [warns]
print(test_data["categorical"].dtype)  # category

# parquet round trip with pyarrow backend
test_data.to_parquet("test.parquet")
loaded_data = pd.read_parquet("test.parquet", dtype_backend="pyarrow")

# check categorical dtype
print(loaded_data.select_dtypes("category").shape)  # (6, 0) [doesn't recognize categorical column]
print(loaded_data.select_dtypes(pd.CategoricalDtype).shape)  # (6, 0) [warns and doesn't recognize categorical column]
print(loaded_data["categorical"].dtype)  # dictionary<values=string, indices=int32, ordered=0>[pyarrow]

# this check works
from pandas.core.dtypes.dtypes import CategoricalDtypeType
print(loaded_data.select_dtypes(CategoricalDtypeType).shape)  # (6, 1) [recognizes categorical column]

Issue Description

With `dtype_backend="pyarrow"`, `read_parquet` does not assign `CategoricalDtype` to categorical columns.

Instead, they are loaded as Arrow dictionary columns (`dictionary<values=string, indices=int32, ordered=0>[pyarrow]`), which `df.select_dtypes("category")` does not select.

Expected Behavior

I expected the round trip through `df.to_parquet` and `pd.read_parquet` to preserve the categorical dtype, so that the loaded column is selected by `df.select_dtypes("category")`.

Installed Versions

INSTALLED VERSIONS

commit                : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python                : 3.12.1.final.0
python-bits           : 64
OS                    : Windows
OS-release            : 10
Version               : 10.0.19045
machine               : AMD64
processor             : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder             : little
LC_ALL                : None
LANG                  : None
LOCALE                : English_United States.1252

pandas                : 2.2.2
numpy                 : 1.26.4
pytz                  : 2024.1
dateutil              : 2.8.2
setuptools            : 69.0.3
pip                   : 24.0
Cython                : None
pytest                : 8.0.0
hypothesis            : None
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : None
lxml.etree            : None
html5lib              : None
pymysql               : None
psycopg2              : None
jinja2                : 3.1.3
IPython               : 8.21.0
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
gcsfs                 : None
matplotlib            : 3.8.3
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : 3.1.2
pandas_gbq            : None
pyarrow               : 15.0.2
pyreadstat            : 1.2.7
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : 1.12.0
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : 2.0.1
zstandard             : None
tzdata                : 2024.1
qtpy                  : None
pyqt5                 : None

May 08 '24 12:05 nachomaiz