BUG: `read_parquet` doesn't convert categories to `pd.CategoricalDtype` when `dtype_backend="pyarrow"`
Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
# prepare test data
test_data = pd.DataFrame({"categorical": ["a", "b", "c", "a", "b", "c"]}, dtype="category")
# confirm correct categorical dtype
print(test_data.select_dtypes("category").shape) # (6, 1)
print(test_data.select_dtypes(pd.CategoricalDtype).shape) # (6, 1) [warns]
print(test_data["categorical"].dtype) # category
# parquet round trip with pyarrow backend
test_data.to_parquet("test.parquet")
loaded_data = pd.read_parquet("test.parquet", dtype_backend="pyarrow")
# check categorical dtype
print(loaded_data.select_dtypes("category").shape) # (6, 0) [doesn't recognize categorical column]
print(loaded_data.select_dtypes(pd.CategoricalDtype).shape) # (6, 0) [warns and doesn't recognize categorical column]
print(loaded_data["categorical"].dtype) # dictionary<values=string, indices=int32, ordered=0>[pyarrow]
# this check works
from pandas.core.dtypes.dtypes import CategoricalDtypeType
print(loaded_data.select_dtypes(CategoricalDtypeType).shape) # (6, 1) [recognizes categorical column]
Issue Description
With `dtype_backend="pyarrow"`, `read_parquet` doesn't seem to assign `CategoricalDtype` to categorical columns.
Instead, they stay as Arrow dictionary types (`dictionary<values=string, indices=int32, ordered=0>[pyarrow]`) and are not selected by `df.select_dtypes("category")`.
Expected Behavior
I expected a round trip through `df.to_parquet` and `pd.read_parquet` to preserve the categorical dtype, so that the loaded column is selected by `df.select_dtypes("category")`.
Installed Versions
INSTALLED VERSIONS
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.12.1.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.8.2
setuptools : 69.0.3
pip : 24.0
Cython : None
pytest : 8.0.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.21.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.3
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 15.0.2
pyreadstat : 1.2.7
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.12.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None