fastparquet
int dtype in a categorical column is lost when used as partition
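What happened: A test case broke with the new version of fastparquet. After writing a DataFrame with an integer-valued categorical column as the partition column and reading it back, the categories come back as strings (object dtype) instead of int64.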
Minimal Complete Verifiable Example: Here is an example showing the trouble:
import os
import pandas as pd
import fastparquet as fp
path = os.path.expanduser('~/Documents/code/data/fastparquet')
file = path + '/test.parquet'
df = pd.DataFrame({'val': range(5),
                   'cat': [1, 1, 1, 2, 2]}).astype({'cat': 'category'})
fp.write(file, df, file_scheme='hive', partition_on=['cat'])
df1 = fp.ParquetFile(file).to_pandas()
df['cat']
Out[102]:
0 1
1 1
2 1
3 2
4 2
Name: cat, dtype: category
Categories (2, int64): [1, 2]
df1['cat']
Out[103]:
0 1
1 1
2 1
3 2
4 2
Name: cat, dtype: category
Categories (2, object): ['2', '1']
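Until this is fixed, a possible workaround is to cast the category labels back to integers after reading. The sketch below is not part of fastparquet: `restore_int_categories` is a hypothetical helper name, and it assumes every label is a string that `int()` can parse. It uses only pandas, and the sample column is built by hand to mimic the `df1['cat']` output above.

```python
import pandas as pd


def restore_int_categories(col: pd.Series) -> pd.Series:
    """Cast string category labels such as '1' and '2' back to integers.

    Hypothetical helper, not a fastparquet API. Assumes every label parses
    with int(); a non-numeric label raises ValueError.
    """
    # Convert each label; the codes (and so the row values) are untouched.
    restored = col.cat.rename_categories(lambda label: int(label))
    # Sort the categories so their order matches a freshly built categorical.
    return restored.cat.reorder_categories(sorted(restored.cat.categories))


# Hand-built stand-in for the column read back from the partitioned file.
readback = pd.Series(
    pd.Categorical(['1', '1', '1', '2', '2'], categories=['2', '1']),
    name='cat',
)
fixed = restore_int_categories(readback)
```

With the example from this report, the call would be `df1['cat'] = restore_int_categories(df1['cat'])`.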
Environment:
- fastparquet version: 0.7.0
- Python version: 3.8.8
- Operating System: Ubuntu 20.04
- Install method (conda, pip, source): conda develop .
Sorry that this one missed the cut. Since it used to work, I expect the fix ought to be fairly simple. I'm not sure when I'll get to it.
No worries, Martin! It's nice that my test cases (for another library built on fastparquet) supplement fastparquet's own :).