int dtype in a categorical column is lost when used as partition

Open yohplala opened this issue 4 years ago • 2 comments

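What happened: A test case broke with the new version of fastparquet. An int categorical column used for partitioning comes back with string (object) categories after a write/read round trip.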

Minimal Complete Verifiable Example: Here is an example showing the trouble:

import os
import pandas as pd
import fastparquet as fp

path = os.path.expanduser('~/Documents/code/data/fastparquet')
file = path + '/test.parquet'

df = pd.DataFrame({'val':range(5),
                   'cat':[1,1,1,2,2]}).astype({'cat':'category'})
fp.write(file, df, file_scheme='hive', partition_on=['cat'])
df1 = fp.ParquetFile(file).to_pandas()

df['cat']
Out[102]: 
0    1
1    1
2    1
3    2
4    2
Name: cat, dtype: category
Categories (2, int64): [1, 2]

df1['cat']
Out[103]: 
0    1
1    1
2    1
3    2
4    2
Name: cat, dtype: category
Categories (2, object): ['2', '1']
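
A possible stopgap until this is fixed is to cast the categories back to integers after reading. The sketch below uses only pandas and assumes the read-back column looks like the `df1['cat']` output above. The column is built by hand here to mimic that output, not produced by fastparquet, and the names `cat_read` and `cat_fixed` are illustrative only.

```python
import pandas as pd

# Hand-built stand-in for df1['cat']: string categories in the order
# shown above ('2', '1') instead of the original int64 categories.
cat_read = pd.Series(
    pd.Categorical(['1', '1', '1', '2', '2'], categories=['2', '1']),
    name='cat',
)

# Cast the values to int64 first, then back to category, so that the
# categories are rebuilt from integer values.
cat_fixed = cat_read.astype('int64').astype('category')

print(cat_fixed.cat.categories)
```

This only repairs the dtype on the reader side. It does not change what is stored in the hive directory names, so it would need to be applied after every read.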

Environment:

  • fastparquet version: 0.7.0
  • Python version: 3.8.8
  • Operating System: Ubuntu 20.04
  • Install method (conda, pip, source): conda develop .

yohplala, Aug 01 '21 18:08

Sorry that this one missed the cut. Since it used to work, I expect the fix ought to be fairly simple. I'm not sure when I'll get to it.

martindurant, Aug 09 '21 12:08

No worries, Martin! I find it nice that my test cases (for another lib based on fastparquet) supplement those of fastparquet :).

yohplala, Aug 09 '21 20:08