
Bug with a specific combination of 'write' parameters: compression + partition_on with str values

Open yohplala opened this issue 4 years ago • 2 comments

Hi, the configuration needed for this bug to happen was difficult to formalize; it took me a while. I notice that filters in to_pandas is ineffective when the parquet file was originally written with the following configuration:

  • the values in the column specified for partition_on are str (object) (I confirm the bug does not occur when they are int)
  • compression is used, at least with BROTLI (I confirm the bug does not occur without compression)

In this case, reading the DataFrame back while filtering on values from the partition_on column does not work: the output DataFrame is empty.

Here is a script to reproduce the bug.

import pandas as pd
import fastparquet as fp
from os import path as os_path

# Setup test data
dr = pd.date_range(start='2021/1/1 08:00', periods=3, freq='2H')
df = pd.DataFrame({'ts': dr})
grps = df.groupby(pd.Grouper(key='ts', freq='4H', origin='start_day'))
df['period'] = grps['ts'].transform(lambda x: str(int(x.name.timestamp())))

# Write
file = os_path.expanduser('~/Documents/code/data/test.parquet')
fp.write(file, df, compression='BROTLI', file_scheme='hive', partition_on=['period'])

# Show bug
periods = df['period'].unique()
df_rec = fp.ParquetFile(file).to_pandas(filters=[('period', 'in', periods)])

Printing df_rec:

In [17]: df_rec
Out[17]: 
Empty DataFrame
Columns: [ts, period]
Index: []

If you change any one of the following settings, the filter works:

  • no compression
  • no partition_on
  • the 'period' column contains int instead of str

For instance, with int:

import pandas as pd
import fastparquet as fp
from os import path as os_path

# Setup test data
dr = pd.date_range(start='2021/1/1 08:00', periods=3, freq='2H')
df = pd.DataFrame({'ts': dr})
grps = df.groupby(pd.Grouper(key='ts', freq='4H', origin='start_day'))
df['period'] = grps['ts'].transform(lambda x: int(x.name.timestamp()))    # just a small change here

# Write
file = os_path.expanduser('~/Documents/code/data/test.parquet')
fp.write(file, df, compression='BROTLI', file_scheme='hive', partition_on=['period'])

# Show no bug
periods = df['period'].unique()
df_rec = fp.ParquetFile(file).to_pandas(filters=[('period', 'in', periods)])

Printing df_rec:

In [19]: df_rec
Out[19]: 
                   ts      period
0 2021-01-01 08:00:00  1609488000
1 2021-01-01 10:00:00  1609488000
2 2021-01-01 12:00:00  1609502400

yohplala avatar Jan 31 '21 09:01 yohplala

I am stumped! I have no idea why the compression should matter, since the values encoded in the path by partition_on are not compressed at all. That it might depend on the type of the values is less surprising, since fastparquet attempts to convert the (string) values encoded in the path back into whatever the original pandas type was, so for an 'in' filter the types would need to match for rows to pass. You can check what was inferred with

pf = fastparquet.ParquetFile(..)
pf.cats

See the function filter_out_cats for how the values get used for comparison.
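The type-matching concern can be illustrated without parquet at all. This is just a sketch (not fastparquet's actual code path, and the values are hypothetical): hive-style partition values always start life as strings decoded from directory names, so if the inferred conversion does not fire, comparing them against int filter values with 'in' matches nothing.

```python
# Hypothetical partition values as decoded from hive-style directory names:
# on disk they are always strings ("period=1609488000" -> "1609488000").
path_values = ['1609488000', '1609502400']

# Filter values taken from the original int-typed DataFrame column.
filter_values = [1609488000, 1609502400]

# Without a type conversion, no string equals any int,
# so an 'in' filter would keep no partitions.
print([v in filter_values for v in path_values])        # [False, False]

# Converting back to the original type first makes the filter behave.
print([int(v) in filter_values for v in path_values])   # [True, True]
```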

martindurant avatar Feb 01 '21 16:02 martindurant

Hi @martindurant, thanks for the feedback. For the moment I am only reporting the bug, as I am focused on other topics ;). Best,

yohplala avatar Feb 02 '21 08:02 yohplala