ENH: Load the full data (all columns, categories, partitions) with 'read_row_group_file'
When using read_row_group_file():
- it is compulsory to provide columns and categories;
- also, the way to specify loading all partitions, or only some of them, is not clear.
This is shown in the example below.
import pandas as pd
import fastparquet as fp
from os import path as os_path
part = [1, 1, 2, 2, 1, 1, 2, 2, 1, 1]
df = pd.DataFrame({'part': part,
                   'val': range(len(part))})
path = os_path.expanduser('~/Documents/code/test_part')
fp.write(path, df, row_group_offsets=2, file_scheme='hive',
         partition_on=['part'])
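(For context: with file_scheme='hive' and partition_on=['part'], the 'part' values are encoded in sub-directory names such as part=1/ and part=2/ rather than stored inside the data files, which is presumably why the partitioned column needs special handling on read.)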
pf = fp.ParquetFile(path)
# This is not working. Ideally, all data would be loaded (all columns, categories, partitions)
df1 = pf.read_row_group_file(pf.row_groups[0])
Traceback (most recent call last):
File "<ipython-input-17-16f21a2cfbcd>", line 1, in <module>
df1 = pf.read_row_group_file(pf.row_groups[0])
TypeError: read_row_group_file() missing 2 required positional arguments: 'columns' and 'categories'
Also, as said, it is not clear how to load partitions.
There is no partitions parameter, and when seemingly requesting all columns and all categories, the partitioned column is not loaded.
df1 = pf.read_row_group_file(pf.row_groups[0], columns=pf.columns,
                             categories=pf.categories)
df1
Out[19]:
val
0 0
1 1
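For reference, the higher-level entry points do load the partitioned column without any required arguments. A possible interim workaround (a sketch, relying on the existing to_pandas() and iter_row_groups() methods, whose arguments are all optional):
# to_pandas() loads the whole dataset, 'part' column included.
df_all = pf.to_pandas()
# iter_row_groups() yields one dataframe per row group, which is
# close to what is asked for here; take the first one.
df_rg0 = next(pf.iter_row_groups())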
Please,
- could a 'simplified' call to read_row_group_file() be implemented to load all columns, categories and partitions?
- how can partitions be loaded as well?
Thanks in advance for your help. Best,
I feel like this basic workflow could be achieved by dask. Of course, writing separate functionality is perfectly possible, but I wonder if it would end up as extra effort.
Martin, I do not understand your feedback.
read_row_group_file() already exists.
Currently, its signature is:
def read_row_group_file(self, rg, columns, categories, index=None,
                        assign=None, partition_meta=None, row_filter=False,
                        infile=None):
What I am asking is whether we could instead have:
def read_row_group_file(self, rg, columns=None, categories=None, index=None,
                        assign=None, partition_meta=None, row_filter=False,
                        infile=None):
in which case all columns, categories, and partitions would be loaded.
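As a minimal sketch of the defaulting this implies (hypothetical, not the actual implementation; it assumes pf.columns lists the data columns and pf.cats the partition columns, as the ParquetFile attributes suggest):
def read_row_group_file(self, rg, columns=None, categories=None, index=None,
                        assign=None, partition_meta=None, row_filter=False,
                        infile=None):
    # Hypothetical defaulting: fall back to everything the file
    # describes, partition columns included.
    if columns is None:
        columns = self.columns + list(self.cats)
    if categories is None:
        categories = self.categories
    ...  # rest of the method unchanged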
I totally didn't understand that. Yes, making the signature of that method more flexible is totally fine.
Yes, I think you meant to react to #658, which is another ticket I wrote this morning.