ENH: Load the full data (all columns, categories, partitions) with 'read_row_group_file'
When using read_row_group_file():
- it is compulsory to provide columns and categories;
- also, the way to specify loading all partitions, or only some of them, is not clear.
This is shown in the example below.
import pandas as pd
import fastparquet as fp
from os import path as os_path
part = [1, 1, 2, 2, 1, 1, 2, 2, 1, 1]
df = pd.DataFrame({'part': part,
                   'val': range(len(part))})
path = os_path.expanduser('~/Documents/code/test_part')
fp.write(path, df, row_group_offsets=2, file_scheme='hive',
         partition_on=['part'])
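(For context: with file_scheme='hive' and partition_on=['part'], the 'part' values are encoded in sub-directory names such as part=1/ and part=2/ rather than stored inside the data files, which is presumably why the partitioned column needs special handling on read.)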
pf = fp.ParquetFile(path)
# This is not working. Ideally, all data would be loaded (all columns, categories, partitions)
df1 = pf.read_row_group_file(pf.row_groups[0])
Traceback (most recent call last):
File "<ipython-input-17-16f21a2cfbcd>", line 1, in <module>
df1 = pf.read_row_group_file(pf.row_groups[0])
TypeError: read_row_group_file() missing 2 required positional arguments: 'columns' and 'categories'
Also, as said, it is not clear how to load partitions.
There is no partitions parameter, and when seemingly requesting all columns and all categories, the partitioned column is not loaded.
df1 = pf.read_row_group_file(pf.row_groups[0], columns=pf.columns,
                             categories=pf.categories)
df1
Out[19]:
val
0 0
1 1
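For reference, the higher-level entry points do load the partitioned column without any required arguments. A possible interim workaround (a sketch, relying on the existing to_pandas() and iter_row_groups() methods, whose arguments are all optional):
# to_pandas() loads the whole dataset, 'part' column included.
df_all = pf.to_pandas()
# iter_row_groups() yields one dataframe per row group, which is
# close to what is asked for here; take the first one.
df_rg0 = next(pf.iter_row_groups())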
Please,
- could a 'simplified' call to read_row_group_file() be implemented to load all columns, categories and partitions?
- how can partitions be loaded as well?
Thanks in advance for your help. Best,
I feel like this basic workflow could be achieved by dask. Of course, writing separate functionality is perfectly possible, but I wonder if it would end up as extra effort.
Martin, I do not understand your feedback.
read_row_group_file() already exists.
Currently, its signature is:
def read_row_group_file(self, rg, columns, categories, index=None,
                        assign=None, partition_meta=None, row_filter=False,
                        infile=None):
What I am asking is whether we could instead have:
def read_row_group_file(self, rg, columns=None, categories=None, index=None,
                        assign=None, partition_meta=None, row_filter=False,
                        infile=None):
in which case all columns, categories, and partitions would be loaded.
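As a minimal sketch of the defaulting this implies (hypothetical, not the actual implementation; it assumes pf.columns lists the data columns and pf.cats the partition columns, as the ParquetFile attributes suggest):
def read_row_group_file(self, rg, columns=None, categories=None, index=None,
                        assign=None, partition_meta=None, row_filter=False,
                        infile=None):
    # Hypothetical defaulting: fall back to everything the file
    # describes, partition columns included.
    if columns is None:
        columns = self.columns + list(self.cats)
    if categories is None:
        categories = self.categories
    ...  # rest of the method unchanged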
I totally didn't understand that. Yes, making the signature of that method more flexible is totally fine.
Yes, I think you meant to react to #658, which is another ticket I wrote this morning.