Get list of valid parquet files in directory

Open pyrito opened this issue 3 years ago • 7 comments

Every engine handles valid parquet files in a directory differently. PyArrow exposes a property that returns the list of absolute file paths in the Dataset source (see here). Could we do something similar for fastparquet?

pyrito avatar Aug 10 '22 13:08 pyrito

Hi @pyrito, if I understand correctly, fastparquet already does that, though maybe not in a single step.

import fastparquet as fp
import pandas as pd
import os

# Example data
df = pd.DataFrame({'a':range(6)})
pq_path = os.path.expanduser('~/Documents/code/data/pq_test')
fp.write(pq_path, df, row_group_offsets=[0,2,4], file_scheme='hive')
pf=fp.ParquetFile(pq_path)

# Get the base path (make sure to prefix it to the file paths obtained below)
In [19]: pf.basepath
Out[19]: '/home/yoh/Documents/code/data/pq_test'

# Get the list of file paths
my_path = [rg.columns[0].file_path for rg in pf.row_groups]
    
In[21]: my_path
Out[21]: ['part.0.parquet', 'part.1.parquet', 'part.2.parquet']

You could also use a built-in function. The result is the same, but without duplicates when several row groups are stored in the same file.

from fastparquet.api import row_groups_map

rg_map = list(row_groups_map(pf.row_groups).keys())

In [23]: rg_map
Out[23]: ['part.0.parquet', 'part.1.parquet', 'part.2.parquet']

If you are using partitions, the partition names will show up in the file paths as well, as in the PyArrow example you provided. Best regards,

PS: if you would like the documentation to state this, I am sure a PR would be welcome ;)

yohplala avatar Aug 10 '22 14:08 yohplala

PPS: if this answers your request, please feel free to close the ticket.

yohplala avatar Aug 10 '22 14:08 yohplala

@yohplala thank you for the quick response! I don't think this would work for every case. For example, if I do something like this:

import pandas
import fastparquet
import numpy as np

df = pandas.DataFrame(np.random.randint(0, 100, size=(int(2**18), 2**8))).add_prefix('col')
df.to_parquet("testing/test.parquet")
# This works as expected
df = pandas.read_parquet("testing/test.parquet", engine='fastparquet')

f = fastparquet.ParquetFile("testing/test.parquet")

In [30]: f.basepath
Out[30]: 'testing/test.parquet'

In [28]: my_path = [rg.columns[0].file_path for rg in f.row_groups]

# This should still contain at least `test.parquet`
In [29]: my_path
Out[29]: [None]

pyrito avatar Aug 10 '22 14:08 pyrito

The logic is already kind of implemented here: https://github.com/dask/fastparquet/blob/main/fastparquet/api.py#L151-L155

pyrito avatar Aug 10 '22 14:08 pyrito

You are right, that is exactly the logic that is used, and I don't mind it being moved or replicated in a utility function. However, fastparquet always allows you to pass a single data file path or a list of paths, and in that case it will read them unmodified, without any filename filter. This is what happens in your example (the .file_path attributes are pointers from the root directory of a dataset, but in this case there is no directory).

martindurant avatar Aug 10 '22 14:08 martindurant

@martindurant that makes sense. You mention an important caveat but I think it would still be helpful to have it saved as an attribute or called through another function.

pyrito avatar Aug 10 '22 15:08 pyrito

Note that for the single-file case, you do have the path available as the .fn attribute. In the case of multi-file datasets, this will be the effective root of the dataset.

martindurant avatar Aug 11 '22 20:08 martindurant
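
Putting the points from this thread together, here is a minimal sketch of a helper that covers both cases. It is not part of fastparquet: the function name `resolve_data_files` and its signature are hypothetical. It assumes, as described above, that `file_path` is relative to `basepath` for multi-file datasets and is `None` for a single file, in which case the path stored in `.fn` is used instead.

```python
import os


def resolve_data_files(fn, basepath, file_paths):
    """Return de-duplicated data file paths, in first-seen order.

    Hypothetical helper, not a fastparquet API.
    fn         -- the path the ParquetFile was opened with (pf.fn)
    basepath   -- root directory of the dataset (pf.basepath)
    file_paths -- one entry per row group (rg.columns[0].file_path);
                  None means the row group lives in `fn` itself
    """
    resolved = []
    for rel in file_paths:
        # Single-file case: no relative pointer is stored, fall back to fn.
        full = fn if rel is None else os.path.join(basepath, rel)
        # Several row groups may share one file; keep each path once.
        if full not in resolved:
            resolved.append(full)
    return resolved
```

With a `ParquetFile` named `pf`, this could be called as `resolve_data_files(pf.fn, pf.basepath, [rg.columns[0].file_path for rg in pf.row_groups])`. The helper takes plain strings rather than the `ParquetFile` itself so that it can be checked without any parquet data on disk.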