fastparquet
Get list of valid parquet files in directory
Every engine handles valid parquet files in a directory differently. PyArrow has a property that lets users get the list of absolute file paths backing a Dataset (see here). Could we do something similar for fastparquet?
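For reference, a minimal sketch of the PyArrow behaviour in question (the directory name pq_dir/ is hypothetical):
import pyarrow.dataset as ds
# A directory-backed dataset exposes the paths of its data files.
dataset = ds.dataset("pq_dir/", format="parquet")
print(dataset.files)  # e.g. ['/abs/path/pq_dir/part-0.parquet', ...]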
Hi @pyrito, if I understand correctly, fastparquet already does that, though maybe not in a single call.
import fastparquet as fp
import pandas as pd
from os import path as os_path
# Example data
df = pd.DataFrame({'a':range(6)})
pq_path = os_path.expanduser('~/Documents/code/data/pq_test')
fp.write(pq_path, df, row_group_offsets=[0,2,4], file_scheme='hive')
pf = fp.ParquetFile(pq_path)
# Get the base path (make sure to prefix it to the file paths below)
In [19]: pf.basepath
Out[19]: '/home/yoh/Documents/code/data/pq_test'
# Get the list of file paths (relative to the base path)
my_path = [rg.columns[0].file_path for rg in pf.row_groups]
In[21]: my_path
Out[21]: ['part.0.parquet', 'part.1.parquet', 'part.2.parquet']
You could also use a built-in function; the result will be the same, without duplicates in case several row groups are in the same file.
from fastparquet.api import row_groups_map
rg_map = list(row_groups_map(pf.row_groups).keys())
In[23]: rg_map
Out[23]: ['part.0.parquet', 'part.1.parquet', 'part.2.parquet']
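To get full paths, a minimal sketch joining pf.basepath with the relative paths obtained above:
# Prefix the base path to obtain full paths to the data files.
full_paths = [os_path.join(pf.basepath, p) for p in my_path]
# e.g. ['/home/yoh/Documents/code/data/pq_test/part.0.parquet', ...]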
If you are using partitions, the partition names will show up in the file paths as well, as in the PyArrow example you provided (see the sketch at the end of this comment). Best regards,
PS: if you would like the documentation to state this, I am sure a PR about it would be welcome ;)
PPS: if this answers your request, please feel free to close the ticket.
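For illustration, a hedged sketch of the partitioned case (the second path and the column 'b' are hypothetical; the output file names are indicative only):
# Hypothetical partitioned write: 'b' becomes a partition column.
df2 = pd.DataFrame({'a': range(6), 'b': ['x', 'x', 'y', 'y', 'z', 'z']})
pq_path2 = os_path.expanduser('~/Documents/code/data/pq_test_partitioned')
fp.write(pq_path2, df2, file_scheme='hive', partition_on=['b'])
pf2 = fp.ParquetFile(pq_path2)
[rg.columns[0].file_path for rg in pf2.row_groups]
# e.g. ['b=x/part.0.parquet', 'b=y/part.0.parquet', 'b=z/part.0.parquet']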
@yohplala thank you for the quick response! I don't think this would work for every case. For example, if I do something like this:
import pandas
import fastparquet
import numpy as np
df = pandas.DataFrame(np.random.randint(0, 100, size=(int(2**18), 2**8))).add_prefix('col')
df.to_parquet("testing/test.parquet")
# This works as expected
df = pandas.read_parquet("testing/test.parquet", engine='fastparquet')
f = fastparquet.ParquetFile("testing/test.parquet")
In [30]: f.basepath
Out[30]: 'testing/test.parquet'
In [28]: my_path = [rg.columns[0].file_path for rg in f.row_groups]
# This should still contain at least `test.parquet`
In [29]: my_path
Out[29]: [None]
The logic is already kind of implemented here: https://github.com/dask/fastparquet/blob/main/fastparquet/api.py#L151-L155
You are right, that is exactly the logic that is used, and I don't mind it being moved or replicated in a utility function. However, fastparquet always allows you to pass a single data file path or a list of paths, and will in that case read them unmodified, without any filename filtering. This is what happens in your example (the .file_path attributes are paths relative to the root directory of a dataset, but in this case there is no directory).
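For illustration, a minimal sketch of passing explicit paths as described above (paths reuse the earlier single-file example):
# fastparquet also accepts an explicit list of data file paths and reads
# them unmodified, without any filename filtering.
f_list = fastparquet.ParquetFile(["testing/test.parquet"])
f_list.to_pandas().shape  # same data as reading the single file directly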
@martindurant that makes sense. You mention an important caveat, but I think it would still be helpful to have this saved as an attribute or exposed through another function.
Note that for the single-file case, you do have the path available as the .fn attribute. In the case of multi-file datasets, this will be the effective root of the dataset.
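For illustration, a hedged sketch of the kind of utility discussed in this thread, built only from attributes mentioned above (pf.fn, pf.basepath and the row-group file_path entries); the function name is hypothetical and this is not part of fastparquet's public API:
import os
import fastparquet

def parquet_data_files(pf):
    """Return the data files backing a ParquetFile (hypothetical helper)."""
    paths = []
    for rg in pf.row_groups:
        file_path = rg.columns[0].file_path
        if file_path is None:
            # Single-file case: the footer carries no relative path, but the
            # ParquetFile knows which file it was opened from (.fn).
            paths.append(pf.fn)
        else:
            # Multi-file / hive case: file_path is relative to the dataset root.
            paths.append(os.path.join(pf.basepath, file_path))
    # Deduplicate while preserving order (several row groups can share a file).
    return list(dict.fromkeys(paths))

parquet_data_files(fastparquet.ParquetFile("testing/test.parquet"))
# e.g. ['testing/test.parquet']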