read_parquet should fail if a glob pattern results in multiple matches
Problem description
Currently (0.15.7), if I call read_parquet with a glob pattern (with an s3:// prefix, which might be relevant) and the pattern has multiple matches, read_parquet silently reads only the first match (whichever that is; I am not sure how it is selected).
This makes it really easy to read only a single parquet file when you meant to read multiple (e.g. part-0.parquet and part-1.parquet via /*.parquet). However, you only get the result of a single file!
I would like to have polars throw an error if I get data from a single file from a glob pattern when I have multiple matches to make sure I don't load partial data by accident.
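Until polars handles this natively, one user-side guard is to expand the glob yourself and read every match explicitly, so a multi-match pattern can never silently collapse to one file. A minimal sketch for local paths (resolve_glob is a hypothetical helper; for s3:// paths you would expand with s3fs.S3FileSystem().glob instead of the stdlib glob):

```python
import glob

def resolve_glob(pattern: str) -> list[str]:
    # Expand the pattern ourselves and fail loudly on zero matches,
    # instead of letting a reader silently pick the first file.
    matches = sorted(glob.glob(pattern))
    if not matches:
        raise FileNotFoundError(f"no files match {pattern!r}")
    return matches

# Usage sketch: read every match explicitly so no partial data slips through.
# df = pl.concat([pl.read_parquet(f) for f in resolve_glob("data/*.parquet")])
```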
I think this is related to fsspec; locally we read all matches of the glob pattern.
It looks like fsspec.open returns only the first file when given a glob pattern (a path containing *). The proper call, which can handle glob patterns, would be fsspec.open_files, which returns a list of files instead.
I think (without testing it!) the problem might be in this if-condition in polars.internals.io._prepare_file_args: it should exclude glob patterns in the filename (e.g. "*" not in file), and the next if-condition should then deal with them instead.
If you want, I can open a PR for this.
I also ran into this problem. I have created a colab notebook that highlights it and compares the expected behavior with dask: https://colab.research.google.com/drive/1vosWKSZlBun9W8yI9pO3QI5Mww6HNSHi?usp=sharing
I hope that helps with debugging the problem.
Same problem here. Current workaround is using pyarrow dataset or something like the code below. But I would prefer a polars native solution.
import s3fs
import polars as pl

fs = s3fs.S3FileSystem()
files = fs.glob("path/to/files/*")
df = pl.concat(
    pl.collect_all(
        [pl.scan_parquet(f).filter(...).with_columns(...) for f in files]
    )
)
It seems that there is no open_files in s3fs (the fsspec S3 implementation), which is what polars uses.
I ran into this issue when dealing with a Hive dataset and ended up publishing a one-class package with some of the workaround code to deal with Hive-flavor partitioning in polars.
The best approach I could find to iterate through partitions was using pyarrow dataset fragment expressions.
Another approach is to use fsspec's ls on top of the S3 driver and use the result to load those files into a dataframe, but for a Hive-flavor dataset you will have to make sure you add the partition column values after loading them:
fs = fsspec.filesystem(self.location.scheme)
parquet_files = [
    fragment_location
    for fragment_location in fs.ls(partition_location, detail=False)
    if fragment_location.endswith(".parquet")
]
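Re-attaching the partition columns can be done by parsing the key=value segments out of each fragment path; a minimal sketch (hive_partition_values is a hypothetical helper, not part of any package mentioned above):

```python
import re

def hive_partition_values(path: str) -> dict[str, str]:
    # Pull "key=value" directory segments out of a Hive-style path, e.g.
    # ".../year=2023/month=01/part-0.parquet" -> {"year": "2023", "month": "01"}
    return dict(re.findall(r"([^/=]+)=([^/=]+)", path))

# Usage sketch: re-attach the partition columns as literal values.
# df = pl.read_parquet(f).with_columns(
#     [pl.lit(v).alias(k) for k, v in hive_partition_values(f).items()]
# )
```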
I hit this too. FWIW - I think this is probably a bug rather than an enhancement...
Closed by #10098