
read_parquet should fail if a glob pattern results in multiple matches

Open · bneijt opened this issue 2 years ago

Problem description

Currently (0.15.7), if I call read_parquet with a glob pattern (with an s3:// prefix, which might be relevant) and the pattern has multiple matches, read_parquet simply reads only the first match (whichever that is; I am not sure how it is selected).

This makes it really easy to read only a single parquet file when you meant to read multiple (e.g. part-0.parquet and part-1.parquet matched by /*.parquet). You silently get the result of a single file!

I would like polars to throw an error when a glob pattern has multiple matches but data would only be loaded from a single file, so that I don't load partial data by accident.
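Until polars guards against this itself, one user-side workaround is to expand the pattern up front and pass the explicit file list, so no match can be silently dropped. A minimal sketch for local paths using the stdlib glob module (strict_glob is a hypothetical helper, not a polars API):

```python
import glob


def strict_glob(pattern: str) -> list[str]:
    """Expand a glob pattern, failing loudly when nothing matches.

    Hypothetical helper: returning the full, sorted match list makes it
    impossible to read only the first file by accident.
    """
    matches = sorted(glob.glob(pattern))
    if not matches:
        raise FileNotFoundError(f"no files match {pattern!r}")
    return matches
```

Each path in the returned list can then be read and concatenated explicitly, instead of handing the raw pattern to a reader whose glob behavior is uncertain.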

bneijt avatar Dec 20 '22 09:12 bneijt

I think this is related to fsspec. Locally we read all matches of the glob pattern.

ritchie46 avatar Dec 20 '22 10:12 ritchie46

It looks like fsspec.open returns only the first file when given a glob pattern (a path containing *). The proper function for handling glob patterns is fsspec.open_files, which returns a list of files instead.

I think (without testing it!) the problem might be this if-condition in polars.internals.io._prepare_file_args: it should exclude glob patterns in the filename (e.g. "*" not in file) so that the next if-condition handles them instead.
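The routing check being proposed can be sketched independently of polars internals; looks_like_glob below is a hypothetical helper illustrating the idea, not the actual code in _prepare_file_args:

```python
def looks_like_glob(path: str) -> bool:
    """Return True if the path contains glob metacharacters and should
    therefore be routed to fsspec.open_files rather than fsspec.open."""
    return any(ch in path for ch in "*?[")
```

A path failing this check could keep the current fsspec.open code path, while anything matching it would go through open_files so that all matches are read.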

If you want, I can open a PR for this.

JMrziglod avatar Feb 16 '23 13:02 JMrziglod

I also ran into this problem. I have created a colab notebook that highlights it and compares the expected behavior with dask: https://colab.research.google.com/drive/1vosWKSZlBun9W8yI9pO3QI5Mww6HNSHi?usp=sharing

I hope that helps with debugging the problem.

tocab avatar Mar 06 '23 09:03 tocab

Same problem here. My current workaround is to use a pyarrow dataset or something like the code below, but I would prefer a polars-native solution.

import s3fs
import polars as pl

fs = s3fs.S3FileSystem()
files = fs.glob("path/to/files/*")

df = pl.concat(
    pl.collect_all(
        [pl.scan_parquet(f).filter(...).with_columns(...) for f in files]
    )
)

legout avatar Mar 10 '23 11:03 legout

> It looks like fsspec.open returns only the first file when given a glob pattern (a path containing *). The proper function for handling glob patterns is fsspec.open_files, which returns a list of files instead.
>
> I think (without testing it!) the problem might be this if-condition in polars.internals.io._prepare_file_args: it should exclude glob patterns in the filename (e.g. "*" not in file) so that the next if-condition handles them instead.
>
> If you want, I can open a PR for this.

It seems that there is no open_files in s3fs (the fsspec S3 implementation), which is what polars uses.

legout avatar Mar 10 '23 11:03 legout

I ran into this issue when dealing with a Hive dataset and ended up publishing a one-class package with some of the workaround code to deal with Hive-flavor partitioning in polars.

The best approach I could find for iterating through partitions was to use pyarrow dataset fragment expressions.

Another approach is to use fsspec's ls on top of the S3 driver and load the resulting files into a dataframe. For a Hive-flavor dataset, you will have to make sure you add the partition column values after loading:

import fsspec

# self.location and partition_location come from the surrounding class
fs = fsspec.filesystem(self.location.scheme)
parquet_files = [
    fragment_location
    for fragment_location in fs.ls(partition_location, detail=False)
    if fragment_location.endswith(".parquet")
]
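For the Hive-flavor case, the partition column values encoded in the path can be recovered by parsing the key=value segments. A small sketch (hive_partition_values is a hypothetical name, not part of any library mentioned above):

```python
def hive_partition_values(path: str) -> dict:
    """Parse Hive-style partition segments from a file path, e.g.
    's3://bucket/t/year=2023/month=03/part-0.parquet'
    yields {'year': '2023', 'month': '03'}."""
    values = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        # Keep only real key=value directory segments, not data files.
        if sep and key and not segment.endswith(".parquet"):
            values[key] = value
    return values
```

These values can then be attached as literal columns after the fragment's data is loaded.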

bneijt avatar Mar 10 '23 15:03 bneijt

I hit this too. FWIW, I think this is probably a bug rather than an enhancement...

PaulRudin avatar Mar 30 '23 11:03 PaulRudin

Closed by #10098

cjackal avatar Jul 30 '23 15:07 cjackal