
read_parquet should fail if a glob pattern results in multiple matches

Open · bneijt opened this issue 2 years ago

Problem description

Currently (0.15.7), if I call read_parquet with a glob pattern (with an s3:// prefix, which might be relevant) and the pattern has multiple matches, read_parquet simply reads only the first match (whichever that is; I am not sure how it is selected).

This makes it really easy to read only a single parquet file when you meant to read multiple (e.g. part-0.parquet and part-1.parquet matched by /*.parquet). You silently get the result of a single file!

I would like polars to throw an error when a glob pattern has multiple matches but data would only be loaded from a single file, so that I don't load partial data by accident.
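Until polars guards against this itself, one user-side workaround is to expand the pattern up front and pass the explicit file list, so no match can be silently dropped. A minimal sketch for local paths using the stdlib glob module (strict_glob is a hypothetical helper, not a polars API):

```python
import glob


def strict_glob(pattern: str) -> list[str]:
    """Expand a glob pattern, failing loudly when nothing matches.

    Hypothetical helper: returning the full, sorted match list makes it
    impossible to read only the first file by accident.
    """
    matches = sorted(glob.glob(pattern))
    if not matches:
        raise FileNotFoundError(f"no files match {pattern!r}")
    return matches
```

Each path in the returned list can then be read and concatenated explicitly, instead of handing the raw pattern to a reader whose glob behavior is uncertain.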

bneijt avatar Dec 20 '22 09:12 bneijt

I think this is related to fsspec. Locally we read all matches of the glob pattern.

ritchie46 avatar Dec 20 '22 10:12 ritchie46

It looks like fsspec.open returns only the first file when given a glob pattern (a path containing *). The proper function for handling glob patterns is fsspec.open_files, which returns a list of files instead.

I think (without testing it!) the problem might be this if-condition in polars.internals.io._prepare_file_args: it should exclude glob patterns in the filename (e.g. "*" not in file) so that the next if-condition handles them instead.
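The routing check being proposed can be sketched independently of polars internals; looks_like_glob below is a hypothetical helper illustrating the idea, not the actual code in _prepare_file_args:

```python
def looks_like_glob(path: str) -> bool:
    """Return True if the path contains glob metacharacters and should
    therefore be routed to fsspec.open_files rather than fsspec.open."""
    return any(ch in path for ch in "*?[")
```

A path failing this check could keep the current fsspec.open code path, while anything matching it would go through open_files so that all matches are read.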

If you want, I can open a PR for this.

JMrziglod avatar Feb 16 '23 13:02 JMrziglod

I also ran into this problem. I have created a colab notebook that highlights it and compares the expected behavior with dask: https://colab.research.google.com/drive/1vosWKSZlBun9W8yI9pO3QI5Mww6HNSHi?usp=sharing

I hope that helps with debugging the problem.

tocab avatar Mar 06 '23 09:03 tocab

Same problem here. My current workaround is to use a pyarrow dataset or something like the code below, but I would prefer a polars-native solution.

import s3fs
import polars as pl

fs = s3fs.S3FileSystem()
files = fs.glob("path/to/files/*")

df = pl.concat(
    pl.collect_all(
        [pl.scan_parquet(f).filter(...).with_columns(...) for f in files]
    )
)

legout avatar Mar 10 '23 11:03 legout

> It looks like fsspec.open returns only the first file when given a glob pattern (a path containing *). The proper function for handling glob patterns is fsspec.open_files, which returns a list of files instead.
>
> I think (without testing it!) the problem might be this if-condition in polars.internals.io._prepare_file_args: it should exclude glob patterns in the filename (e.g. "*" not in file) so that the next if-condition handles them instead.
>
> If you want, I can open a PR for this.

It seems that there is no open_files in s3fs (the fsspec S3 implementation), which is what polars uses.

legout avatar Mar 10 '23 11:03 legout

I ran into this issue when dealing with a Hive dataset and ended up publishing a one-class package with some of the workaround code to deal with Hive-flavor partitioning in polars.

The best approach I could find for iterating through partitions was to use pyarrow dataset fragment expressions.

Another approach is to use fsspec's ls on top of the S3 driver and load the resulting files into a dataframe. For a Hive-flavor dataset, you will have to make sure you add the partition column values after loading:

import fsspec

# self.location and partition_location come from the surrounding class
fs = fsspec.filesystem(self.location.scheme)
parquet_files = [
    fragment_location
    for fragment_location in fs.ls(partition_location, detail=False)
    if fragment_location.endswith(".parquet")
]
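For the Hive-flavor case, the partition column values encoded in the path can be recovered by parsing the key=value segments. A small sketch (hive_partition_values is a hypothetical name, not part of any library mentioned above):

```python
def hive_partition_values(path: str) -> dict:
    """Parse Hive-style partition segments from a file path, e.g.
    's3://bucket/t/year=2023/month=03/part-0.parquet'
    yields {'year': '2023', 'month': '03'}."""
    values = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        # Keep only real key=value directory segments, not data files.
        if sep and key and not segment.endswith(".parquet"):
            values[key] = value
    return values
```

These values can then be attached as literal columns after the fragment's data is loaded.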

bneijt avatar Mar 10 '23 15:03 bneijt

I hit this too. FWIW, I think this is probably a bug rather than an enhancement...

PaulRudin avatar Mar 30 '23 11:03 PaulRudin

Closed by #10098

cjackal avatar Jul 30 '23 15:07 cjackal