
read_parquet fails for certain filenames.

Open henrycharlesworth opened this issue 1 year ago • 3 comments

Checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

I was automatically generating parquet files and reading them with Polars, and I noticed issues loading certain files. It turns out they had some quite strange names that Polars was struggling with. When I renamed them, I was able to load them fine.

One example:

a file called 'type..hash.[36]string.parquet'

df = pl.read_parquet("type..hash.[36]string.parquet")

Log output

<module 'polars' from '/home/henry/anaconda3/envs/binnet/lib/python3.10/site-packages/polars/__init__.py'>
>>> dfp = pl.read_parquet("type..hash.[36]string.parquet")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/henry/anaconda3/envs/binnet/lib/python3.10/site-packages/polars/io/parquet/functions.py", line 131, in read_parquet
    return pl.DataFrame._read_parquet(
  File "/home/henry/anaconda3/envs/binnet/lib/python3.10/site-packages/polars/dataframe/frame.py", line 851, in _read_parquet
    return scan.collect()
  File "/home/henry/anaconda3/envs/binnet/lib/python3.10/site-packages/polars/utils/deprecation.py", line 100, in wrapper
    return function(*args, **kwargs)
  File "/home/henry/anaconda3/envs/binnet/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1788, in collect
    return wrap_df(ldf.collect())
polars.exceptions.ComputeError: ComputeError: ComputeError: [... "ComputeError:" repeated ~120 times ...] RecursionError: maximum recursion depth exceeded in comparison

Issue description

Obviously this isn't a good filename and I will try to ensure the files I generate aren't like this, but it should still work (the same file loads fine with fastparquet).

Expected behavior

Dataframe should load.

Installed versions

--------Version info---------
Polars:              0.19.15
Index type:          UInt32
Platform:            Linux-6.2.0-37-generic-x86_64-with-glibc2.35
Python:              3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         3.0.0
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              2023.10.0
gevent:              <not installed>
matplotlib:          3.8.1
numpy:               1.26.1
openpyxl:            <not installed>
pandas:              2.1.3
pyarrow:             14.0.1
pydantic:            1.10.13
pyiceberg:           <not installed>
pyxlsb:              <not installed>
sqlalchemy:          1.4.50
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>

henrycharlesworth avatar Dec 03 '23 10:12 henrycharlesworth

Yeah, the [] is valid "shell glob" syntax: https://github.com/pola-rs/polars/issues/10106

It seems globbing by default is the more desired behaviour, but perhaps there needs to be a way to disable it?
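To illustrate the point above: Python's own glob machinery treats `[36]` as a character class matching a single `3` or `6`, so a literal filename containing brackets never matches itself. A minimal, self-contained demonstration (using a temporary directory rather than a real parquet file):

```python
import glob
import os
import tempfile

# Create a file whose name contains glob metacharacters.
d = tempfile.mkdtemp()
path = os.path.join(d, "type..hash.[36]string.parquet")
open(path, "w").close()

# The raw path matches nothing: "[36]" is read as a character class,
# so the pattern would only match "...hash.3string.parquet" etc.
print(glob.glob(path))               # -> []

# Escaping the metacharacters makes the literal file match again.
print(glob.glob(glob.escape(path)))  # -> [path]
```

This is why a `glob` flag (or internal escaping) is needed: without it, any path containing `[`, `]`, `*`, or `?` silently fails to resolve.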

cmdlineluser avatar Dec 03 '23 13:12 cmdlineluser

Yes, agree that we should have a glob flag.

Also interesting recursion going on in the error message? :thinking:

ritchie46 avatar Dec 11 '23 18:12 ritchie46

I looked into the recursion. It seems to happen via scan_parquet, when this if statement is true:

    # try fsspec scanner
    if (
        can_use_fsspec
        and not _is_local_file(source)  # type: ignore[arg-type]
        and not _is_supported_cloud(source)  # type: ignore[arg-type]
    ):
        scan = _scan_parquet_fsspec(source, storage_options)  # type: ignore[arg-type]
        if n_rows:
            scan = scan.head(n_rows)
        if row_count_name is not None:
            scan = scan.with_row_count(row_count_name, row_count_offset)
        return scan  # type: ignore[return-value]

The recursion happens as follows:

read_parquet
→ _read_parquet
→ scan_parquet
→ _scan_parquet
→ _scan_parquet_fsspec
→ _scan_parquet_impl
→ read_parquet

and then we are full circle.
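The cycle described above can be sketched in miniature (hypothetical, heavily simplified function names standing in for the real polars internals): the fallback path re-enters the entry point with the same unresolvable source, so nothing ever terminates and Python eventually raises RecursionError.

```python
def read_parquet(source):
    # Simplified stand-in: assume the source looks non-local and
    # non-cloud, so we always take the fsspec fallback branch.
    return scan_parquet_fsspec(source)

def scan_parquet_fsspec(source):
    # The implementation delegates back to read_parquet on the same
    # source -- full circle, with no base case to stop the loop.
    return read_parquet(source)

try:
    read_parquet("type..hash.[36]string.parquet")
except RecursionError as exc:
    print(type(exc).__name__)  # -> RecursionError
```

The long chain of nested `ComputeError:` prefixes in the log output is exactly this loop unwinding.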

@stinodego this is also a bit related to #13040, since this issue also exists for the other file types (csv, ipc, etc.).

On top of that, there is also a bug in the _is_local_file function, which is what makes this recursion bug pop up.

So IMO three things need to happen:

  • Fix the recursion.

  • Fix the bug in _is_local_file: if a bracket '[' is present in the file name, it will always return False (even though the file exists locally), since glob.iglob is used, which can't handle that special character unless it is escaped (which doesn't happen right now).

  • Add a glob flag (there is already a PR open for this).

I can do this all in the existing PR, or do you prefer that I split them up?

romanovacca avatar Dec 21 '23 16:12 romanovacca

Ideally those would be separate PRs. Keep 'em small!

stinodego avatar Jan 12 '24 11:01 stinodego

I'm still running into this issue with gs:// cloud paths that have brackets in them, even with glob=False:

df = pl.read_parquet("gs://foo/[bar]/baz.parquet", glob=False)

mihirsamdarshi avatar Jul 23 '24 07:07 mihirsamdarshi

@mihirsamdarshi can you open an issue with all information, Then we can pick it up.

ritchie46 avatar Jul 23 '24 08:07 ritchie46