
read_parquet fails for certain filenames.

Open henrycharlesworth opened this issue 1 year ago • 3 comments

Checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

I was automatically generating parquet files and reading them with Polars, and I noticed issues loading certain files. It turns out they had some quite strange names that Polars was struggling with. When I renamed them, I was able to load them fine.

One example:

a file called 'type..hash.[36]string.parquet'

df = pl.read_parquet("type..hash.[36]string.parquet")

Log output

<module 'polars' from '/home/henry/anaconda3/envs/binnet/lib/python3.10/site-packages/polars/__init__.py'>
>>> dfp = pl.read_parquet("type..hash.[36]string.parquet")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/henry/anaconda3/envs/binnet/lib/python3.10/site-packages/polars/io/parquet/functions.py", line 131, in read_parquet
    return pl.DataFrame._read_parquet(
  File "/home/henry/anaconda3/envs/binnet/lib/python3.10/site-packages/polars/dataframe/frame.py", line 851, in _read_parquet
    return scan.collect()
  File "/home/henry/anaconda3/envs/binnet/lib/python3.10/site-packages/polars/utils/deprecation.py", line 100, in wrapper
    return function(*args, **kwargs)
  File "/home/henry/anaconda3/envs/binnet/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1788, in collect
    return wrap_df(ldf.collect())
polars.exceptions.ComputeError: ComputeError: ComputeError: [... "ComputeError:" repeated ~120 times ...] RecursionError: maximum recursion depth exceeded in comparison

Issue description

Obviously this isn't a good filename and I will try to ensure the files I generate aren't like this, but it should still work (the same file loads fine with fastparquet).

Expected behavior

Dataframe should load.

Installed versions

--------Version info---------
Polars:              0.19.15
Index type:          UInt32
Platform:            Linux-6.2.0-37-generic-x86_64-with-glibc2.35
Python:              3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         3.0.0
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              2023.10.0
gevent:              <not installed>
matplotlib:          3.8.1
numpy:               1.26.1
openpyxl:            <not installed>
pandas:              2.1.3
pyarrow:             14.0.1
pydantic:            1.10.13
pyiceberg:           <not installed>
pyxlsb:              <not installed>
sqlalchemy:          1.4.50
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>

henrycharlesworth avatar Dec 03 '23 10:12 henrycharlesworth

Yeah, the [] is valid "shell glob" syntax: https://github.com/pola-rs/polars/issues/10106

It seems globbing by default is the more desired behaviour, but perhaps there needs to be a way to disable it?
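To illustrate the point above: Python's own glob machinery treats `[36]` as a character class matching a single `3` or `6`, so a literal filename containing brackets never matches itself. A minimal, self-contained demonstration (using a temporary directory rather than a real parquet file):

```python
import glob
import os
import tempfile

# Create a file whose name contains glob metacharacters.
d = tempfile.mkdtemp()
path = os.path.join(d, "type..hash.[36]string.parquet")
open(path, "w").close()

# The raw path matches nothing: "[36]" is read as a character class,
# so the pattern would only match "...hash.3string.parquet" etc.
print(glob.glob(path))               # -> []

# Escaping the metacharacters makes the literal file match again.
print(glob.glob(glob.escape(path)))  # -> [path]
```

This is why a `glob` flag (or internal escaping) is needed: without it, any path containing `[`, `]`, `*`, or `?` silently fails to resolve.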

cmdlineluser avatar Dec 03 '23 13:12 cmdlineluser

Yes, agree that we should have a glob flag.

Also interesting recursion going on in the error message? :thinking:

ritchie46 avatar Dec 11 '23 18:12 ritchie46

I looked into the recursion. It seems to happen via scan_parquet, when this if statement is true:

    # try fsspec scanner
    if (
        can_use_fsspec
        and not _is_local_file(source)  # type: ignore[arg-type]
        and not _is_supported_cloud(source)  # type: ignore[arg-type]
    ):
        scan = _scan_parquet_fsspec(source, storage_options)  # type: ignore[arg-type]
        if n_rows:
            scan = scan.head(n_rows)
        if row_count_name is not None:
            scan = scan.with_row_count(row_count_name, row_count_offset)
        return scan  # type: ignore[return-value]

The recursion happens as follows:

read_parquet
→ _read_parquet
→ scan_parquet
→ _scan_parquet
→ _scan_parquet_fsspec
→ _scan_parquet_impl
→ read_parquet

and then we are full circle.
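The cycle described above can be sketched in miniature (hypothetical, heavily simplified function names standing in for the real polars internals): the fallback path re-enters the entry point with the same unresolvable source, so nothing ever terminates and Python eventually raises RecursionError.

```python
def read_parquet(source):
    # Simplified stand-in: assume the source looks non-local and
    # non-cloud, so we always take the fsspec fallback branch.
    return scan_parquet_fsspec(source)

def scan_parquet_fsspec(source):
    # The implementation delegates back to read_parquet on the same
    # source -- full circle, with no base case to stop the loop.
    return read_parquet(source)

try:
    read_parquet("type..hash.[36]string.parquet")
except RecursionError as exc:
    print(type(exc).__name__)  # -> RecursionError
```

The long chain of nested `ComputeError:` prefixes in the log output is exactly this loop unwinding.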

@stinodego this is also a bit related to #13040, since this issue also exists for the other file types (csv, ipc, etc.).

On top of that, there is also a bug in the _is_local_file function, which is what makes this recursion bug pop up.

So IMO three things need to happen:

  • Fix the recursion.

  • Fix the bug in _is_local_file: if a bracket '[' is present in the file name, it will always return False (even though the file exists locally), since glob.iglob is used, which can't handle that special character unless it is escaped (which doesn't happen right now).

  • Add a glob flag (there is already a PR open for this).

I can do this all in the existing PR, or do you prefer that I split them up?

romanovacca avatar Dec 21 '23 16:12 romanovacca

Ideally those would be separate PRs. Keep 'em small!

stinodego avatar Jan 12 '24 11:01 stinodego

I'm still running into this issue with gs:// cloud paths that have brackets in them, even with glob=False:

df = pl.read_parquet("gs://foo/[bar]/baz.parquet", glob=False)

mihirsamdarshi avatar Jul 23 '24 07:07 mihirsamdarshi

@mihirsamdarshi can you open an issue with all information, Then we can pick it up.

ritchie46 avatar Jul 23 '24 08:07 ritchie46