read_parquet fails for certain filenames.
Checks

- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
I was automatically generating Parquet files and reading them with Polars, and I noticed that certain files failed to load. It turned out they had rather strange names, and Polars was struggling with that. After renaming them, I was able to load them fine.
One example, with a file called `type..hash.[36]string.parquet`:

```python
df = pl.read_parquet("type..hash.[36]string.parquet")
```
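The root cause can be demonstrated with the standard library alone: `[36]` is a glob character class, so pattern-matching functions fail to find the file even though it exists on disk. A minimal sketch, using an empty file in a temporary directory rather than a real Parquet file:

```python
import glob
import os
import tempfile

# Create an empty file whose name contains glob metacharacters,
# mimicking the problematic filename from this report.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "type..hash.[36]string.parquet")
open(path, "wb").close()

print(os.path.exists(path))          # True: the file really exists
print(glob.glob(path))               # []: '[36]' is read as a character class
print(glob.glob(glob.escape(path)))  # escaping the brackets finds the file
```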
Log output
```text
<module 'polars' from '/home/henry/anaconda3/envs/binnet/lib/python3.10/site-packages/polars/__init__.py'>
>>> dfp = pl.read_parquet("type..hash.[36]string.parquet")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/henry/anaconda3/envs/binnet/lib/python3.10/site-packages/polars/io/parquet/functions.py", line 131, in read_parquet
    return pl.DataFrame._read_parquet(
  File "/home/henry/anaconda3/envs/binnet/lib/python3.10/site-packages/polars/dataframe/frame.py", line 851, in _read_parquet
    return scan.collect()
  File "/home/henry/anaconda3/envs/binnet/lib/python3.10/site-packages/polars/utils/deprecation.py", line 100, in wrapper
    return function(*args, **kwargs)
  File "/home/henry/anaconda3/envs/binnet/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1788, in collect
    return wrap_df(ldf.collect())
polars.exceptions.ComputeError: ComputeError: ComputeError: [... "ComputeError:" repeated many times ...] RecursionError: maximum recursion depth exceeded in comparison
```
Issue description
Obviously this isn't a good filename, and I will try to ensure the files I generate aren't like this, but it should still work (loading the same file with fastparquet is fine).
Expected behavior
The DataFrame should load.
Installed versions
--------Version info---------
Polars: 0.19.15
Index type: UInt32
Platform: Linux-6.2.0-37-generic-x86_64-with-glibc2.35
Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
----Optional dependencies----
adbc_driver_sqlite: <not installed>
cloudpickle: 3.0.0
connectorx: <not installed>
deltalake: <not installed>
fsspec: 2023.10.0
gevent: <not installed>
matplotlib: 3.8.1
numpy: 1.26.1
openpyxl: <not installed>
pandas: 2.1.3
pyarrow: 14.0.1
pydantic: 1.10.13
pyiceberg: <not installed>
pyxlsb: <not installed>
sqlalchemy: 1.4.50
xlsx2csv: <not installed>
xlsxwriter: <not installed>
Yeah, the `[]` is valid "shell glob" syntax: https://github.com/pola-rs/polars/issues/10106

It seems globbing by default is the more desired behaviour, but perhaps there needs to be a way to disable it?
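To make the glob semantics concrete: under shell-glob rules (as implemented by the standard library's `fnmatch`, which `glob` builds on), `[36]` matches exactly one character that is either `3` or `6`, so the literal bracketed filename never matches its own pattern:

```python
from fnmatch import fnmatchcase

pattern = "type..hash.[36]string.parquet"

# '[36]' consumes one character from the set {'3', '6'} ...
print(fnmatchcase("type..hash.3string.parquet", pattern))     # True
# ... so the literal bracketed name does not match itself.
print(fnmatchcase("type..hash.[36]string.parquet", pattern))  # False
```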
Yes, I agree that we should have a `glob` flag.

Also, interesting recursion going on in the error message :thinking:
I looked into the recursion. It seems to happen via `scan_parquet`, when this `if` statement is true:
```python
# try fsspec scanner
if (
    can_use_fsspec
    and not _is_local_file(source)  # type: ignore[arg-type]
    and not _is_supported_cloud(source)  # type: ignore[arg-type]
):
    scan = _scan_parquet_fsspec(source, storage_options)  # type: ignore[arg-type]
    if n_rows:
        scan = scan.head(n_rows)
    if row_count_name is not None:
        scan = scan.with_row_count(row_count_name, row_count_offset)
    return scan  # type: ignore[return-value]
```
The recursion happens as follows. It starts at:

```text
read_parquet
-> _read_parquet
-> scan_parquet
-> _scan_parquet
-> _scan_parquet_fsspec
-> _scan_parquet_impl
-> read_parquet
```

and then we are full circle.
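The cycle above can be boiled down to a toy sketch (hypothetical stand-in functions, not the actual Polars code): as long as the local-file check keeps returning `False` for the bracketed name, the fsspec fallback hands the same path straight back to `read_parquet`:

```python
def _is_local_file(source: str) -> bool:
    # Stand-in for the buggy check: a '[' in the name makes the real
    # glob-based check return False even for an existing local file.
    return False

def read_parquet(source: str):
    if _is_local_file(source):
        return "dataframe"              # never reached for bracketed names
    return _scan_parquet_fsspec(source)

def _scan_parquet_fsspec(source: str):
    # ...would open the file via fsspec, then hand it back to read_parquet,
    # completing the circle until Python raises RecursionError.
    return read_parquet(source)
```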
@stinodego this is also somewhat related to #13040, since this issue also exists for the other formats (CSV, IPC, etc.).

Next to that, there is also a bug in the `_is_local_file` function, which makes this recursion bug pop up.
So IMO three things need to happen:

- Fix the recursion.
- Fix the bug in `_is_local_file`: if a bracket `[` is present in the file name, it will always return `False` (even though the file exists locally), since `glob.iglob` is used, which can't handle that special character unless it's escaped (which doesn't happen right now).
- Add a `glob` flag (a PR is already open for this).
I can do all of this in the existing PR, or do you prefer that I split them up?
Ideally those would be separate PRs. Keep 'em small!
I'm still running into this issue with `gs://` cloud paths that have brackets in the path name, even with `glob=False`:

```python
df = pl.read_parquet("gs://foo/[bar]/baz.parquet", glob=False)
```
@mihirsamdarshi, can you open an issue with all the information? Then we can pick it up.