polars
scan_parquet returns ComputeError if there are no parquet files
Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
First create an empty hive-partitioned directory:

```shell
mkdir -p "/tmp/foo/bar=bat"
```

Then execute:

```python
import polars as pl

pl.scan_parquet("/tmp/foo/**/*.parquet", hive_partitioning=True, rechunk=False).collect()
```
Log output
No response
Issue description
When I upgraded to Polars 1.5.0 recently, I found a `scan_parquet` behavior that is quite painful and slow compared to pyarrow.

We use `scan_parquet` over deeply nested hive-partitioned folder structures. Sometimes there may be no parquet file anywhere in such a structure. In Polars 0.20.x (0.20.19 for sure, maybe even 0.20.31), this caused `collect()` to silently return an empty DataFrame. With Polars 1.5.0, I get a `ComputeError` exception instead. I'm actually reading a bunch of such directories, concatenating the LazyFrames, and doing a single `collect()` at the very end. As a consequence, 1.5.0 raises a `ComputeError` and fails the entire `collect()`.

pyarrow returns an empty Table (in 45 µs for one such very deeply nested folder with 1300 subfolders). If I attempt to use `collect_schema()` as a way to catch this before adding the frame to the list of LazyFrames to concat, it takes 16 ms on the same deeply nested folder. If I do `scan_pyarrow_dataset().collect_schema()`, it takes 1.47 µs.
Expected behavior
The `scan_parquet` call should return an empty DataFrame for that LazyFrame. If multiple LazyFrames are being concatenated via `pl.concat`, then `collect()` on the concatenated LazyFrames should ignore the empty DataFrame; otherwise you'll error out with a column or schema mismatch.
Installed versions
```
--------Version info---------
Polars: 1.4.1
Index type: UInt32
Platform: Linux-6.5.0-45-generic-x86_64-with-glibc2.35
Python: 3.11.6 (main, Oct 13 2023, 14:12:02) [GCC 11.4.0]
----Optional dependencies----
adbc_driver_manager: <not installed>
cloudpickle: <not installed>
connectorx: <not installed>
deltalake: <not installed>
fastexcel: <not installed>
fsspec: <not installed>
gevent: <not installed>
great_tables: <not installed>
hvplot: <not installed>
matplotlib: <not installed>
nest_asyncio: <not installed>
numpy: 1.26.3
openpyxl: <not installed>
pandas: 2.2.2
pyarrow: 16.1.0
pydantic: 2.6.0
pyiceberg: <not installed>
sqlalchemy: 2.0.25
torch: <not installed>
xlsx2csv: <not installed>
xlsxwriter: <not installed>
```