scan_parquet returns ComputeError if there are no parquet files

Open ddutt opened this issue 6 months ago • 5 comments

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

First create an empty hive-partitioned dir:

mkdir -p "/tmp/foo/bar=bat"

Then execute:

import polars as pl
pl.scan_parquet('/tmp/foo/**/*.parquet', hive_partitioning=True, rechunk=False).collect()

Log output

No response

Issue description

When I upgraded to Polars 1.5.0 recently, I found a scan_parquet behavior that is quite painful, and slow compared to pyarrow.

We use scan_parquet over deeply nested hive-partitioned folder structures. Sometimes there may be no parquet file anywhere in such a structure. In Polars 0.20.x (.19 for sure, maybe even .31), collect() silently returned an empty dataframe in this case. With Polars 1.5.0, I get a ComputeError exception. I'm actually reading a bunch of such dirs, concatenating the lazyframes, and doing a collect at the very end. As a consequence, 1.5.0 errors out with ComputeError, failing the entire collect().

pyarrow returns an empty Table (in 45 us for one such very deeply nested folder with 1300 subfolders). If I use collect_schema() to catch this before adding the lazyframe to the list of lazyframes to concat, it takes 16 ms on the same deeply nested folder. If I do scan_pyarrow_dataset().collect_schema(), it takes 1.47 us.
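For illustration, here is a minimal sketch of another way to pre-check each directory before building the list of lazyframes: look for a matching file with the standard library instead of calling collect_schema(). The helper names has_parquet_files and scan_non_empty are hypothetical (not part of Polars), and I have not timed this against the numbers above. It also assumes that Python's recursive glob matches the same files Polars would for the pattern.

```python
import glob
from typing import Iterable


def has_parquet_files(pattern: str) -> bool:
    """Return True if at least one path matches the glob pattern."""
    # iglob is lazy, so this stops at the first match instead of listing everything.
    return next(glob.iglob(pattern, recursive=True), None) is not None


def scan_non_empty(patterns: Iterable[str]):
    """Concat lazyframes only for patterns that match at least one file.

    Returns None when no pattern matches anything (hypothetical convention).
    """
    # Imported here so the file check above has no Polars dependency.
    import polars as pl

    frames = [
        pl.scan_parquet(p, hive_partitioning=True, rechunk=False)
        for p in patterns
        if has_parquet_files(p)
    ]
    # pl.concat rejects an empty list, so guard against that case.
    return pl.concat(frames) if frames else None
```

With the reproducer above, has_parquet_files('/tmp/foo/**/*.parquet') is False for the empty directory, so that pattern is skipped and never reaches collect().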

Expected behavior

The scan_parquet call should return an empty dataframe for that lazyframe. If multiple lazyframes are being concatenated via pl.concat, then collect on the concatenated lazyframe should ignore the empty dataframe; otherwise it will error out with a column or schema mismatch.

Installed versions

--------Version info---------
Polars:               1.4.1
Index type:           UInt32
Platform:             Linux-6.5.0-45-generic-x86_64-with-glibc2.35
Python:               3.11.6 (main, Oct 13 2023, 14:12:02) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                1.26.3
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             2.6.0
pyiceberg:            <not installed>
sqlalchemy:           2.0.25
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

ddutt • Aug 27 '24 01:08