
scan_parquet still cannot read hive-partitioned parquet files

Open lmocsi opened this issue 2 years ago • 3 comments

Checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import pyarrow.dataset as ds
import s3fs

buck = 'mybucket'
parq = "path/to/my/partitioned/file"

df = pl.scan_parquet(f"s3://{buck}/{parq}", hive_partitioning=True) 

Log output

ComputeError: Object at location path/to/my/partitioned/file not found: Client error with status 404 Not Found: No Body

Issue description

The Polars documentation (https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_parquet.html) states that scan_parquet() can read hive-partitioned parquet files on AWS, but it cannot: the code above throws an error, while the legacy method works fine:

buck = 'mybucket'
parq = "path/to/my/partitioned/file"

myds = ds.dataset(f"{buck}/{parq}", filesystem=s3fs.S3FileSystem(), partitioning='hive')
df = pl.scan_pyarrow_dataset(myds)

Expected behavior

Scan the parquet file correctly.

Installed versions

--------Version info---------
Polars:               0.19.17
Index type:           UInt32
Platform:             Linux-5.10.198-187.748.amzn2.x86_64-x86_64-with-glibc2.26
Python:               3.9.18 | packaged by conda-forge | (main, Aug 30 2023, 03:49:32) 
[GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               2023.10.0
gevent:               <not installed>
matplotlib:           <not installed>
numpy:                1.26.0
openpyxl:             <not installed>
pandas:               2.1.1
pyarrow:              14.0.1
pydantic:             2.3.0
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.21
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

lmocsi, Nov 29 '23 15:11

I was having a similar issue. Is your path a glob pattern?

I had to change from path/to/my/file to path/to/my/file/*/*.parquet, where the first * covers the partition directory.
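
For reference, a minimal sketch of that workaround, assuming one `*` per partition directory level and that S3 credentials are already available in the environment. `hive_glob` and `scan_hive` are hypothetical helper names for illustration, not part of Polars, and the bucket and path are the placeholders from the report above.

```python
def hive_glob(base: str, partition_levels: int = 1) -> str:
    """Build a glob with one '*' per partition directory level.

    Hypothetical helper, not part of Polars.
    """
    parts = [base.rstrip("/")] + ["*"] * partition_levels + ["*.parquet"]
    return "/".join(parts)


def scan_hive(base: str, partition_levels: int = 1):
    """Lazily scan a hive-partitioned dataset through an explicit glob."""
    import polars as pl  # imported here so hive_glob works without Polars

    # hive_partitioning=True asks Polars to derive columns from the
    # key=value directory names matched by the glob.
    return pl.scan_parquet(
        hive_glob(base, partition_levels), hive_partitioning=True
    )


if __name__ == "__main__":
    # Placeholder bucket and path from the original report.
    lf = scan_hive("s3://mybucket/path/to/my/partitioned/file")
    print(lf.collect().head())
```

If the dataset is partitioned on more than one key, pass a larger `partition_levels` so the glob descends through every partition directory.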

j-hartshorn, Nov 29 '23 17:11

Maybe if this enhancement request (https://github.com/pola-rs/polars/issues/14342) were fulfilled, it would help with this issue as well.

lmocsi, Feb 22 '24 15:02

It would also be very helpful to hear more about where these functions are headed. (Hive-)partitioned multi-file parquet datasets are incredibly useful for splitting large-ish data across multiple files using a convention the Arrow tooling understands. Is the intent in Polars that I should eventually scan such datasets with scan_parquet, with scan_pyarrow_dataset, or that both will continue to be recommended for different use cases (e.g. implicit vs. explicit file discovery, which currently seems to be the difference)?

Thanks

tmct, Feb 22 '24 22:02