scan_parquet still cannot read hive-partitioned parquet files
Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
import polars as pl
import pyarrow.dataset as ds  # used only in the legacy workaround below
import s3fs  # used only in the legacy workaround below

buck = 'mybucket'
parq = "path/to/my/partitioned/file"

df = pl.scan_parquet(f"s3://{buck}/{parq}", hive_partitioning=True)
Log output
ComputeError: Object at location path/to/my/partitioned/file not found: Client error with status 404 Not Found: No Body
Issue description
The Polars documentation (https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_parquet.html) states that scan_parquet() can read hive-partitioned parquet files on AWS, but it cannot: the code above throws the error shown, while the legacy pyarrow-dataset method works fine:
buck = 'mybucket'
parq = "path/to/my/partitioned/file"

myds = ds.dataset(f"{buck}/{parq}", filesystem=s3fs.S3FileSystem(), partitioning='hive')
df = pl.scan_pyarrow_dataset(myds)
Expected behavior
scan_parquet should read the hive-partitioned parquet dataset correctly.
Installed versions
--------Version info---------
Polars: 0.19.17
Index type: UInt32
Platform: Linux-5.10.198-187.748.amzn2.x86_64-x86_64-with-glibc2.26
Python: 3.9.18 | packaged by conda-forge | (main, Aug 30 2023, 03:49:32)
[GCC 12.3.0]
----Optional dependencies----
adbc_driver_manager: <not installed>
cloudpickle: <not installed>
connectorx: <not installed>
deltalake: <not installed>
fsspec: 2023.10.0
gevent: <not installed>
matplotlib: <not installed>
numpy: 1.26.0
openpyxl: <not installed>
pandas: 2.1.1
pyarrow: 14.0.1
pydantic: 2.3.0
pyiceberg: <not installed>
pyxlsb: <not installed>
sqlalchemy: 2.0.21
xlsx2csv: <not installed>
xlsxwriter: <not installed>
I was having a similar issue; is your path a glob pattern? I had to change from
path/to/my/file
to
path/to/my/file/*/*.parquet
where the first * covers the partition directory (see the sketch below).
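
For reference, here is a minimal sketch of that glob-based workaround, assuming a single hive partition level and reusing the hypothetical bucket/prefix names from the report above:

```python
import polars as pl

# Hypothetical bucket and prefix, matching the example in the report.
buck = "mybucket"
parq = "path/to/my/partitioned/file"

# The first * matches the hive partition directories (key=value),
# the second * matches the parquet files inside each partition.
df = pl.scan_parquet(
    f"s3://{buck}/{parq}/*/*.parquet",
    hive_partitioning=True,
)
```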
Maybe if this enhancement request (https://github.com/pola-rs/polars/issues/14342) were fulfilled, it would help with this issue as well.
It would also be very helpful to hear more about where these functions are headed. (Hive-)partitioned multi-parquet-file datasets are incredibly useful for splitting large-ish data across multiple files using a convention the Arrow tooling understands. Is the intent in Polars that I should eventually scan such datasets with scan_parquet, with scan_pyarrow_dataset, or will both continue to be recommended for different use cases (e.g. implicit vs. explicit file discovery, as currently seems to be the difference)?
Thanks