PanicException on reading Parquet file from S3
Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
parquet_file_name='s3://test-inputs-dataset/GAME_INFO/server_id=3345/date=2023-03-01/hour=06/GAME_INFO_3345_2023-03-01_06.parquet'
df = pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=False)
# or
df = pl.scan_parquet(parquet_file_name).collect()
# This works fine: df = pl.scan_parquet(parquet_file_name)
Log output
PanicException Traceback (most recent call last)
File c:\Projects\Proj\scripts\main.py:1
----> 1 df = pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=False) # scan_parquet # n_rows=10,
File c:\Users\jm\.conda\envs\env\Lib\site-packages\polars\_utils\deprecation.py:91, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
86 @wraps(function)
87 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
88 _rename_keyword_argument(
89 old_name, new_name, kwargs, function.__qualname__, version
90 )
---> 91 return function(*args, **kwargs)
File c:\Users\jm\.conda\envs\env\Lib\site-packages\polars\_utils\deprecation.py:91, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
86 @wraps(function)
87 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
88 _rename_keyword_argument(
89 old_name, new_name, kwargs, function.__qualname__, version
90 )
---> 91 return function(*args, **kwargs)
File c:\Users\jm\.conda\envs\env\Lib\site-packages\polars\io\parquet\functions.py:206, in read_parquet(source, columns, n_rows, row_index_name, row_index_offset, parallel, use_statistics, hive_partitioning, glob, hive_schema, try_parse_hive_dates, rechunk, low_memory, storage_options, retries, use_pyarrow, pyarrow_options, memory_map)
203 else:
204 lf = lf.select(columns)
--> 206 return lf.collect()
...
1939 # Only for testing purposes atm.
1940 callback = _kwargs.get("post_opt_callback")
-> 1942 return wrap_df(ldf.collect(callback))
PanicException: called `Result::unwrap()` on an `Err` value: InvalidHeaderValue
Issue description
Running read_parquet with use_pyarrow=False raises a PanicException.
I noticed that the call below works fine when I add use_pyarrow=True, but it seems very slow:
df = pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=True)
The file that I am reading is stored in S3, and the S3 file path (parquet_file_name) has no spaces in it. If I download the Parquet file locally and open it from local disk, Polars does not raise any issues.
Also note that before I upgraded Polars to 1.2.1, I was using version 0.17 and read_parquet did not raise any issues!
Expected behavior
Dataframe read without errors
Installed versions
--------Version info---------
Polars:              1.2.1
Index type:          UInt32
Platform:            Windows-10-10.0.22621-SP0
Python:              3.11.9 | packaged by Anaconda, Inc. | (main, Apr 19 2024, 16:40:41) [MSC v.1916 64 bit (AMD64)]
----Optional dependencies----
adbc_driver_manager:
You should edit the title to mention that the issue is about reading from cloud storage. Please run this and post the full output; maybe that'll help.
import os
os.environ['RUST_BACKTRACE'] = 'full'
import polars as pl
parquet_file_name=...
df = pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=False)
The team treats all panic errors as bugs, but it's likely that you need to set storage_options.
See here https://docs.pola.rs/api/python/stable/reference/api/polars.scan_parquet.html#polars-scan-parquet
and here
https://docs.rs/object_store/latest/object_store/aws/enum.AmazonS3ConfigKey.html
The environment variables that pyarrow (via fsspec) looks for aren't 100% in sync with the ones Polars (via object_store) looks for, so that's probably how to fix the issue.
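For example, a minimal sketch of what that might look like with object_store-style key names (the bucket path and region below are placeholders, not taken from this issue):

import os
import polars as pl

parquet_file_name = "s3://some-bucket/some-file.parquet"  # placeholder path
df = pl.read_parquet(
    parquet_file_name,
    use_pyarrow=False,
    storage_options={
        "aws_access_key_id": os.environ["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": os.environ["AWS_SECRET_ACCESS_KEY"],
        "aws_region": "us-east-1",  # replace with your bucket's region
    },
)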
Updated as suggested
As per your suggestion to use storage_options:
# WORKS
df = pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=False,
                     storage_options={
                         "aws_access_key_id": os.environ.get('AWS_ACCESS_KEY_ID'),
                         "aws_secret_access_key": os.environ.get('AWS_SECRET_ACCESS_KEY'),
                         "aws_region": AWS_REGION})

# DOES NOT WORK
df = pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=True,
                     storage_options={
                         "aws_access_key_id": os.environ.get('AWS_ACCESS_KEY_ID'),
                         "aws_secret_access_key": os.environ.get('AWS_SECRET_ACCESS_KEY'),
                         "aws_region": AWS_REGION})
The error message is
TypeError: AioSession.__init__() got an unexpected keyword argument 'aws_access_key_id'
The new error is because there isn't parity between the key names fsspec expects and the key names object_store expects. If you set your AWS_REGION as an env var, does pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=False) work? I don't use S3, so I'm just guessing at that.
From here it looks like you should set the environment variable AWS_DEFAULT_REGION to whatever it should be, and then I think that pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=False) would work.
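Roughly like this, assuming the access key and secret key are already exported as AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY (the region and path below are placeholders):

import os
import polars as pl

os.environ["AWS_DEFAULT_REGION"] = "us-east-1"  # placeholder region

parquet_file_name = "s3://some-bucket/some-file.parquet"  # placeholder path
df = pl.read_parquet(parquet_file_name, use_pyarrow=False)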
If you want a minimal reproduction of the panic:
import polars as pl
pl.scan_parquet("s3://foobar/file.parquet").collect()
> The new error is because there isn't parity between the key names fsspec expects and the key names object_store expects.
@deanm0000 I'm running into this same issue. If I try a CredentialProvider, I end up with "Access Denied" errors from s3fs. Without use_pyarrow I get "File out of specification: Page content does not align with expected element size", so it doesn't seem I have other options.
Are there plans to fix the fsspec and object_store disparities?
I was able to get this working by changing https://github.com/pola-rs/polars/blob/main/py-polars/polars/io/_utils.py#L229-L230 from
storage_options["encoding"] = encoding
return fsspec.open(file, **storage_options)
to
# storage_options["encoding"] = encoding -- commenting out just to indicate deletion
return fsspec.open(file, encoding=encoding, client_kwargs=storage_options)
I can make a PR but I don't know enough about fsspec to understand the broader impacts here beyond AWS S3. My gut tells me that this could break other cloud resources and there may need to be an AWS check and instead do something like:
storage_options = {"encoding": encoding, client_kwargs=storage_options}
Would love to hear from someone who knows better 😄
Until this can be handled in the background, I've found that this works without needing to change polars code:
storage_options={
    "client_kwargs": {
        "aws_access_key_id": credentials.access_key,
        "aws_secret_access_key": credentials.secret_key,
        "aws_session_token": credentials.token,
        "region_name": session.region_name,
    }
}
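For completeness, a sketch of how that workaround can be wired up end to end, assuming the credentials come from a boto3 session and the fsspec/s3fs path is used (boto3 is not used elsewhere in this thread, so treat that part as an assumption):

import boto3
import polars as pl

# Assumed setup: a boto3 session supplies the credentials (illustrative only).
session = boto3.Session()
credentials = session.get_credentials().get_frozen_credentials()

df = pl.read_parquet(
    "s3://some-bucket/some-file.parquet",  # placeholder path
    use_pyarrow=True,
    storage_options={
        "client_kwargs": {
            "aws_access_key_id": credentials.access_key,
            "aws_secret_access_key": credentials.secret_key,
            "aws_session_token": credentials.token,
            "region_name": session.region_name,
        }
    },
)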
- Closing as the original issue should be fixed on the latest release (1.21.0) by https://github.com/pola-rs/polars/pull/20820.
@lcorcodilos, could you open a separate issue for the problem you are encountering?