Performance degradation using Rust-native parquet reader from AWS S3 for a dataframe with 12,000 columns.
Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
import boto3
import os
import numpy as np
import polars as pl
import pyarrow.dataset as ds
import pyarrow.fs as fs
import pyarrow.parquet as pq

# create and store dataframe on AWS S3
large_num_cols = 12_000
df = pl.DataFrame({f"col_{i}": np.random.normal(loc=0.0, scale=1.0, size=24_000) for i in range(large_num_cols)})
os.environ["AWS_PROFILE"] = "my_profile"
pyfs = fs.S3FileSystem(retry_strategy=fs.AwsStandardS3RetryStrategy(max_attempts=10))
df.write_parquet(
    file="parquet-writers-test/test_large_df.pq",
    compression="lz4",
    use_pyarrow=True,
    pyarrow_options={"filesystem": pyfs},
)

# read parquet with PyArrow – around 6 seconds
df1 = pl.read_parquet(
    source="parquet-writers-test/test_large_df.pq",
    use_pyarrow=True,
    pyarrow_options={"filesystem": pyfs},
)

# read parquet with Rust-native – around 50 seconds
session = boto3.Session(profile_name="my_profile")
credentials = session.get_credentials().get_frozen_credentials()
storage_options = {
    "aws_access_key_id": credentials.access_key,
    "aws_secret_access_key": credentials.secret_key,
    "aws_session_token": credentials.token,
    "aws_region": session.region_name,
}
source = "s3://parquet-writers-test/test_large_df.pq"
df2 = pl.read_parquet(source, storage_options=storage_options)

# scan pyarrow dataset with PyArrow – around 6 seconds
pyds = ds.dataset(
    source="parquet-writers-test/test_large_df.pq",
    filesystem=pyfs,
    format="parquet",
)
df3 = pl.scan_pyarrow_dataset(pyds).collect()

# scan parquet with Rust-native – around 100 seconds
df4 = pl.scan_parquet(source, storage_options=storage_options).collect()
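
For reference, a minimal timing harness along the following lines can reproduce the wall-clock numbers quoted here (the timed helper is hypothetical and not part of the original report; source, pyfs, and storage_options are as defined above):

import time

def timed(label, fn):
    # measure wall-clock time of a single call
    t0 = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - t0:.1f} s")
    return result

timed("read_parquet (PyArrow)", lambda: pl.read_parquet(
    source="parquet-writers-test/test_large_df.pq",
    use_pyarrow=True,
    pyarrow_options={"filesystem": pyfs},
))
timed("scan_parquet (Rust-native) + collect", lambda: pl.scan_parquet(
    source, storage_options=storage_options
).collect())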
Log output
No response
Issue description
We observed a significant slowdown when reading a parquet file with 12,000 columns from an AWS S3 bucket using the Rust-native parquet reader, compared to the PyArrow implementation:
- read_parquet (with PyArrow filesystem) – around 6 seconds
- read_parquet (with Rust-native) – around 50 seconds
- scan_pyarrow_dataset (with PyArrow filesystem) – around 6 seconds
- scan_parquet (with Rust-native) – around 100 seconds
Expected behavior
This would be less of an issue if scan_parquet allowed using a PyArrow filesystem, similar to read_parquet and scan_pyarrow_dataset.
Installed versions
--------Version info---------
Polars: 1.5.0
Index type: UInt32
Platform: Linux-4.18.0-513.18.1.el8_9.x86_64-x86_64-with-glibc2.28
Python: 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
----Optional dependencies----
adbc_driver_manager: <not installed>
cloudpickle: 2.2.1
connectorx: <not installed>
deltalake: 0.18.2
fastexcel: <not installed>
fsspec: 2024.6.1
gevent: <not installed>
great_tables: 0.10.0
hvplot: 0.10.0
matplotlib: 3.9.1
nest_asyncio: 1.6.0
numpy: 1.24.4
openpyxl: 3.1.5
pandas: 2.1.4
pyarrow: 16.1.0
pydantic: 2.8.2
pyiceberg: <not installed>
sqlalchemy: 2.0.32
torch: 2.3.1.post100
xlsx2csv: <not installed>
xlsxwriter: <not installed>
1.6 was just released, which contains a fix for:
- https://github.com/pola-rs/polars/issues/18319
It sounds like it could be the same issue you're describing.
This, and we will improve it even more, as there are still a few places where we are linear when we could be O(1).
Hi @cmdlineluser, thanks for pointing out the existing issue and the new Polars version. Hi @ritchie46, as always, I very much appreciate your commitment to addressing issues in a timely manner and keeping Polars best in class - it's really great to be part of this community!
I upgraded to version 1.6 and definitely see the improvement:
- read_parquet (with Rust-native) – 30 seconds in Polars 1.6.0 versus 50 seconds in version 1.5.0
- scan_parquet (with Rust-native) – 30 seconds in Polars 1.6.0 versus 100 seconds in version 1.5.0
I'm definitely looking forward to further improvements to the Rust-native reader to bring it in line with PyArrow, which is still faster.
@ritchie46, are there plans to make scan_parquet accept an optional PyArrow filesystem, or is it a design decision to support Rust-native only?
Thank you!
We don't plan to accept a pyarrow filesystem in our native readers. We do support pyarrow datasets as scan functions.
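
Concretely, the supported route for PyArrow I/O is to wrap the filesystem in a dataset and scan that, as the reproducible example in this issue already does (a sketch reusing pyfs from above):

import polars as pl
import pyarrow.dataset as ds

# PyArrow handles the S3 I/O; Polars scans the resulting dataset lazily.
pyds = ds.dataset(
    source="parquet-writers-test/test_large_df.pq",
    filesystem=pyfs,
    format="parquet",
)
df = pl.scan_pyarrow_dataset(pyds).collect()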
The performance of very wide parquet files will improve further with @nameexhaustion's upcoming schema unification and metadata supertype work. This issue is on our roadmap.
@ritchie46 pinging as discussed...
Yes. We finished the schema updates. I expect this to also resolve more quadratic behavior. Curious about a run on the latest version.
Hi @ritchie46 @ptomecek
I just tried the same test as above with Polars 1.12.0 and still see fairly similar numbers:
Test 1: read_parquet
a. with PyArrow filesystem – 7 seconds
b. with Rust-native – 40 seconds

Test 2: scan_parquet + collect
a. with PyArrow filesystem – 8 seconds
b. with Rust-native – 30 seconds
Thank you!
Alright.. thanks. Will see if we can get a profile on this.
@vstolin can you do another run with Polars 1.13?
Hi @ritchie46, I just tested 1.13 and performance improved a lot - the two readers are practically on par.
Test 1: read_parquet
a. with PyArrow filesystem – 6 seconds
b. with Rust-native – 6 seconds

Test 2: scan_parquet + collect
a. with PyArrow filesystem – 8 seconds
b. with Rust-native – 8 seconds
Thank you for the fast turnaround!
@ritchie46 - Also, great job on making AWS profiles work! This makes AWS credential_process with a session token so much easier!
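
For completeness, a sketch of the simpler profile-based flow that comment refers to (assuming the native reader now resolves credentials from the active AWS profile via AWS_PROFILE, as the PyArrow write step above does; the exact mechanism isn't spelled out in this thread):

import os
import polars as pl

# Assumption: with profile support, the profile (including one backed by
# credential_process) supplies the session token, so no manual freezing
# of credentials through boto3 is needed.
os.environ["AWS_PROFILE"] = "my_profile"
df = pl.scan_parquet("s3://parquet-writers-test/test_large_df.pq").collect()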
Great to hear! :)
Then I'll close this as fixed. :+1: