
Performance degradation using Rust-native parquet reader from AWS S3 for a dataframe with 12,000 columns.

Open · vstolin opened this issue 1 year ago

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import boto3
import os
import numpy as np
import polars as pl
import pyarrow.dataset as ds
import pyarrow.fs as fs
import pyarrow.parquet as pq

# create and store dataframe on AWS S3
large_num_cols = 12_000
df = pl.DataFrame({f"col_{i}": np.random.normal(loc=0.0, scale=1.0, size=24_000) for i in range(large_num_cols)})
os.environ["AWS_PROFILE"] = "my_profile"
pyfs = fs.S3FileSystem(retry_strategy=fs.AwsStandardS3RetryStrategy(max_attempts=10))

df.write_parquet(
    file="parquet-writers-test/test_large_df.pq",
    compression="lz4",
    use_pyarrow=True,
    pyarrow_options={"filesystem": pyfs},
)

# read parquet with Pyarrow – around 6 seconds
df1 = pl.read_parquet(
    source="parquet-writers-test/test_large_df.pq",
    use_pyarrow=True,
    pyarrow_options={"filesystem": pyfs},
)

# read parquet with Rust-native – around 50 seconds
session = boto3.Session(profile_name="my_profile")
credentials = session.get_credentials().get_frozen_credentials()
storage_options = {
    "aws_access_key_id": credentials.access_key,
    "aws_secret_access_key": credentials.secret_key,
    "aws_session_token": credentials.token,
    "aws_region": session.region_name,
}
source = "s3://parquet-writers-test/test_large_df.pq"
df2 = pl.read_parquet(source, storage_options=storage_options)

# scan pyarrow dataset with Pyarrow filesystem – around 6 seconds
pyds = ds.dataset(
    source="parquet-writers-test/test_large_df.pq",
    filesystem=pyfs,
    format="parquet",
)
df3 = pl.scan_pyarrow_dataset(pyds).collect()

# scan parquet with Rust-native – around 100 seconds
df4 = pl.scan_parquet(source, storage_options=storage_options).collect()
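
For reference, the timings quoted throughout this thread were measured informally. Below is a minimal sketch of how each variant could be timed (the timed helper is hypothetical and not part of the original report), reusing pyfs, source and storage_options defined above:

import time

def timed(label, fn):
    # Wall-clock timing of a single run; S3 latency varies, so averaging
    # several runs gives more stable numbers.
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.1f} s")

timed("read_parquet (Pyarrow filesystem)", lambda: pl.read_parquet(
    source="parquet-writers-test/test_large_df.pq",
    use_pyarrow=True,
    pyarrow_options={"filesystem": pyfs},
))
timed("scan_parquet + collect (Rust-native)", lambda: pl.scan_parquet(
    source, storage_options=storage_options
).collect())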

Log output

No response

Issue description

We observed a significant slowdown when reading a parquet file with 12,000 columns from an AWS S3 bucket using the Rust-native parquet reader compared to the Pyarrow implementation:

  • read_parquet (with Pyarrow filesystem) – around 6 seconds
  • read_parquet (with Rust-native) – around 50 seconds
  • scan_pyarrow_dataset (with Pyarrow filesystem) – around 6 seconds
  • scan_parquet (with Rust-native) – 100 seconds

Expected behavior

This would be less of an issue if scan_parquet allowed using a Pyarrow filesystem, similar to read_parquet and scan_pyarrow_dataset.
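
In the meantime, the pyarrow dataset route shown in the reproduction serves as a workaround when a Pyarrow filesystem is needed together with lazy scanning. A minimal sketch, reusing the pyfs filesystem from the example above:

import pyarrow.dataset as ds
import polars as pl

# `pyfs` is the S3FileSystem constructed in the reproducible example above.
# Polars pushes the column selection down to the pyarrow dataset, so only the
# requested columns are fetched from S3.
pyds = ds.dataset(
    "parquet-writers-test/test_large_df.pq",
    filesystem=pyfs,
    format="parquet",
)
subset = pl.scan_pyarrow_dataset(pyds).select(["col_0", "col_1"]).collect()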

Installed versions

--------Version info---------
Polars:               1.5.0
Index type:           UInt32
Platform:             Linux-4.18.0-513.18.1.el8_9.x86_64-x86_64-with-glibc2.28
Python:               3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.2.1
connectorx:           <not installed>
deltalake:            0.18.2
fastexcel:            <not installed>
fsspec:               2024.6.1
gevent:               <not installed>
great_tables:         0.10.0
hvplot:               0.10.0
matplotlib:           3.9.1
nest_asyncio:         1.6.0
numpy:                1.24.4
openpyxl:             3.1.5
pandas:               2.1.4
pyarrow:              16.1.0
pydantic:             2.8.2
pyiceberg:            <not installed>
sqlalchemy:           2.0.32
torch:                2.3.1.post100
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

vstolin avatar Aug 28 '24 16:08 vstolin

1.6 was just released, which contains a fix for:

  • https://github.com/pola-rs/polars/issues/18319

It sounds like it could be the same issue you're describing.

cmdlineluser avatar Aug 28 '24 19:08 cmdlineluser

This, and we will improve it even more, as there are still a few places where we are linear when we could be O(1).

ritchie46 avatar Aug 29 '24 09:08 ritchie46

Hi @cmdlineluser, thanks for pointing out the existing issue and the new Polars version. Hi @ritchie46, as always I very much appreciate your commitment to addressing issues quickly and keeping Polars best in class - it's really great to be part of this community!

I upgraded to version 1.6 and definitely see the improvement:

  • read_parquet (with Rust-native) – 30 seconds in Polars 1.6.0 versus 50 seconds in 1.5.0
  • scan_parquet (with Rust-native) – 30 seconds in Polars 1.6.0 versus 100 seconds in 1.5.0

I'm definitely looking forward to further improvements to the Rust-native reader to bring it in line with Pyarrow, which is still faster.

@ritchie46 are there plans to make scan_parquet accept an optional Pyarrow filesystem, or is it a design decision to support the Rust-native reader only?

Thank you!

vstolin avatar Aug 29 '24 13:08 vstolin

We don't plan to accept a pyarrow filesystem in our native readers. We do support pyarrow datasets via the scan functions.

The performance of very wide parquet files will further improve with @nameexhaustion's upcoming schema unification and metadata supertype work. This is on our roadmap.

ritchie46 avatar Aug 29 '24 13:08 ritchie46

@ritchie46 pinging as discussed...

ptomecek avatar Nov 07 '24 21:11 ptomecek

Yes. We finished the schema updates. I expect this to also resolve more of the quadratic behavior. Curious to see a run on the latest release.

ritchie46 avatar Nov 07 '24 21:11 ritchie46

Hi @ritchie46 @ptomecek

I just tried the same test as above with Polars 1.12.0 and still see fairly similar numbers:

Test 1: read parquet
  a. with pyarrow filesystem - 7 seconds
  b. with rust-native - 40 seconds

Test 2: scan parquet + collect
  a. with pyarrow filesystem - 8 seconds
  b. with rust-native - 30 seconds

Thank you!

vstolin avatar Nov 07 '24 23:11 vstolin

Alright, thanks. Will see if we can get a profile on this.

ritchie46 avatar Nov 07 '24 23:11 ritchie46

@vstolin can you do another run with Polars 1.13?

ritchie46 avatar Nov 12 '24 13:11 ritchie46

Hi @ritchie46, I just tested 1.13 and performance improved a lot - the two readers are practically on par:

Test 1: read parquet
  a. with pyarrow filesystem - 6 seconds
  b. with rust-native - 6 seconds

Test 2: scan parquet + collect
  a. with pyarrow filesystem - 8 seconds
  b. with rust-native - 8 seconds

Thank you for the fast turnaround!

vstolin avatar Nov 12 '24 14:11 vstolin

@ritchie46 - Also, great job on making AWS profiles work! This makes using AWS credential_process with a session token so much easier!
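
A minimal sketch of the simplification this enables, assuming (as this comment suggests) that the named profile, including any credential_process and session token, is now resolved automatically; the profile and bucket names are the ones from the reproduction:

import os
import polars as pl

# With profile-based credential resolution, the boto3 session and
# frozen-credentials steps from the reproducible example are no longer
# needed; credentials come from the named profile.
os.environ["AWS_PROFILE"] = "my_profile"
df = pl.scan_parquet("s3://parquet-writers-test/test_large_df.pq").collect()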

vstolin avatar Nov 12 '24 14:11 vstolin

Great to hear! :)

Then I'll close this as fixed. :+1:

ritchie46 avatar Nov 12 '24 14:11 ritchie46