Reading parquet using smart_open+pandas is 3x slower than pandas
Problem description
Reading a parquet file from S3 with smart_open + pandas + pyarrow is seriously slower (about 3x) than using just pandas + pyarrow. I tried tuning buffering and buffer_size independently, with no luck (see the sketch after the reproduction code below).
Steps/code to reproduce the problem
import datetime
import timeit
import boto3
import pandas as pd
import pyarrow
import s3path
import smart_open
PARQUET_URI_IN = "s3://PLEASE-USE-YOUR/OWN/FILE.parquet" # CUSTOMIZE! File size must be at least a few MiB.
BOTO3_VER = f"boto3=={boto3.__version__}"
PANDAS_VER = f"pandas=={pd.__version__}"
PYARROW_VER = f"pyarrow=={pyarrow.__version__}"
SMART_OPEN_VER = f"smart_open=={smart_open.__version__}"
class Timer:
    """Measure time used."""
    # Ref: https://stackoverflow.com/a/57931660/

    def __init__(self, round_n_digits: int = 0):
        self._round_n_digits = round_n_digits
        self._start_time = timeit.default_timer()

    def __call__(self) -> float:
        return timeit.default_timer() - self._start_time

    def __str__(self) -> str:
        return str(datetime.timedelta(seconds=round(self(), self._round_n_digits)))
# Warm up using boto3:
path = s3path.S3Path.from_uri(PARQUET_URI_IN)
timer = Timer()
boto3.client("s3").get_object(Bucket=str(path.bucket)[1:], Key=str(path.key))["Body"].read()
print(f"Warmed up a parquet file from S3 using {BOTO3_VER} in {timer}.")
# Read without smart_open:
timer = Timer()
df = pd.read_parquet(PARQUET_URI_IN, engine="pyarrow")
print(f"Read a dataframe from a parquet file from S3 using {PANDAS_VER} w/ {PYARROW_VER} in {timer}.")
# Read with smart_open:
timer = Timer()
with smart_open.open(PARQUET_URI_IN, "rb") as file:
    df = pd.read_parquet(file, engine="pyarrow")
print(f"Read a dataframe from a parquet file from S3 using {SMART_OPEN_VER} w/ {PANDAS_VER} w/ {PYARROW_VER} in {timer}.")
Versions
Please provide the output of:
import platform, sys, smart_open
print(platform.platform())
print("Python", sys.version)
print("smart_open", smart_open.__version__)
macOS-10.15.3-x86_64-i386-64bit
Python 3.8.4 | packaged by conda-forge | (default, Jul 17 2020, 14:54:34)
[Clang 10.0.0 ]
smart_open 2.1.0
Output
Trial 1:
Warmed up a parquet file from S3 using boto3==1.14.3 in 0:00:03.
Read a dataframe from a parquet file from S3 using pandas==1.0.5 w/ pyarrow==0.17.1 in 0:00:06.
Read a dataframe from a parquet file from S3 using smart_open==2.1.0 w/ pandas==1.0.5 w/ pyarrow==0.17.1 in 0:00:18.
Trial 2:
Warmed up a parquet file from S3 using boto3==1.14.3 in 0:00:02.
Read a dataframe from a parquet file from S3 using pandas==1.0.5 w/ pyarrow==0.17.1 in 0:00:05.
Read a dataframe from a parquet file from S3 using smart_open==2.1.0 w/ pandas==1.0.5 w/ pyarrow==0.17.1 in 0:00:16.
Checklist
Before you create the issue, please make sure you have:
- [x] Described the problem clearly
- [x] Provided a minimal reproducible example, including any required data
- [x] Provided the version numbers of the relevant software
Thanks for the detailed, high-quality report.
Do you know how the version without smart_open fetches data from S3? Is it using boto or not?
I believe pandas natively uses the boto3, s3fs, and fsspec packages to interact with S3. I don't know more.
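To isolate whether the s3fs file object alone accounts for the fast path, a direct s3fs read could be timed (a sketch; the bucket/key is a placeholder):

import pandas as pd
import s3fs

fs = s3fs.S3FileSystem()
with fs.open("YOUR-BUCKET/YOUR/FILE.parquet", "rb") as f:  # placeholder path
    df = pd.read_parquet(f, engine="pyarrow")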
Note that this issue doesn't exist with reading csv.bz2 files, for example, using pd.read_csv.
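For example, a read along these lines performs about the same with and without smart_open (a sketch with a placeholder URI; smart_open decompresses .bz2 transparently based on the file extension):

import pandas as pd
import smart_open

with smart_open.open("s3://YOUR-BUCKET/YOUR/FILE.csv.bz2", "r") as f:
    df = pd.read_csv(f)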
You mean when reading them from S3? So the issue is specific to the Parquet format on S3?
> You mean when reading them from S3?

Yes.

> So the issue is specific to the Parquet format on S3?

The issue is specific to using smart_open with pandas + boto3 to read parquet from S3. Why is it a third of the speed of pandas + boto3 + s3fs + fsspec?
As we know, parquet is a columnar data format. If the file has n columns of data, pandas could in theory try to read it in up to n streams.
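One way to test that hypothesis is to wrap the file object and log every call the parquet reader makes (a diagnostic sketch; TracingFile is a hypothetical helper, and any methods it doesn't define are delegated to the wrapped file):

import pandas as pd
import smart_open

class TracingFile:
    """Wrap a file object and print every read/seek the reader issues."""

    def __init__(self, inner):
        self._inner = inner

    def read(self, size=-1):
        print(f"read({size}) at offset {self._inner.tell()}")
        return self._inner.read(size)

    def seek(self, offset, whence=0):
        print(f"seek({offset}, {whence})")
        return self._inner.seek(offset, whence)

    def __getattr__(self, name):
        # Delegate everything else (tell, seekable, close, ...) to the wrapped file.
        return getattr(self._inner, name)

with smart_open.open(PARQUET_URI_IN, "rb") as file:
    df = pd.read_parquet(TracingFile(file), engine="pyarrow")

A seek-heavy trace (footer first, then individual column chunks) would explain why a stream optimized for sequential reads is slow here.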
Yes, that's what I meant. The issue appears only with smart_open and parquet, not with smart_open and csv (for example). That's a strong clue.
We'll look into this; thanks for the clear report. I can't promise a timeline, though, as we're all quite busy. If you're able to check yourself what requests pandas sends via boto3, versus what smart_open sends, that would be great; nothing jumps to mind immediately.
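On the smart_open side, botocore's event hooks can capture the outgoing requests when you control the session (a sketch; hooking the client that pandas/s3fs creates internally is harder):

import boto3
import pandas as pd
import smart_open

def log_http_request(request, **kwargs):
    # request is an AWSPreparedRequest; a Range header reveals the read pattern.
    print(request.method, request.url, request.headers.get("Range"))

session = boto3.Session()
session.events.register("before-send.s3", log_http_request)

with smart_open.open(PARQUET_URI_IN, "rb", transport_params={"session": session}) as file:
    df = pd.read_parquet(file, engine="pyarrow")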
Enabling DEBUG-level logs may or may not help, but I'll leave this to the developers.
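For reference, that would be something along these lines (standard Python logging; botocore's DEBUG output in particular is very verbose):

import logging

logging.basicConfig(level=logging.DEBUG)
logging.getLogger("smart_open").setLevel(logging.DEBUG)
logging.getLogger("botocore").setLevel(logging.DEBUG)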