Reading parquet using smart_open+pandas is 3x slower than pandas
Problem description
Reading a parquet file from S3 with smart_open + pandas + pyarrow is seriously slower (about 3x) than using just pandas + pyarrow. I tried tuning buffering and buffer_size independently, with no luck (see the sketch after the reproduction code below).
Steps/code to reproduce the problem
import datetime
import timeit
import boto3
import pandas as pd
import pyarrow
import s3path
import smart_open
PARQUET_URI_IN = "s3://PLEASE-USE-YOUR/OWN/FILE.parquet" # CUSTOMIZE! File size must be at least a few MiB.
BOTO3_VER = f"boto3=={boto3.__version__}"
PANDAS_VER = f"pandas=={pd.__version__}"
PYARROW_VER = f"pyarrow=={pyarrow.__version__}"
SMART_OPEN_VER = f"smart_open=={smart_open.__version__}"
class Timer:
    """Measure time used."""
    # Ref: https://stackoverflow.com/a/57931660/

    def __init__(self, round_n_digits: int = 0):
        self._round_n_digits = round_n_digits
        self._start_time = timeit.default_timer()

    def __call__(self) -> float:
        return timeit.default_timer() - self._start_time

    def __str__(self) -> str:
        return str(datetime.timedelta(seconds=round(self(), self._round_n_digits)))
# Warm up using boto3:
path = s3path.S3Path.from_uri(PARQUET_URI_IN)
timer = Timer()
boto3.client("s3").get_object(Bucket=str(path.bucket)[1:], Key=str(path.key))["Body"].read()
print(f"Warmed up a parquet file from S3 using {BOTO3_VER} in {timer}.")
# Read without smart_open:
timer = Timer()
df = pd.read_parquet(PARQUET_URI_IN, engine="pyarrow")
print(f"Read a dataframe from a parquet file from S3 using {PANDAS_VER} w/ {PYARROW_VER} in {timer}.")
# Read with smart_open:
timer = Timer()
with smart_open.open(PARQUET_URI_IN, "rb") as file:
    df = pd.read_parquet(file, engine="pyarrow")
print(f"Read a dataframe from a parquet file from S3 using {SMART_OPEN_VER} w/ {PANDAS_VER} w/ {PYARROW_VER} in {timer}.")
Versions
Please provide the output of:
import platform, sys, smart_open
print(platform.platform())
print("Python", sys.version)
print("smart_open", smart_open.__version__)
macOS-10.15.3-x86_64-i386-64bit
Python 3.8.4 | packaged by conda-forge | (default, Jul 17 2020, 14:54:34)
[Clang 10.0.0 ]
smart_open 2.1.0
Output
Trial 1:
Warmed up a parquet file from S3 using boto3==1.14.3 in 0:00:03.
Read a dataframe from a parquet file from S3 using pandas==1.0.5 w/ pyarrow==0.17.1 in 0:00:06.
Read a dataframe from a parquet file from S3 using smart_open==2.1.0 w/ pandas==1.0.5 w/ pyarrow==0.17.1 in 0:00:18.
Trial 2:
Warmed up a parquet file from S3 using boto3==1.14.3 in 0:00:02.
Read a dataframe from a parquet file from S3 using pandas==1.0.5 w/ pyarrow==0.17.1 in 0:00:05.
Read a dataframe from a parquet file from S3 using smart_open==2.1.0 w/ pandas==1.0.5 w/ pyarrow==0.17.1 in 0:00:16.
Checklist
Before you create the issue, please make sure you have:
- [x] Described the problem clearly
- [x] Provided a minimal reproducible example, including any required data
- [x] Provided the version numbers of the relevant software
Thanks for the detailed, high-quality report.
Do you know how the version without smart_open fetches data from S3? Is it using boto or not?
I believe pandas natively uses the boto3, s3fs, and fsspec packages to interact with S3. I don't know more.
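To isolate whether the s3fs file object alone accounts for the fast path, a direct s3fs read could be timed (a sketch; the bucket/key is a placeholder):

import pandas as pd
import s3fs

fs = s3fs.S3FileSystem()
with fs.open("YOUR-BUCKET/YOUR/FILE.parquet", "rb") as f:  # placeholder path
    df = pd.read_parquet(f, engine="pyarrow")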
Note that this issue doesn't exist with reading csv.bz2 files, for example, using pd.read_csv.
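For example, a read along these lines performs about the same with and without smart_open (a sketch with a placeholder URI; smart_open decompresses .bz2 transparently based on the file extension):

import pandas as pd
import smart_open

with smart_open.open("s3://YOUR-BUCKET/YOUR/FILE.csv.bz2", "r") as f:
    df = pd.read_csv(f)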
You mean when reading them from S3? So the issue is specific to the Parquet format on S3?
> You mean when reading them from S3?

Yes.

> So the issue is specific to the Parquet format on S3?

The issue is specific to using smart_open with pandas + boto3 to read parquet from S3. Why is it a third of the speed of pandas + boto3 + s3fs + fsspec?
As we know, parquet is a columnar data format. If the file has n columns of data, pandas could in theory try to read it in up to n streams.
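One way to test that hypothesis is to wrap the file object and log every call the parquet reader makes (a diagnostic sketch; TracingFile is a hypothetical helper, and any methods it doesn't define are delegated to the wrapped file):

import pandas as pd
import smart_open

class TracingFile:
    """Wrap a file object and print every read/seek the reader issues."""

    def __init__(self, inner):
        self._inner = inner

    def read(self, size=-1):
        print(f"read({size}) at offset {self._inner.tell()}")
        return self._inner.read(size)

    def seek(self, offset, whence=0):
        print(f"seek({offset}, {whence})")
        return self._inner.seek(offset, whence)

    def __getattr__(self, name):
        # Delegate everything else (tell, seekable, close, ...) to the wrapped file.
        return getattr(self._inner, name)

with smart_open.open(PARQUET_URI_IN, "rb") as file:
    df = pd.read_parquet(TracingFile(file), engine="pyarrow")

A seek-heavy trace (footer first, then individual column chunks) would explain why a stream optimized for sequential reads is slow here.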
Yes, that's what I meant. The issue appears only with smart_open and parquet, not with smart_open and csv (for example). That's a strong clue.
We'll look into this; thanks for the clear report. I can't promise a timeline, though, as we're all quite busy. If you're able to check yourself what requests pandas sends via boto3, versus what smart_open sends, that would be great; nothing jumps to mind immediately.
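On the smart_open side, botocore's event hooks can capture the outgoing requests when you control the session (a sketch; hooking the client that pandas/s3fs creates internally is harder):

import boto3
import pandas as pd
import smart_open

def log_http_request(request, **kwargs):
    # request is an AWSPreparedRequest; a Range header reveals the read pattern.
    print(request.method, request.url, request.headers.get("Range"))

session = boto3.Session()
session.events.register("before-send.s3", log_http_request)

with smart_open.open(PARQUET_URI_IN, "rb", transport_params={"session": session}) as file:
    df = pd.read_parquet(file, engine="pyarrow")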
Enabling DEBUG-level logs may or may not help, but I'll leave this to the developers.
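For reference, that would be something along these lines (standard Python logging; botocore's DEBUG output in particular is very verbose):

import logging

logging.basicConfig(level=logging.DEBUG)
logging.getLogger("smart_open").setLevel(logging.DEBUG)
logging.getLogger("botocore").setLevel(logging.DEBUG)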