smart_open
Streaming data transfer rates
Hi all - thank you for the great package.
This is just a quick question about expected download rates. I'm seeing rates of ~2 MB/s streaming data from an S3 bucket to a Lambda function in the same region. In total, I can stream a 300 MB file in ~159 seconds.
Are these rates to be expected using the package or is there something I am missing?
Thank you!
You can see benchmarks by running pytest integration-tests/test_s3.py::test_s3_performance. This test uses the default buffer_size for smart_open.s3.open. You can probably increase performance substantially by increasing the buffer_size kwarg passed into smart_open.open. This might require increasing the memory allocated to your Lambda function if you use a very large buffer size.
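For reference, a minimal sketch of passing a larger buffer_size through transport_params (the URI and buffer size below are just placeholders; adjust to your setup):

import smart_open

# 4 MB read buffer instead of the default; larger buffers mean fewer round trips to S3
with smart_open.open("s3://my-bucket/my-key.txt",
                     transport_params=dict(buffer_size=4 * 1024 * 1024)) as fin:
    for line in fin:
        pass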
Thanks for the response! I've been modifying the buffer_size value from 1024 to 262144, multiplying by 4 each time (so 1024, 4096, ...), and I'm still getting a very similar transfer speed.
Just checked that the default rate is much higher. I've now benchmarked much larger values, from 4 * 128 * 1024 up to 32 * 128 * 1024 in 4x increments, but I'm still seeing similar results.
What is "default rate" and how did you check it's "much higher"?
By "default rate" I mean DEFAULT_BUFFER_SIZE
that is defined here:
https://github.com/RaRe-Technologies/smart_open/blob/master/smart_open/s3.py#L38
Running the integration-tests from the root directory gives me:
(venv) ➜ smart_open git:(master) ✗ pytest integration-tests/test_s3.py
ERROR: usage: pytest [options] [file_or_dir] [file_or_dir] [...]
pytest: error: unrecognized arguments: --reruns --reruns-delay 1 integration-tests/test_s3.py
inifile: /Users/fonz/Documents/Projects/smart_open/tox.ini
rootdir: /Users/fonz/Documents/Projects/smart_open
When you say “streaming”, are you reading from S3, writing to S3, or both? I think the buffer_size kwarg is related to reading, while the min_part_size kwarg is related to writing.
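For example, this is how I understand the two kwargs splitting across reads and writes (the bucket and key names below are placeholders):

import smart_open

# Reading: buffer_size controls how much data each request pulls from S3.
with smart_open.open("s3://my-bucket/input.txt", "rb",
                     transport_params=dict(buffer_size=4 * 1024 * 1024)) as fin:
    payload = fin.read()

# Writing: min_part_size controls the size of each multipart-upload part.
with smart_open.open("s3://my-bucket/output.txt", "wb",
                     transport_params=dict(min_part_size=16 * 1024 * 1024)) as fout:
    fout.write(payload)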
I'm specifically reading from S3. I seem to be able to download the 330 MB file in 6 seconds using boto3's get_object().read(), but using smart_open this seems to take 159 seconds.
Additionally, using get_object().iter_lines() seems to iterate through the file in 8 seconds.
I just want to check if I'm missing anything here!
20x slower is really weird. There should be very little overhead in smart_open, so the numbers ought to ± match.
Btw get_object().iter_lines() didn't exist back then, maybe it's worth changing our "S3 read" implementation to that @mpenkov? Pros: less code in smart_open, easier maintenance, free updates when the boto API changes. Cons: ?
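Roughly, the idea would be something like this sketch (just an illustration of delegating to botocore's StreamingBody.iter_lines(), not smart_open's actual code):

import boto3

def iter_s3_lines(bucket, key):
    # Illustration only: let botocore handle the buffering and line splitting.
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    for line in body.iter_lines():
        yield line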
@willgdjones can you post a full reproducible example, with the exact code you're running? Both for the smart_open code and the native boto code. Thanks.
I've noticed that actually decompressing the file takes up a large amount of time that I was not previously factoring in. The following loop for the same file takes ~50 seconds:
with gzip.open(s3_client.get_object(Bucket=bucket, Key=key)["Body"]) as gf:
    for x in gf:
        pass
whereas this loop takes ~8 seconds:
for x in s3_client.get_object(Bucket=bucket, Key=key)["Body"].iter_lines():
    pass
The smart_open code I am running looks like this, and takes ~159 seconds:
from smart_open import open as new_open

for line in new_open(f"s3://{bucket}/{key}", transport_params=dict(buffer_size=32*128*1024)):
    pass
Is there any indication as to why this is the case? Seems strange that smart_open would take 3x longer for gzipped files than the first approach; is it just because it is decompressing chunk-by-chunk? Just curious because we've been seeing similar issues with very slow streaming of small to medium sized gzipped files via smart_open, with basically identical usage to what is described here.
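For what it's worth, one way to separate the transport cost from the decompression cost is to read the same object through smart_open with the gzip layer disabled (assuming a smart_open version that still accepts the ignore_ext flag; newer releases spell this compression='disable'):

from smart_open import open as new_open

# Same read, but skip decompression so only the raw S3 transport is timed.
with new_open(f"s3://{bucket}/{key}", "rb", ignore_ext=True,
              transport_params=dict(buffer_size=32 * 128 * 1024)) as fin:
    while fin.read(1024 * 1024):
        pass

If this raw loop is fast, the bottleneck is likely the chunk-by-chunk gzip decompression rather than smart_open's S3 transport.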