smart_open
Slow HTTP read performance on AWS Lambda
Problem description
I am trying to stream data from a website over HTTP directly into S3 with smart_open, inside an AWS Lambda function. Testing has shown that the HTTP read with smart_open is slower than the same function using requests directly by about an order of magnitude, so the examples below focus on that read path for simplicity of reproduction.
Tests on a local machine do not show the same discrepancy.
I may well be doing this wrong, as I couldn't find an example of how to do this, but I'm happy to contribute one if someone can put me right.
Steps/code to reproduce the problem
Fast version ~ 5 seconds
import logging
import requests
from smart_open import open as s_open

CHUNK_SIZE = 100 * 1024**2
MP_UPLOAD_SIZE = 50 * 1024**2

with s_open("/tmp/100.bin", 'wb') as fout:
    with requests.get("https://speed.hetzner.de/100MB.bin", stream=True) as r:
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=CHUNK_SIZE):
            fout.write(chunk)
Slow version ~ 170 seconds
import logging
from smart_open import open as s_open

CHUNK_SIZE = 100 * 1024**2
MP_UPLOAD_SIZE = 50 * 1024**2

with s_open("/tmp/100.bin", 'wb') as fout:
    with s_open('https://speed.hetzner.de/100MB.bin', 'rb') as fin:
        chunk = b'0'
        while len(chunk) > 0:
            chunk = fin.read(CHUNK_SIZE)
            fout.write(chunk)
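For reference, what I actually want to run on Lambda is the smart_open-only version writing straight to S3 rather than /tmp. A rough sketch of that pattern follows (the bucket/key are placeholders, and feeding MP_UPLOAD_SIZE to the s3 writer's min_part_size transport parameter is my intent here, not something shown in the snippets above):

from smart_open import open as s_open

CHUNK_SIZE = 100 * 1024**2
MP_UPLOAD_SIZE = 50 * 1024**2

# placeholder destination - not a real bucket/key
DEST = "s3://my-bucket/100.bin"

with s_open(DEST, 'wb', transport_params={'min_part_size': MP_UPLOAD_SIZE}) as fout:
    with s_open('https://speed.hetzner.de/100MB.bin', 'rb') as fin:
        while True:
            chunk = fin.read(CHUNK_SIZE)
            if not chunk:
                break
            fout.write(chunk)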
Versions
Linux-4.14.255-276-224.499.amzn2.x86_64-x86_64-with-glibc2.26
Python 3.9.13 (main, Jun 10 2022, 16:49:31) [GCC 7.3.1 20180712 (Red Hat 7.3.1-15)]
smart_open 6.0.0
Checklist
Before you create the issue, please make sure you have:
- [x] Described the problem clearly
- [x] Provided a minimal reproducible example, including any required data
- [x] Provided the version numbers of the relevant software
The fast version does not run correctly because of a missing requests import. But even after fixing that, I still could not reproduce the problem.
$ cat gitignore/slow.py
import logging
from smart_open import open as s_open

CHUNK_SIZE = 100 * 1024**2
MP_UPLOAD_SIZE = 50 * 1024**2

with s_open("/tmp/100.bin", 'wb') as fout:
    with s_open('https://speed.hetzner.de/100MB.bin', 'rb') as fin:
        chunk = b'0'
        while len(chunk) > 0:
            chunk = fin.read(CHUNK_SIZE)
            fout.write(chunk)
$ cat gitignore/fast.py
import logging
import requests
from smart_open import open as s_open

CHUNK_SIZE = 100 * 1024**2
MP_UPLOAD_SIZE = 50 * 1024**2

with s_open("/tmp/100.bin", 'wb') as fout:
    with requests.get("https://speed.hetzner.de/100MB.bin", stream=True) as r:
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=CHUNK_SIZE):
            fout.write(chunk)
$ time python gitignore/slow.py
real 1m43.211s
user 0m14.324s
sys 0m18.733s
$ time python gitignore/fast.py
real 2m4.085s
user 0m0.987s
sys 0m0.763s
Thanks for pointing out the typo, will fix in my original post.
Tested locally on 3.9 to rule that out:
python3.9 fast.py 2.20s user 2.02s system 6% cpu 1:03.77 total
python3.9 slow.py 8.83s user 5.79s system 18% cpu 1:18.89 total
And double checked again on Lambda:
fast
REPORT RequestId: xxxx Duration: 3457.29 ms Billed Duration: 3458 ms Memory Size: 350 MB Max Memory Used: 258 MB Init Duration: 407.47 ms
slow
REPORT RequestId: xxx Duration: 121736.98 ms Billed Duration: 121737 ms Memory Size: 350 MB Max Memory Used: 259 MB Init Duration: 438.44 ms
As you can see, the difference is negligible locally but huge on Lambda. Not sure what is going on there - possibly something to do with memory availability?
I don't have much experience with Lambda, so it's difficult for me to comment.
It's odd that the slow version is still noticeably slower locally (and uses several times more CPU time), though... Are you able to investigate why there is such a difference? There is a small chance that this difference is what's causing the huge slowdown on Lambda.
The way I would approach this is:
- Make the slow version behave identically to the fast one locally (possibly by modifying smart_open) - see the timing sketch after this list
- Re-run the slow version on Lambda and test the duration
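A download-only timing harness along these lines (just a sketch against the same public test file, no upload involved) would make the local comparison concrete:

import time

import requests
from smart_open import open as s_open

URL = "https://speed.hetzner.de/100MB.bin"
CHUNK_SIZE = 100 * 1024**2


def time_requests():
    # stream the body with requests and throw it away
    start = time.perf_counter()
    with requests.get(URL, stream=True) as r:
        r.raise_for_status()
        for _ in r.iter_content(chunk_size=CHUNK_SIZE):
            pass
    return time.perf_counter() - start


def time_smart_open():
    # stream the body with smart_open's HTTP reader and throw it away
    start = time.perf_counter()
    with s_open(URL, 'rb') as fin:
        while fin.read(CHUNK_SIZE):
            pass
    return time.perf_counter() - start


print("requests:   %.1fs" % time_requests())
print("smart_open: %.1fs" % time_smart_open())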
I am able to reproduce this with S3 and an HTTPS request as well.
With the Python requests library reading the HTTPS stream: < 1 s to upload a 70 MB file.
With smart_open it takes 750 s to upload the same 70 MB file.
requests:
with requests.get(uri, stream=True) as r:
    r.raise_for_status()
    with sm_open(f"s3://{bucket_name}/{file_path}/{file_name}", "wb",
                 transport_params=transport_params) as fout:
        for chunk in r.iter_content(chunk_size=CHUNK_SIZE):
            fout.write(chunk)
~10 MB/s - I believe this is because I have the chunk size set to 10 MB.
sm_open:
with sm_open(uri, "rb") as fin:
    with sm_open(f"s3://{bucket_name}/{file_path}/{file_name}", "wb",
                 transport_params=transport_params) as fout:
        for line in fin:
            fout.write(line)
~0.093 MB/s - I could try chunking like above (see the sketch below), but I wouldn't expect a slowdown of this order of magnitude.
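A chunked variant of that copy would look something like this (just a sketch, reusing the same variables as the snippets above); shutil.copyfileobj performs the same read/write loop:

import shutil

from smart_open import open as sm_open

CHUNK_SIZE = 10 * 1024**2  # 10 MB reads instead of line-by-line iteration

# uri, bucket_name, file_path, file_name and transport_params as in the snippets above
with sm_open(uri, "rb") as fin:
    with sm_open(f"s3://{bucket_name}/{file_path}/{file_name}", "wb",
                 transport_params=transport_params) as fout:
        # copy in fixed-size chunks; the same read/write loop, written out by shutil
        shutil.copyfileobj(fin, fout, length=CHUNK_SIZE)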
Are you able to profile the code to work out where the time-consuming part is? It seems that downloading is slow, because you're using smart_open for the upload in both cases. If so, then we can probably eliminate the upload component altogether, and look for the problem in the download component.
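For example, something like this sketch (not taken from the thread) profiles only the smart_open download and throws the bytes away, so the upload never enters the picture:

import cProfile
import pstats

from smart_open import open as s_open

CHUNK_SIZE = 10 * 1024**2


def download_only(uri):
    # read the HTTP stream via smart_open and discard it - no upload involved
    with s_open(uri, 'rb') as fin:
        while fin.read(CHUNK_SIZE):
            pass


profiler = cProfile.Profile()
profiler.enable()
download_only("https://speed.hetzner.de/100MB.bin")
profiler.disable()

# show the 20 most expensive calls by cumulative time
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)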
Also, make sure compression isn't causing the slowdown: by default, smart_open uses the file extension to transparently handle compression.
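To rule that out, you can switch the inference off explicitly; a quick sketch (the compression keyword should be available in the 6.0.0 reported above, but check your version's documentation):

from smart_open import open as s_open

URL = "https://speed.hetzner.de/100MB.bin"

# 'disable' turns off extension-based compression inference,
# ruling out transparent (de)compression as the bottleneck
with s_open(URL, 'rb', compression='disable') as fin:
    while fin.read(10 * 1024**2):
        pass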