smart_open
Slow HTTP read performance on AWS Lambda
Problem description
I am trying to stream data from a website over HTTP directly into S3 with smart_open, inside an AWS Lambda function. Testing has shown that the HTTP read with smart_open is slower than the same function using requests directly by about an order of magnitude, so the examples below focus on that read path for simplicity of reproduction.
Tests on a local machine do not show the same discrepancy.
I may well be doing this wrong, as I couldn't find an example of how to do this, but I'm happy to contribute one if someone can put me right.
Steps/code to reproduce the problem
Fast version ~ 5 seconds
import logging
import requests
from smart_open import open as s_open

CHUNK_SIZE = 100 * 1024**2
MP_UPLOAD_SIZE = 50 * 1024**2

with s_open("/tmp/100.bin", 'wb') as fout:
    with requests.get("https://speed.hetzner.de/100MB.bin", stream=True) as r:
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=CHUNK_SIZE):
            fout.write(chunk)
Slow version ~ 170 seconds
import logging
from smart_open import open as s_open

CHUNK_SIZE = 100 * 1024**2
MP_UPLOAD_SIZE = 50 * 1024**2

with s_open("/tmp/100.bin", 'wb') as fout:
    with s_open('https://speed.hetzner.de/100MB.bin', 'rb') as fin:
        chunk = b'0'
        while len(chunk) > 0:
            chunk = fin.read(CHUNK_SIZE)
            fout.write(chunk)
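For reference, what I actually want to run on Lambda is the smart_open-only version writing straight to S3 rather than /tmp. A rough sketch of that pattern follows (the bucket/key are placeholders, and feeding MP_UPLOAD_SIZE to the s3 writer's min_part_size transport parameter is my intent here, not something shown in the snippets above):

from smart_open import open as s_open

CHUNK_SIZE = 100 * 1024**2
MP_UPLOAD_SIZE = 50 * 1024**2

# placeholder destination - not a real bucket/key
DEST = "s3://my-bucket/100.bin"

with s_open(DEST, 'wb', transport_params={'min_part_size': MP_UPLOAD_SIZE}) as fout:
    with s_open('https://speed.hetzner.de/100MB.bin', 'rb') as fin:
        while True:
            chunk = fin.read(CHUNK_SIZE)
            if not chunk:
                break
            fout.write(chunk)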
Versions
Linux-4.14.255-276-224.499.amzn2.x86_64-x86_64-with-glibc2.26
Python 3.9.13 (main, Jun 10 2022, 16:49:31) [GCC 7.3.1 20180712 (Red Hat 7.3.1-15)]
smart_open 6.0.0
Checklist
Before you create the issue, please make sure you have:
- [x] Described the problem clearly
- [x] Provided a minimal reproducible example, including any required data
- [x] Provided the version numbers of the relevant software
The fast version does not run correctly because of a missing requests import. But even after fixing that, I still could not reproduce the problem.
$ cat gitignore/slow.py
import logging
from smart_open import open as s_open

CHUNK_SIZE = 100 * 1024**2
MP_UPLOAD_SIZE = 50 * 1024**2

with s_open("/tmp/100.bin", 'wb') as fout:
    with s_open('https://speed.hetzner.de/100MB.bin', 'rb') as fin:
        chunk = b'0'
        while len(chunk) > 0:
            chunk = fin.read(CHUNK_SIZE)
            fout.write(chunk)
$ cat gitignore/fast.py
import logging
import requests
from smart_open import open as s_open

CHUNK_SIZE = 100 * 1024**2
MP_UPLOAD_SIZE = 50 * 1024**2

with s_open("/tmp/100.bin", 'wb') as fout:
    with requests.get("https://speed.hetzner.de/100MB.bin", stream=True) as r:
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=CHUNK_SIZE):
            fout.write(chunk)
$ time python gitignore/slow.py
real 1m43.211s
user 0m14.324s
sys 0m18.733s
$ time python gitignore/fast.py
real 2m4.085s
user 0m0.987s
sys 0m0.763s
Thanks for pointing out the typo, will fix in my original post.
Tested locally on 3.9 to rule that out:
python3.9 fast.py 2.20s user 2.02s system 6% cpu 1:03.77 total
python3.9 slow.py 8.83s user 5.79s system 18% cpu 1:18.89 total
And double checked again on Lambda:
fast
REPORT RequestId: xxxx Duration: 3457.29 ms Billed Duration: 3458 ms Memory Size: 350 MB Max Memory Used: 258 MB Init Duration: 407.47 ms
slow
REPORT RequestId: xxx Duration: 121736.98 ms Billed Duration: 121737 ms Memory Size: 350 MB Max Memory Used: 259 MB Init Duration: 438.44 ms
As you can see, the difference is negligible locally but huge on Lambda. Not sure what is going on there - possibly something to do with memory availability?
I don't have much experience with Lambda, so it's difficult for me to comment.
It's odd that the slow version is still noticeably slower locally (and uses several times more CPU time), though... Are you able to investigate why there is such a difference? There is a small chance that this difference is what's causing the huge slowdown on Lambda.
The way I would approach this is:
- Make the slow version behave identically to the fast one locally (possibly by modifying smart_open) - see the timing sketch after this list
- Re-run the slow version on Lambda and test the duration
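A download-only timing harness along these lines (just a sketch against the same public test file, no upload involved) would make the local comparison concrete:

import time

import requests
from smart_open import open as s_open

URL = "https://speed.hetzner.de/100MB.bin"
CHUNK_SIZE = 100 * 1024**2


def time_requests():
    # stream the body with requests and throw it away
    start = time.perf_counter()
    with requests.get(URL, stream=True) as r:
        r.raise_for_status()
        for _ in r.iter_content(chunk_size=CHUNK_SIZE):
            pass
    return time.perf_counter() - start


def time_smart_open():
    # stream the body with smart_open's HTTP reader and throw it away
    start = time.perf_counter()
    with s_open(URL, 'rb') as fin:
        while fin.read(CHUNK_SIZE):
            pass
    return time.perf_counter() - start


print("requests:   %.1fs" % time_requests())
print("smart_open: %.1fs" % time_smart_open())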
I am able to reproduce this with S3 and an HTTPS request as well.
With the Python requests library reading the HTTPS stream: < 1 s to upload a 70 MB file.
With smart_open it takes 750 s to upload the same 70 MB file.
requests:
with requests.get(uri, stream=True) as r:
    r.raise_for_status()
    with sm_open(f"s3://{bucket_name}/{file_path}/{file_name}", "wb",
                 transport_params=transport_params) as fout:
        for chunk in r.iter_content(chunk_size=CHUNK_SIZE):
            fout.write(chunk)
~10 MB/s - I believe this is because I have the chunk size set to 10 MB.
sm_open:
with sm_open(uri, "rb") as fin:
    with sm_open(f"s3://{bucket_name}/{file_path}/{file_name}", "wb",
                 transport_params=transport_params) as fout:
        for line in fin:
            fout.write(line)
~0.093 MB/s - I could try chunking like above (see the sketch below), but I wouldn't expect a slowdown of this order of magnitude.
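A chunked variant of that copy would look something like this (just a sketch, reusing the same variables as the snippets above); shutil.copyfileobj performs the same read/write loop:

import shutil

from smart_open import open as sm_open

CHUNK_SIZE = 10 * 1024**2  # 10 MB reads instead of line-by-line iteration

# uri, bucket_name, file_path, file_name and transport_params as in the snippets above
with sm_open(uri, "rb") as fin:
    with sm_open(f"s3://{bucket_name}/{file_path}/{file_name}", "wb",
                 transport_params=transport_params) as fout:
        # copy in fixed-size chunks; the same read/write loop, written out by shutil
        shutil.copyfileobj(fin, fout, length=CHUNK_SIZE)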
Are you able to profile the code to work out where the time-consuming part is? It seems that downloading is slow, because you're using smart_open for the upload in both cases. If so, then we can probably eliminate the upload component altogether, and look for the problem in the download component.
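For example, something like this sketch (not taken from the thread) profiles only the smart_open download and throws the bytes away, so the upload never enters the picture:

import cProfile
import pstats

from smart_open import open as s_open

CHUNK_SIZE = 10 * 1024**2


def download_only(uri):
    # read the HTTP stream via smart_open and discard it - no upload involved
    with s_open(uri, 'rb') as fin:
        while fin.read(CHUNK_SIZE):
            pass


profiler = cProfile.Profile()
profiler.enable()
download_only("https://speed.hetzner.de/100MB.bin")
profiler.disable()

# show the 20 most expensive calls by cumulative time
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)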
Also, make sure compression isn't causing the slowdown: by default, smart_open uses the file extension to transparently handle compression.
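To rule that out, you can switch the inference off explicitly; a quick sketch (the compression keyword should be available in the 6.0.0 reported above, but check your version's documentation):

from smart_open import open as s_open

URL = "https://speed.hetzner.de/100MB.bin"

# 'disable' turns off extension-based compression inference,
# ruling out transparent (de)compression as the bottleneck
with s_open(URL, 'rb', compression='disable') as fin:
    while fin.read(10 * 1024**2):
        pass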