azure-sdk-for-python
Performance worsens greatly when downloading blobs with long breaks in between
- Package Name: azure-storage-blob
- Package Version: 12.6.0 (but it also happens on 12.13.1)
- Operating System: Linux (but it doesn't really matter, it happens on my Mac too)
- Python Version: 3.8
**Describe the bug**
When using the asyncio SDK, if we wait more than 10 seconds between blob downloads, the download time becomes 10-100 times slower.
**To Reproduce**
I wrote this script to demonstrate the issue:
```python
import random
from asyncio import gather, sleep, run
from time import perf_counter

from azure.storage.blob.aio import BlobServiceClient, ContainerClient

all_times = []

async def download(client: ContainerClient, path: str):
    s = perf_counter()
    downloader = await client.download_blob(path)
    await downloader.readall()
    e = (perf_counter() - s) * 1000
    all_times.append(e)
    print(f"download took {e}ms file={path}")

def get_url_and_key():
    ### censored for security reasons ###
    ...

async def do():
    account_url, account_key = get_url_and_key()
    blob_client = BlobServiceClient(account_url, account_key)
    container_client = blob_client.get_container_client('some-container')
    file_pool = []
    async for x in container_client.list_blobs():
        if len(file_pool) > 50:
            break
        file_pool.append(x.name)
    for i in range(30):
        files = set()
        while len(files) < 3:
            files.add(file_pool[random.randint(0, len(file_pool) - 1)])
        await gather(*[download(container_client, file) for file in files])
        await sleep(20)
    await container_client.close()
    await blob_client.close()
    print(f"average is {sum(all_times) / len(all_times)}ms")

run(do())
```
What this script essentially does:
- find the names of 50 blobs that exist in the container
- randomly choose 3 of them to download
- download them asynchronously (gather 3 tasks) and print the download times
- sleep a certain amount of seconds
- repeat (30 times) and print the average download time
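(As an aside, the "randomly choose 3 distinct blobs" step can be written in one call with `random.sample`, which picks k distinct items directly instead of retrying until a set reaches the desired size. A minimal sketch, using a made-up pool of names:)

```python
import random

# Hypothetical stand-in for the blob names listed from the container.
file_pool = [f"blob-{i}" for i in range(50)]

# random.sample returns 3 distinct items from the pool in one call,
# replacing the while-loop that adds to a set until it has 3 members.
files = random.sample(file_pool, 3)

print(files)
```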
If the sleep time is set to 10 seconds or below, I get the expected low download times:
```
download took 23.274756036698818ms file=+1b2+N3X7cgkoaajoUSlCA==
download took 31.523440033197403ms file=+/K/vccgIKxIsFW6tk8l3Q==
download took 44.199715834110975ms file=+/dVLONMzj3tMtql+Zv7Jg==
download took 33.404441084712744ms file=+/UeV81yZRT+Eh5jeUKlOA==
. . .
download took 12.770875822752714ms file=+16GIyseiqsaO8T+mrXLQg==
download took 9.113383013755083ms file=+1YF6MfP5Xb+sWEqgr36LQ==
download took 7.825685199350119ms file=+/ZJM5EQ40nni2+zKjJA4A==
download took 7.158686872571707ms file=+1b2+N3X7cgkoaajoUSlCA==
average is 20.692046359181404ms
```
But if I set it to 20 seconds, for example, the download times become extremely high (beyond the first batch):
```
download took 9.756594430655241ms file=+0huQr/6tn7zw5+p2fT/2A==
download took 20.794587209820747ms file=+1b2+N3X7cgkoaajoUSlCA==
download took 45.60287203639746ms file=+1h1V/YFgN130/9kzcIygQ==
download took 885.5280890129507ms file=++nRJgH2HUF30/9kzcIygQ==
download took 884.8122912459075ms file=+1YF6MfP5Xb+sWEqgr36LQ==
download took 905.2660423330963ms file=+/soiSTB/5Wb0zHn28+O6Q==
. . .
download took 930.0287216901779ms file=+1ZAwh3CRGuuBvav8e/34w==
download took 927.8017273172736ms file=+1QWh0iEdgVt/VX2PlPDpw==
download took 940.912198740989ms file=+/FxOL5ySAQaO8T+mrXLQg==
download took 639.4636919721961ms file=++O8gZmF/40FPYL5+VRhAA==
download took 664.1455427743495ms file=+/soiSTB/5Wb0zHn28+O6Q==
download took 669.6559321135283ms file=++zejjUIwmWuBvav8e/34w==
average is 463.2759254503374ms
```
I've run this countless times with different sleep times, on my own machine (macOS) and on the production machine (Linux), at different traffic loads and with different storage accounts, and always got the same results, so we can rule out load on the Azure Blob Storage side or file-size variance.
It seems that after a certain amount of idle time the connection to Azure Blob Storage dies and takes a long time to revive. This looks like a bug to me, but if there is a way to keep the connection alive through, say, 3 minutes of inactivity (eliminating the need to revive it), that would solve the issue for now.
We moved to the async SDK to download many files concurrently and reduce our response time, but the opposite seems to have happened: response time greatly increased, to the point that this SDK is unusable for us. Quick assistance would be greatly appreciated.
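(One generic way to keep a connection from going idle, independent of any SDK knob, is a background task that periodically awaits a cheap request. A minimal asyncio sketch, where `ping` is a hypothetical stand-in for a lightweight call such as fetching a known blob's properties:)

```python
import asyncio

async def keep_alive(ping, interval: float) -> None:
    """Await `ping()` every `interval` seconds so the underlying
    connection never sits idle long enough to be closed.
    Cancel the task to stop."""
    while True:
        await asyncio.sleep(interval)
        await ping()

async def main():
    pings = 0

    async def ping():
        # In the real scenario this would be a cheap request against the
        # storage account (placeholder here: just count the calls).
        nonlocal pings
        pings += 1

    task = asyncio.create_task(keep_alive(ping, interval=0.01))
    await asyncio.sleep(0.05)  # stands in for the long idle period
    task.cancel()
    print(f"pings sent while idle: {pings}")

asyncio.run(main())
```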
Hi @asafalon1 thank you for the feedback, we'll get back to you asap.
Update: I implemented the same logic with the non-async SDK using a thread pool executor, and the results were great, further confirming that the issue is a bug in the async SDK and not some system or usage issue.
```python
import random
from asyncio import gather, sleep, run, get_event_loop
from concurrent.futures import ThreadPoolExecutor
from contextvars import ContextVar
from uuid import uuid4
from time import perf_counter
from typing import Tuple

from azure.storage.blob import BlobServiceClient, ContainerClient

all_times = []
client: ContextVar[Tuple[str, ContainerClient]] = ContextVar("client")

def get_url_and_key():
    ### censored for security reasons ###
    ...

def create_client():
    account_url, account_key = get_url_and_key()
    blob_client = BlobServiceClient(account_url, account_key)
    client_id = str(uuid4())
    client.set((client_id, blob_client.get_container_client('some-container')))
    print(f"client_created {client_id}")

executor = ThreadPoolExecutor(max_workers=10, initializer=create_client)

def get_file_pool():
    account_url, account_key = get_url_and_key()
    blob_client = BlobServiceClient(account_url, account_key)
    container_client = blob_client.get_container_client('some-container')
    file_pool = []
    for x in container_client.list_blobs():
        if len(file_pool) > 50:
            break
        file_pool.append(x.name)
    container_client.close()
    blob_client.close()
    return file_pool

def download(path: str):
    s = perf_counter()
    client_id, blob_client = client.get()
    blob_client.download_blob(path).readall()
    e = (perf_counter() - s) * 1000
    all_times.append(e)
    print(f"download took {e}ms file={path} client_id={client_id}")

async def do():
    file_pool = get_file_pool()
    loop = get_event_loop()
    for i in range(20):
        files = set()
        while len(files) < 3:
            files.add(file_pool[random.randint(0, len(file_pool) - 1)])
        await gather(*[loop.run_in_executor(executor, download, file) for file in files])
        await sleep(20)
    print(f"average is {sum(all_times) / len(all_times)}ms")

run(do())
```
```
client_created 3a44a5dd-f587-49b0-b35d-8bad288f1eb3
client_created 34f6c710-133d-45fa-9066-d38e2088802d
client_created 14ac5d66-128e-459c-99f5-db5406db4ee8
download took 40.091779083013535ms file=+1b2+N3X7cgkoaajoUSlCA== client_id=3a44a5dd-f587-49b0-b35d-8bad288f1eb3
download took 33.937398344278336ms file=++gHmm6FBqPRU3BlU4/rJQ== client_id=14ac5d66-128e-459c-99f5-db5406db4ee8
download took 93.85131718590856ms file=+/dVLONMzj3tMtql+Zv7Jg== client_id=34f6c710-133d-45fa-9066-d38e2088802d
download took 13.473261147737503ms file=+1b2+N3X7cgkoaajoUSlCA== client_id=3a44a5dd-f587-49b0-b35d-8bad288f1eb3
download took 12.802662793546915ms file=+0/fKSukSOuuBvav8e/34w== client_id=14ac5d66-128e-459c-99f5-db5406db4ee8
. . .
download took 14.066467061638832ms file=++83HDhOOtxKVTQz0XMVjg== client_id=14ac5d66-128e-459c-99f5-db5406db4ee8
download took 57.9378642141819ms file=+/dVLONMzj3tMtql+Zv7Jg== client_id=3a44a5dd-f587-49b0-b35d-8bad288f1eb3
download took 25.819737929850817ms file=+0huQr/6tn7zw5+p2fT/2A== client_id=34f6c710-133d-45fa-9066-d38e2088802d
download took 26.553436182439327ms file=+1tWvJpKpXdKVTQz0XMVjg== client_id=14ac5d66-128e-459c-99f5-db5406db4ee8
download took 29.68962863087654ms file=+/K/vccgIKxIsFW6tk8l3Q== client_id=3a44a5dd-f587-49b0-b35d-8bad288f1eb3
average is 36.957062470416226ms
```
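(The per-thread client trick in the script above — a `ContextVar` populated by the pool's `initializer` and read inside each task — can be isolated into a small stdlib-only sketch. Here the "client" is just a UUID string standing in for a real `ContainerClient`:)

```python
from concurrent.futures import ThreadPoolExecutor
from contextvars import ContextVar
from uuid import uuid4

# Each worker thread gets its own "client" via the pool initializer;
# a real ContainerClient would go here instead of a plain id string.
client: ContextVar[str] = ContextVar("client")

def create_client() -> None:
    client.set(str(uuid4()))  # runs exactly once per worker thread

def which_client(_) -> str:
    # Runs inside a worker thread, so it sees that thread's value.
    return client.get()

executor = ThreadPoolExecutor(max_workers=3, initializer=create_client)
ids = set(executor.map(which_client, range(30)))
executor.shutdown()

# With 3 workers, at most 3 distinct client ids appear across 30 calls.
print(len(ids))
```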
Hi @asafalon1, thanks for the report. We have been able to reproduce this on our end but the cause is not obvious so we will investigate and get back to you. Thanks!
Hi @asafalon1, we have done some fairly extensive investigation into this issue, and we believe this is mainly being caused by TCP Connection re-use in the aiohttp layer.
When a TCP connection is first opened to the server, it takes additional time to establish. If configured, an existing TCP connection can also be kept open (alive) for a period of time, during which it can be re-used without being re-established. For aiohttp, the HTTP library behind our async library, the default value for this idle timeout is 15 seconds. So, when you sleep for 20 seconds, the TCP connection that was opened by the first set of requests is closed, and it takes time to re-establish a connection for the next set of downloads. Additionally, my guess for why this affects all concurrent downloads is that there is likely one connection per concurrent request in this case.
What we don't fully understand yet is why the additional download time is not fixed and seems to scale with the blob size. The time taken to establish a TCP connection should be fixed but it seems there is something additional happening in aiohttp causing this additional time to scale with the response size.
The good news is that this idle timeout is configurable on the aiohttp side, and we have found that increasing it beyond how long you are sleeping seems to fix the issue. Currently the best way to configure this is to provide your own configured ClientSession object to the BlobServiceClient, like this:
```python
from aiohttp import ClientSession, TCPConnector

connector = TCPConnector(keepalive_timeout=120)
session = ClientSession(connector=connector)
blob_client = BlobServiceClient(account_url, account_key, session=session)
```
The one caveat is that you are no longer using the default session that azure-core provides for you, which may mean missing out on some of our default configuration. Currently we don't add much configuration beyond the defaults, so it may not be a problem. You can see where azure-core constructs the ClientSession here, so you may want to consider adding that configuration as well. We are currently thinking about ways to make this configuration easier to specify.
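(Putting the pieces together, here is a hedged sketch of the workaround with explicit cleanup, assuming the same `account_url`/`account_key` values as in the scripts above. Since azure-core did not create this session, the caller is responsible for closing it. This is a configuration sketch, not a tested implementation:)

```python
from aiohttp import ClientSession, TCPConnector
from azure.storage.blob.aio import BlobServiceClient

async def run_downloads(account_url: str, account_key: str) -> None:
    # Keep idle connections open for 2 minutes instead of aiohttp's
    # 15-second default, so pauses between download batches don't
    # force a fresh TCP handshake.
    connector = TCPConnector(keepalive_timeout=120)
    session = ClientSession(connector=connector)
    blob_client = BlobServiceClient(account_url, account_key, session=session)
    try:
        ...  # downloads go here
    finally:
        await blob_client.close()
        # We own this session, so azure-core will not close it for us.
        await session.close()
```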
Hopefully that information helps you. Please let me know if you have any further questions. Thanks.
CC @xiangyan99 and @annatisch
Hi @asafalon1. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text “/unresolve” to remove the “issue-addressed” label and continue the conversation.
Hi @asafalon1, since you haven’t asked that we “/unresolve” the issue, we’ll close this out. If you believe further discussion is needed, please add a comment “/unresolve” to reopen the issue.