azure-sdk-for-python

High memory usage in multiple async file download

Open tboerstad opened this issue 2 years ago • 4 comments

  • Package Name: azure-storage-blob
  • Package Version: 12.14.1
  • Operating System: linux, windows
  • Python Version: 3.7.15, 3.11

Describe the bug: We are using azure.storage.blob.aio to download a large number (~50k) of small (100 kB) blobs. The memory usage of our program keeps growing over time (exceeding 1 GB), and it seems to be related to the BlobClient.

To Reproduce: The script below starts up to 75 concurrent blob downloads. It consistently uses more and more memory as the program iterates through all blobs (>100k) in a container.

from pathlib import Path
import asyncio

from azure.storage.blob.aio import ContainerClient

base_folder = Path("C:/Temp/Azure")
container_url = "MY_CONTAINER_URL"

async def download_blob(blob_client, blob_name):
    dest_file = base_folder / blob_name
    dest_file.parent.mkdir(parents=True, exist_ok=True)
    with open(dest_file, "wb") as fp:
        stream = await blob_client.download_blob()
        data = await stream.readall()
        fp.write(data)

async def main():
    background_tasks = set()
    sem = asyncio.Semaphore(75)
    async with ContainerClient.from_container_url(container_url) as cc:
        async for blob_name in cc.list_blob_names():
            async with cc.get_blob_client(blob_name) as blob_client:
                await sem.acquire()
                task = asyncio.create_task(download_blob(blob_client, blob_name))
                background_tasks.add(task)
                task.add_done_callback(background_tasks.discard)
                task.add_done_callback(lambda x: sem.release())

    await asyncio.gather(*background_tasks)


asyncio.run(main())

tboerstad, Nov 06 '22 20:11

cc @swathipil This seems related to #27023

tboerstad, Nov 06 '22 20:11

Thanks for the feedback, we’ll investigate asap.

xiangyan99, Nov 07 '22 17:11

I’m no longer convinced the example accurately depicts the error.

I was creating tasks at a rapid pace, and even though there were a maximum of 75 concurrent downloads, there were many more paused tasks (with BlobClients in them) presumably consuming memory.
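To make the distinction concrete, here is a minimal sketch (not taken from the code in this issue) of the two patterns. It assumes a hypothetical `fake_download` stand-in instead of real Azure calls, and the helper names `eager` and `bounded` are illustrative only. In the eager variant every task is created up front and waits on the semaphore itself, so the number of live tasks equals the number of items; in the bounded variant the producer acquires the semaphore before creating each task, so the number of live tasks never exceeds the limit.

```python
import asyncio


async def fake_download() -> bytes:
    # Hypothetical stand-in for a blob download; no Azure calls are made.
    await asyncio.sleep(0.001)
    return b"x" * 1024


async def eager(n_items: int, limit: int) -> int:
    """Create every task up front; each task waits on the semaphore itself."""
    sem = asyncio.Semaphore(limit)
    alive = peak = 0

    async def worker() -> None:
        async with sem:  # tasks beyond `limit` are parked here, still alive
            await fake_download()

    def on_done(_task: asyncio.Task) -> None:
        nonlocal alive
        alive -= 1

    tasks = []
    for _ in range(n_items):
        task = asyncio.create_task(worker())
        task.add_done_callback(on_done)
        tasks.append(task)
        alive += 1
        peak = max(peak, alive)
    await asyncio.gather(*tasks)
    return peak


async def bounded(n_items: int, limit: int) -> int:
    """Acquire the semaphore before creating a task, so at most `limit` exist."""
    sem = asyncio.Semaphore(limit)
    alive = peak = 0

    def on_done(_task: asyncio.Task) -> None:
        nonlocal alive
        alive -= 1
        sem.release()  # frees a slot so the producer loop can continue

    tasks = set()
    for _ in range(n_items):
        await sem.acquire()  # the producer pauses here once `limit` tasks are alive
        task = asyncio.create_task(fake_download())
        task.add_done_callback(on_done)
        tasks.add(task)
        alive += 1
        peak = max(peak, alive)
    await asyncio.gather(*tasks)
    return peak


if __name__ == "__main__":
    print("peak live tasks, eager:  ", asyncio.run(eager(200, 10)))
    print("peak live tasks, bounded:", asyncio.run(bounded(200, 10)))
```

If each parked task also holds a client object or a buffer, the eager pattern keeps all of them in memory at once, which would be consistent with the growth described above.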

tboerstad, Nov 07 '22 20:11

Hi @tboerstad Thomas, thanks for the info. My current thinking is that you may be correct in saying this is related to #27023, which we are actively investigating. For now, we'll treat this as the same issue. Please follow the other issue for updates.

jalauzon-msft, Nov 08 '22 19:11

Hi @tboerstad Thomas, I tried to reproduce the runaway memory usage you were seeing by cloning your scenario as closely as I could, but unfortunately I was unable to reproduce the issue. I set up a container with 50k 100 KiB blobs and ran your exact script to download them and copy them to a local directory. The memory usage stayed fairly constant for me at around 70 MiB and didn't grow beyond that during the run.

> I was creating tasks at a rapid pace, and even though there were a maximum of 75 concurrent downloads, there were many more paused tasks (with BlobClients in them) presumably consuming memory.

I think this sounds like it could be the cause for the memory usage. How were you measuring the paused tasks? I can try the same thing during my test to see if I get similar results.
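One possible way to take such a measurement, offered as a sketch rather than as the method used by anyone in this thread, is to run a small sampler task alongside the workload. It records the number of unfinished tasks reported by `asyncio.all_tasks()` and the bytes currently traced by `tracemalloc`. The names `sample_tasks` and `demo` are hypothetical, and `asyncio.sleep` stands in for the real downloads.

```python
import asyncio
import contextlib
import tracemalloc


async def sample_tasks(interval: float, samples: list) -> None:
    """Periodically record (unfinished task count, currently traced bytes)."""
    while True:
        # all_tasks() returns the not-yet-finished tasks of the running loop,
        # including this sampler and the main task.
        pending = len(asyncio.all_tasks())
        current, _peak = tracemalloc.get_traced_memory()
        samples.append((pending, current))
        await asyncio.sleep(interval)


async def demo(n_tasks: int = 50) -> list:
    tracemalloc.start()
    samples: list = []
    sampler = asyncio.create_task(sample_tasks(0.01, samples))

    # Stand-in workload: replace with the real download tasks.
    work = [asyncio.create_task(asyncio.sleep(0.05)) for _ in range(n_tasks)]
    await asyncio.gather(*work)

    # Stop the sampler and swallow the resulting cancellation.
    sampler.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await sampler
    tracemalloc.stop()
    return samples


if __name__ == "__main__":
    for pending, traced in asyncio.run(demo()):
        print(f"unfinished tasks: {pending:4d}  traced memory: {traced} bytes")
```

Note that `tracemalloc` only sees allocations made through Python's allocator after tracing starts, so the process-level figure reported by the operating system can differ.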

jalauzon-msft, Nov 19 '22 00:11

Thank you @jalauzon-msft for the investigation. Your findings make sense; I see the same behaviour on my end.

I expect that removing the Semaphore will reproduce the other behaviour.

tboerstad, Nov 19 '22 11:11

Hi @tboerstad Thomas, sorry, but do you still suspect an issue with the blob SDK here, or do you think it's something else? I'm not sure which behavior you are referring to when you mention removing the semaphore, but with the sample code as is, I wasn't able to see a memory issue.

jalauzon-msft, Nov 22 '22 00:11

@jalauzon-msft I have tested again, and I am also unable to see any memory issue. If I have something in the code I'm working on, it must be unrelated to the Blob SDK.

Thank you for your clarifications; you have helped solve my issue.

tboerstad, Nov 23 '22 10:11