azure-sdk-for-python
High memory usage in multiple async file download
- Package Name: azure-storage-blob
- Package Version: 12.14.1
- Operating System: linux, windows
- Python Version: 3.7.15, 3.11
Describe the bug
We are using the azure.storage.blob.aio module to download a lot (~50k) of small (100 kB) blobs.
The memory usage (> 1 GB) of our program increases indefinitely over time, and it seems to be related to the BlobClient.
To Reproduce
Steps to reproduce the behavior: The script below starts up to 75 concurrent blob downloads. It consistently uses more and more memory as the program iterates through all blobs (>100k) in a container.
from pathlib import Path
import asyncio

from azure.storage.blob.aio import ContainerClient

base_folder = Path("C:/Temp/Azure")
container_url = "MY_CONTAINER_URL"


async def download_blob(blob_client, blob_name):
    dest_file = base_folder / blob_name
    dest_file.parent.mkdir(parents=True, exist_ok=True)
    with open(dest_file, "wb") as fp:
        stream = await blob_client.download_blob()
        data = await stream.readall()
        fp.write(data)


async def main():
    background_tasks = set()
    sem = asyncio.Semaphore(75)
    async with ContainerClient.from_container_url(container_url) as cc:
        async for blob_name in cc.list_blob_names():
            async with cc.get_blob_client(blob_name) as blob_client:
                await sem.acquire()
                task = asyncio.create_task(download_blob(blob_client, blob_name))
                background_tasks.add(task)
                task.add_done_callback(background_tasks.discard)
                task.add_done_callback(lambda x: sem.release())
        await asyncio.gather(*background_tasks)


asyncio.run(main())
cc @swathipil This seems related to #27023
Thanks for the feedback, we’ll investigate asap.
I’m no longer convinced the example accurately depicts the error.
I was creating tasks at a rapid pace, and even though there were a maximum of 75 concurrent downloads, there were many more paused tasks (with BlobClients in them) presumably consuming memory.
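To illustrate the distinction described here, the following is a minimal sketch that makes no Azure calls. It assumes a hypothetical fake_download coroutine as a stand-in for a real blob download, and compares two ways of limiting concurrency: creating one task per item and acquiring a semaphore inside the task (many tasks exist at once, most of them paused), versus a fixed pool of workers (only as many tasks as the limit). The function names and the task-counting approach are illustrative assumptions, not the code from the affected program.

```python
import asyncio


async def fake_download(name: str) -> None:
    # Hypothetical stand-in for a real blob download; no network access.
    await asyncio.sleep(0)


async def one_task_per_item(names, limit):
    """Create every task up front; the semaphore only limits running bodies."""
    sem = asyncio.Semaphore(limit)

    async def worker(name):
        async with sem:
            await fake_download(name)

    tasks = [asyncio.create_task(worker(n)) for n in names]
    # Count live tasks right after creation, excluding the current one.
    live = len(asyncio.all_tasks()) - 1
    await asyncio.gather(*tasks)
    return live


async def fixed_worker_pool(names, limit):
    """Start only `limit` workers that pull names from a shared queue."""
    queue = asyncio.Queue()
    for n in names:
        queue.put_nowait(n)

    async def worker():
        while True:
            try:
                name = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            await fake_download(name)

    tasks = [asyncio.create_task(worker()) for _ in range(limit)]
    live = len(asyncio.all_tasks()) - 1
    await asyncio.gather(*tasks)
    return live


if __name__ == "__main__":
    names = [f"blob-{i}" for i in range(1000)]
    print("one task per item:", asyncio.run(one_task_per_item(names, 75)))
    print("fixed worker pool:", asyncio.run(fixed_worker_pool(names, 75)))
```

In the first variant every paused task keeps its arguments alive until it runs, so if each task holds a client object, the number of live clients scales with the number of items rather than with the concurrency limit.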
Hi @tboerstad Thomas, thanks for the info. My current thinking is that you may be correct in saying this is related to #27023, which we are actively investigating. For now, we'll treat this as the same issue. Please follow that issue for updates.
Hi @tboerstad Thomas, I tried to reproduce the runaway memory usage you were seeing by cloning your scenario as closely as I could, but unfortunately I was unable to reproduce the issue. I set up a container with 50k 100 KiB blobs and ran your exact script to download them and copy them to a local directory. The memory usage stayed fairly constant for me at around 70 MiB and didn't grow beyond that during the run.
I was creating tasks at a rapid pace, and even though there were a maximum of 75 concurrent downloads, there were many more paused tasks (with BlobClients in them) presumably consuming memory.
I think this sounds like it could be the cause for the memory usage. How were you measuring the paused tasks? I can try the same thing during my test to see if I get similar results.
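One possible way to take such a measurement, sketched here using only the standard library, is to sample asyncio.all_tasks() (which returns the tasks that are not yet finished) alongside tracemalloc.get_traced_memory() from a background monitor task. The monitor and demo functions and the sleep-based workload are hypothetical placeholders, not how either party in this thread actually measured.

```python
import asyncio
import tracemalloc


def snapshot():
    """Return (unfinished task count, current traced bytes, peak traced bytes)."""
    pending = len(asyncio.all_tasks())  # must be called from a running loop
    current, peak = tracemalloc.get_traced_memory()
    return pending, current, peak


async def monitor(interval, samples):
    """Append a snapshot to `samples` every `interval` seconds until cancelled."""
    while True:
        samples.append(snapshot())
        await asyncio.sleep(interval)


async def demo():
    tracemalloc.start()
    samples = []
    mon = asyncio.create_task(monitor(0.01, samples))
    # Placeholder workload: 100 tasks that just sleep briefly.
    work = [asyncio.create_task(asyncio.sleep(0.05)) for _ in range(100)]
    await asyncio.gather(*work)
    mon.cancel()
    try:
        await mon
    except asyncio.CancelledError:
        pass
    tracemalloc.stop()
    return samples


if __name__ == "__main__":
    for pending, current, peak in asyncio.run(demo()):
        print(f"tasks={pending} current={current} B peak={peak} B")
```

Note that tracemalloc only tracks allocations made through Python's allocators, so the process-level memory reported by the operating system can be higher than these figures.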
Thank you @jalauzon-msft for the investigation. Your findings make sense, I see the same behaviour on my end.
I expect that removing the Semaphore will reproduce the other behaviour.
Hi @tboerstad Thomas, sorry but do you still suspect an issue with the blob SDK here or do you think it's something else? I'm not sure which behavior you are referencing when you mention removing the semaphore but with the sample code as is, I wasn't able to see a memory issue.
@jalauzon-msft I have tested again, and I am also unable to see any memory issue. If I have something in the code I'm working on, it must be unrelated to the Blob SDK.
Thank you for your clarifications, you have helped solve my issue.