High memory usage in multiple async files upload
- Package Name: azure-storage-blob
- Package Version: 12.12.0, 12.14.0
- Operating System: Linux
- Python Version: 3.6.9, 3.8, 3.10
**Describe the bug**
We are using the azure.storage.blob.aio package to upload multiple files to our storage container. In order to make the upload efficient, we are creating a batch of async upload tasks and executing them all using await asyncio.gather(*tasks).
After some time, we noticed very high memory consumption in the container running this app, and it constantly increases.
I tried to investigate what is using all the memory, and it seems that every execution of the SDK's blob_client.upload_blob adds a few MB to memory without releasing it.
**To Reproduce**
Steps to reproduce the behavior: I was able to reproduce the issue with the following snippet:
```python
import asyncio
import os

from memory_profiler import profile
from azure.storage.blob.aio import ContainerClient

# get_connection_string, CONTAINER_NAME and TEST_FILE are defined elsewhere in our app.

@profile
async def upload(storage_path):
    # A new ContainerClient is created for every upload.
    async with ContainerClient.from_connection_string(conn_str=get_connection_string(), container_name=CONTAINER_NAME) as container_client:
        blob_client = container_client.get_blob_client(blob=storage_path)
        with open(TEST_FILE, 'rb') as file_to_upload:
            await blob_client.upload_blob(file_to_upload, length=os.path.getsize(TEST_FILE), overwrite=True)
        await blob_client.close()

@profile
async def run_multi_upload(n):
    tasks = []
    for i in range(n):
        tasks.append(upload(f"storage_client_memory/test_file_{i}"))
    await asyncio.gather(*tasks)

if __name__ == '__main__':
    asyncio.run(run_multi_upload(100))
```
**Expected behavior**
I was expecting normal memory consumption, since I'm not actively loading anything unusual into memory.
**Screenshots**
I used the memory_profiler package to investigate the high memory consumption; below is its output for the above snippet.
For a single async file upload, we can see that blob_client.upload_blob adds a few MiB to memory:
```
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    18    114.3 MiB    104.3 MiB          76   @profile
    19                                         async def upload(storage_path):
    20    114.3 MiB      3.6 MiB          76       async with ContainerClient.from_connection_string(conn_str=get_connection_string(), container_name=CONTAINER_NAME) as container_client:
    21    114.3 MiB      3.8 MiB          76           blob_client = container_client.get_blob_client(blob=storage_path)
    22    114.3 MiB      0.0 MiB          76           with open(TEST_FILE, 'rb') as file_to_upload:
    23    125.6 MiB     13.8 MiB         474               await blob_client.upload_blob(file_to_upload, length=os.path.getsize(TEST_FILE), overwrite=True)
    24    125.6 MiB      0.0 MiB         172           await blob_client.close()
```
And in total, the await asyncio.gather(*tasks) call adds 24.9 MiB:
```
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    25     99.8 MiB     99.8 MiB           1   @profile
    26                                         async def run_multi_upload(n):
    27     99.8 MiB      0.0 MiB           1       tasks = []
    28     99.8 MiB      0.0 MiB         101       for i in range(n):
    29     99.8 MiB      0.0 MiB         100           tasks.append(upload(f"storage_client_memory/test_file_{i}"))
    30    124.7 MiB     24.9 MiB           2       await asyncio.gather(*tasks)
```
**Additional context**
My app runs in a Kubernetes cluster as a sidecar container and constantly uploads files from the cluster to our storage. I'm running the uploads in batches of 20 async tasks (a sketch of that batching pattern is shown below). After 30 minutes, during which I uploaded ~30,000 files as part of a stress test, the container's memory consumption reached 1.5 GB.
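For reference, the batching looks roughly like this; the asyncio.Semaphore-based limiting shown here is an illustrative sketch rather than the exact production code (`upload` is the coroutine from the snippet above):

```python
import asyncio

CONCURRENCY = 20  # batch size used in the stress test

async def bounded_upload(semaphore: asyncio.Semaphore, storage_path: str):
    # The semaphore caps the number of uploads in flight at CONCURRENCY.
    async with semaphore:
        await upload(storage_path)

async def upload_all(paths):
    semaphore = asyncio.Semaphore(CONCURRENCY)
    await asyncio.gather(*(bounded_upload(semaphore, p) for p in paths))
```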
Hi @morpel - Thanks for the detailed report! We'll investigate asap!
Hi @morpel, just wanted to update that we are still investigating this and trying to reproduce the issue on our end. Will update when we know more. Thanks.
Hi again @morpel, we were able to reproduce your findings on our side and did some investigation to determine what was happening. While we are still investigating the root cause, we believe we have a recommendation that should solve this memory issue for you.
Ultimately, this seems to be caused by creating a new client instance for each request. We generally recommend re-using a single client instance across your application, and in this case that should also help with memory. Here is what that might look like in your sample:
```python
import asyncio
import os

from azure.storage.blob.aio import ContainerClient

# get_connection_string, CONTAINER_NAME and TEST_FILE are defined as in the original snippet.

async def upload(client: ContainerClient, storage_path):
    with open(TEST_FILE, 'rb') as file_to_upload:
        await client.upload_blob(storage_path, file_to_upload, length=os.path.getsize(TEST_FILE), overwrite=True)

async def run_multi_upload(n):
    # A single ContainerClient is shared by all upload tasks.
    async with ContainerClient.from_connection_string(conn_str=get_connection_string(), container_name=CONTAINER_NAME) as container_client:
        tasks = []
        for i in range(n):
            tasks.append(upload(container_client, f"storage_client_memory/test_file_{i}"))
        await asyncio.gather(*tasks)

if __name__ == '__main__':
    asyncio.run(run_multi_upload(100))
```
Note: I also switched to calling upload_blob directly on the ContainerClient to avoid creating a separate BlobClient, but that may or may not be necessary.
We are still investigating why creating a new client for each request leads to increased memory usage, but part of it definitely seems to be related to garbage collection. Creating a new client each time leaves a lot of stale memory just waiting to be garbage collected, and without high memory pressure it may not be collected right away. We did see that forcing garbage collection helped (a sketch of what we tried is shown below), but this wouldn't really be recommended in a production scenario. There also seems to be another piece to the puzzle that we are still investigating, as forcing garbage collection didn't completely solve the issue.
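For completeness, "forcing garbage collection" in our testing meant something like the following; placing the collection after each batch is an assumption for illustration, and this is a diagnostic workaround, not a recommended fix:

```python
import asyncio
import gc

async def run_multi_upload_with_gc(n):
    # Same batching as before, using upload() from the earlier snippet.
    tasks = [upload(f"storage_client_memory/test_file_{i}") for i in range(n)]
    await asyncio.gather(*tasks)
    # Diagnostic only: a forced full collection reclaims stale client/transport
    # objects immediately instead of waiting for memory pressure.
    gc.collect()
```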
Anyway, hopefully this recommendation to re-use your client instance can help in your scenario. Please give it a try and let us know how it goes. Thanks.
Thank you @jalauzon-msft .
So how would you recommend using the ContainerClient in an app that runs constantly and uploads files to storage? Should I create the ContainerClient instance on startup and never close it? Otherwise I would have to recreate it for each upload, which seems like it would cause the same memory issues.
@jalauzon-msft Can you also check back at #27320?
That example uses a single ContainerClient, and calls download_blob() directly without creating a BlobClient, yet it still gobbles gigabytes of memory for ~100k blobs.
Hi @morpel, creating a single ContainerClient instance on startup and using it for all your requests should be fine and is the recommended approach. This should work across threads as well. A minimal sketch of that pattern is shown below.
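As an illustration of that recommendation (a sketch under assumptions, not code from the SDK docs), a long-running worker might create one client at startup and close it at shutdown like this; next_file_to_upload is a hypothetical helper standing in for the app's real work source:

```python
import asyncio
import os

from azure.storage.blob.aio import ContainerClient

# get_connection_string and CONTAINER_NAME are as in the earlier snippets.

async def main():
    # One client for the lifetime of the app; the async context manager
    # closes the underlying transport cleanly on shutdown.
    async with ContainerClient.from_connection_string(
            conn_str=get_connection_string(), container_name=CONTAINER_NAME) as client:
        while True:  # stand-in for the app's real work loop
            path = await next_file_to_upload()  # hypothetical helper
            with open(path, 'rb') as data:
                await client.upload_blob(
                    os.path.basename(path), data,
                    length=os.path.getsize(path), overwrite=True)

if __name__ == '__main__':
    asyncio.run(main())
```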