High memory usage in multiple async files upload
- Package Name: azure-storage-blob
- Package Version: 12.12.0, 12.14.0
- Operating System: Linux
- Python Version: 3.6.9, 3.8, 3.10
**Describe the bug**
We are using the azure.storage.blob.aio package to upload multiple files to our storage container. In order to make the upload efficient, we are creating a batch of async upload tasks and executing them all using await asyncio.gather(*tasks).
After some time, we noticed very high memory consumption in the container running this app, and it constantly increases.
I tried to investigate what is using all the memory, and it seems that every execution of the SDK's blob_client.upload_blob adds a few MB to memory without releasing it.
**To Reproduce**
Steps to reproduce the behavior: I was able to reproduce the issue with the following snippet:
```python
import asyncio
import os

from memory_profiler import profile
from azure.storage.blob.aio import ContainerClient

# get_connection_string, CONTAINER_NAME and TEST_FILE are defined elsewhere in our app.

@profile
async def upload(storage_path):
    # A new ContainerClient is created for every upload.
    async with ContainerClient.from_connection_string(conn_str=get_connection_string(), container_name=CONTAINER_NAME) as container_client:
        blob_client = container_client.get_blob_client(blob=storage_path)
        with open(TEST_FILE, 'rb') as file_to_upload:
            await blob_client.upload_blob(file_to_upload, length=os.path.getsize(TEST_FILE), overwrite=True)
        await blob_client.close()

@profile
async def run_multi_upload(n):
    tasks = []
    for i in range(n):
        tasks.append(upload(f"storage_client_memory/test_file_{i}"))
    await asyncio.gather(*tasks)

if __name__ == '__main__':
    asyncio.run(run_multi_upload(100))
```
**Expected behavior**
I was expecting normal memory consumption, since I'm not actively loading anything unusual into memory.
**Screenshots**
I used the memory_profiler package to investigate the high memory consumption; below is its output for the above snippet.
For a single async file upload, we can see that blob_client.upload_blob adds a few MiB to memory:
```
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    18    114.3 MiB    104.3 MiB          76   @profile
    19                                         async def upload(storage_path):
    20    114.3 MiB      3.6 MiB          76       async with ContainerClient.from_connection_string(conn_str=get_connection_string(), container_name=CONTAINER_NAME) as container_client:
    21    114.3 MiB      3.8 MiB          76           blob_client = container_client.get_blob_client(blob=storage_path)
    22    114.3 MiB      0.0 MiB          76           with open(TEST_FILE, 'rb') as file_to_upload:
    23    125.6 MiB     13.8 MiB         474               await blob_client.upload_blob(file_to_upload, length=os.path.getsize(TEST_FILE), overwrite=True)
    24    125.6 MiB      0.0 MiB         172           await blob_client.close()
```
And in total, the await asyncio.gather(*tasks) call adds 24.9 MiB:
```
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    25     99.8 MiB     99.8 MiB           1   @profile
    26                                         async def run_multi_upload(n):
    27     99.8 MiB      0.0 MiB           1       tasks = []
    28     99.8 MiB      0.0 MiB         101       for i in range(n):
    29     99.8 MiB      0.0 MiB         100           tasks.append(upload(f"storage_client_memory/test_file_{i}"))
    30    124.7 MiB     24.9 MiB           2       await asyncio.gather(*tasks)
```
**Additional context**
My app runs in a Kubernetes cluster as a sidecar container and constantly uploads files from the cluster to our storage. I'm running the uploads in batches of 20 async tasks (a sketch of that batching pattern is shown below). After 30 minutes, during which I uploaded ~30,000 files as part of a stress test, the container's memory consumption reached 1.5 GB.
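For reference, the batching looks roughly like this; the asyncio.Semaphore-based limiting shown here is an illustrative sketch rather than the exact production code (`upload` is the coroutine from the snippet above):

```python
import asyncio

CONCURRENCY = 20  # batch size used in the stress test

async def bounded_upload(semaphore: asyncio.Semaphore, storage_path: str):
    # The semaphore caps the number of uploads in flight at CONCURRENCY.
    async with semaphore:
        await upload(storage_path)

async def upload_all(paths):
    semaphore = asyncio.Semaphore(CONCURRENCY)
    await asyncio.gather(*(bounded_upload(semaphore, p) for p in paths))
```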
Hi @morpel - Thanks for the detailed report! We'll investigate asap!
Hi @morpel, just wanted to update that we are still investigating this and trying to reproduce the issue on our end. Will update when we know more. Thanks.
Hi again @morpel, we were able to reproduce your findings on our side and did some investigation to determine what was happening. While we are still investigating the root cause, we believe we have a recommendation that should solve this memory issue for you.
Ultimately, this seems to be caused by creating a new client instance for each request. We generally recommend re-using a single client instance across your application, and in this case that should also help with memory. Here is what that might look like in your sample:
```python
import asyncio
import os

from azure.storage.blob.aio import ContainerClient

# get_connection_string, CONTAINER_NAME and TEST_FILE are defined as in the original snippet.

async def upload(client: ContainerClient, storage_path):
    with open(TEST_FILE, 'rb') as file_to_upload:
        await client.upload_blob(storage_path, file_to_upload, length=os.path.getsize(TEST_FILE), overwrite=True)

async def run_multi_upload(n):
    # A single ContainerClient is shared by all upload tasks.
    async with ContainerClient.from_connection_string(conn_str=get_connection_string(), container_name=CONTAINER_NAME) as container_client:
        tasks = []
        for i in range(n):
            tasks.append(upload(container_client, f"storage_client_memory/test_file_{i}"))
        await asyncio.gather(*tasks)

if __name__ == '__main__':
    asyncio.run(run_multi_upload(100))
```
Note: I also switched to calling upload_blob directly on the ContainerClient to avoid creating a separate BlobClient, but that may or may not be necessary.
We are still investigating why creating a new client for each request leads to increased memory usage, but part of it definitely seems to be related to garbage collection. Creating a new client each time leaves a lot of stale memory just waiting to be garbage collected, and without high memory pressure it may not be collected right away. We did see that forcing garbage collection helped (a sketch of what we tried is shown below), but this wouldn't really be recommended in a production scenario. There also seems to be another piece to the puzzle that we are still investigating, as forcing garbage collection didn't completely solve the issue.
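For completeness, "forcing garbage collection" in our testing meant something like the following; placing the collection after each batch is an assumption for illustration, and this is a diagnostic workaround, not a recommended fix:

```python
import asyncio
import gc

async def run_multi_upload_with_gc(n):
    # Same batching as before, using upload() from the earlier snippet.
    tasks = [upload(f"storage_client_memory/test_file_{i}") for i in range(n)]
    await asyncio.gather(*tasks)
    # Diagnostic only: a forced full collection reclaims stale client/transport
    # objects immediately instead of waiting for memory pressure.
    gc.collect()
```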
Anyway, hopefully this recommendation to re-use your client instance can help in your scenario. Please give it a try and let us know how it goes. Thanks.
Thank you @jalauzon-msft .
So how would you recommend using the ContainerClient in an app that runs constantly and uploads files to storage? Should I create the ContainerClient instance on startup and never close it? Otherwise I would have to recreate it for each upload, which seems like it would cause the same memory issues.
@jalauzon-msft Can you also check back at #27320?
That example uses a single ContainerClient, and calls download_blob() directly without creating a BlobClient, yet it still gobbles gigabytes of memory for ~100k blobs.
Hi @morpel, creating a single ContainerClient instance on startup and using it for all your requests should be fine and is the recommended approach. This should work across threads as well. A minimal sketch of that pattern is shown below.
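As an illustration of that recommendation (a sketch under assumptions, not code from the SDK docs), a long-running worker might create one client at startup and close it at shutdown like this; next_file_to_upload is a hypothetical helper standing in for the app's real work source:

```python
import asyncio
import os

from azure.storage.blob.aio import ContainerClient

# get_connection_string and CONTAINER_NAME are as in the earlier snippets.

async def main():
    # One client for the lifetime of the app; the async context manager
    # closes the underlying transport cleanly on shutdown.
    async with ContainerClient.from_connection_string(
            conn_str=get_connection_string(), container_name=CONTAINER_NAME) as client:
        while True:  # stand-in for the app's real work loop
            path = await next_file_to_upload()  # hypothetical helper
            with open(path, 'rb') as data:
                await client.upload_blob(
                    os.path.basename(path), data,
                    length=os.path.getsize(path), overwrite=True)

if __name__ == '__main__':
    asyncio.run(main())
```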