
Excessive memory usage on multithreading

Open jbvsmo opened this issue 6 years ago • 34 comments

I have been trying to debug a "memory leak" in my newly upgraded boto3 application. I am moving from the original boto 2.49.

My application starts a pool of 100 threads, and every request is queued and redirected to one of these threads. Typical memory for the lifetime of the application was about 1GB, with peaks of 1.5GB depending on the operation.

After the upgrade I added one boto3.Session per thread, and I access multiple resources and clients from this session, which are reused throughout the code. In the previous code I would have one boto connection of each kind per thread (I use several services like S3, DynamoDB, SES, SQS, MTurk, SimpleDB), so it is pretty much the same thing.

Except that each boto3.Session alone increases memory usage immensely, and now my application is running at 3GB of memory instead.

How do I know it is the boto3 Session, you ask? I created 2 demo experiments with the same 100 threads; the only difference between them is that one uses boto3 and the other does not.

Program 1: https://pastebin.com/Urkh3TDU
Program 2: https://pastebin.com/eDWPcS8C (same thing with the 5 lines regarding boto commented out)
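
The pastebins aren't reproduced here, but as a rough guide, here is a minimal sketch of Program 1's shape, inferred from this thread (the Boto3Thread and check_memory names reappear in a later comment that reuses the harness; the actual S3 calls are elided):

import os
import threading
import time

import boto3
import psutil


class Boto3Thread(threading.Thread):
    daemon = True

    def run(self):
        # One Session and one resource per thread, reused for the thread's lifetime
        session = boto3.Session()
        s3 = session.resource('s3')
        while True:
            time.sleep(1)  # the real script performs S3 operations here


def check_memory():
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1024. / 1024.


for _ in range(100):
    Boto3Thread().start()

while True:
    print('Process Memory: {:.1f} MB'.format(check_memory()))
    time.sleep(5)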

Output program 1 (each print happens 5 seconds after the last one):

Process Memory: 39.4 MB
Process Memory: 261.7 MB
Process Memory: 518.7 MB
Process Memory: 788.2 MB
Process Memory: 944.5 MB
Process Memory: 940.1 MB
Process Memory: 944.4 MB
Process Memory: 948.7 MB
Process Memory: 959.1 MB
Process Memory: 957.4 MB
Process Memory: 958.0 MB
Process Memory: 959.5 MB

Now with plain multiple threads and no AWS access. Output program 2 (each print happens 5 seconds after the last one):

Process Memory: 23.5 MB
Process Memory: 58.7 MB
Process Memory: 58.7 MB
Process Memory: 58.7 MB
Process Memory: 58.7 MB
Process Memory: 58.7 MB
Process Memory: 58.7 MB
Process Memory: 58.7 MB
Process Memory: 58.7 MB
Process Memory: 58.7 MB

The boto3 Session object alone is retaining 10MB per thread, about 1GB in total. This is not acceptable for an object that should not be doing much more than making requests to the AWS servers. It means that the Session is keeping lots of unwanted information.

You could be wondering whether it is the resource that is keeping memory alive. If you move the resource creation inside the for loop, the program will also hit the 1GB mark in exactly the same 15 to 20 seconds of existence.

In the beginning I tried garbage collecting cyclic references, but it was futile. The decrease in memory was only a couple of megabytes.

I've seen people complaining about something similar (maybe not!) on the botocore project, so it might be a shared issue: https://github.com/boto/botocore/issues/805

jbvsmo avatar Aug 21 '18 12:08 jbvsmo

I forgot to mention that I added cyclic garbage collection to the 5-second loop that displays the memory. If this is removed, the memory increases even more (and it doesn't seem to stop), which means someone is also leaking circular references.

Now I noticed something even worse: if I create a new session inside the loop, the memory usage is even higher, even with the garbage collection in place.

The program I linked is simple enough, and yet the memory issues are so visible that I'm wondering whether no one saw this before or whether it is related to some recent boto version.

boto3: 1.7.71 botocore: 1.10.71

Program output (https://pastebin.com/Nm4dWPKJ):

Process Memory: 23.6 MB
Process Memory: 234.5 MB
Process Memory: 470.6 MB
Process Memory: 719.7 MB
Process Memory: 994.3 MB
Process Memory: 1144.7 MB
Process Memory: 1129.9 MB
Process Memory: 1160.5 MB
Process Memory: 1222.5 MB
Process Memory: 1200.5 MB
Process Memory: 1176.4 MB
Process Memory: 1173.8 MB
Process Memory: 1200.2 MB
Process Memory: 1342.9 MB
Process Memory: 1341.3 MB

jbvsmo avatar Aug 21 '18 14:08 jbvsmo

Some more investigation (sorry for so much noise):

There are probably two issues here:

  1. A memory leak on any Python version and any boto3 version when Sessions are created inside a loop
  2. Very high memory usage on Python 2.7, and high (but acceptable) on 3.7

I was initially only testing Python 2.7.15, but now that I have also run the program on Python 3.7.0, the memory usage is about half (500MB) with or without cyclic garbage collection, which is great.

On Python 3, the leak still happens if I create the session within the for loop on every thread! It's just that the increase in memory is slower this time.

I decided to test older boto3 versions (from 1.0 to 1.7) with Python 2.7, and they all show the leaking pattern when the session is created inside a loop, BUT on boto3 1.5 and lower the memory usage is 100MB lower, and on boto3 1.2 and lower the memory takes 2 minutes to reach that value instead of 20 seconds.

I noticed that if I explicitly do del s3, the memory goes down to 200-300MB total, which is super crazy. No Python code should need to run del, since reference counting should be taking care of this, but apparently it isn't!!
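
Roughly, the workaround looks like this (a hedged sketch; the bucket listing is a stand-in for whatever work the thread actually does):

import gc

import boto3


def do_s3_work(bucket_name):
    session = boto3.Session()
    s3 = session.resource('s3')
    list(s3.Bucket(bucket_name).objects.limit(count=1))  # placeholder workload
    del s3        # explicitly drop the reference instead of relying on refcounting
    gc.collect()  # then collect the cycles that apparently keep memory alive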

I cannot do this in my code since I need to reuse resources, and I'm starting to run out of options...

jbvsmo avatar Aug 21 '18 20:08 jbvsmo

At first I thought this might be related to https://github.com/boto/botocore/issues/1248 which is the only confirmed leak I know of.

However, looking into this, it seems to me that it is related to the client/resource objects. That being said, this isn't a memory leak: the reason you're seeing the ramp-up in memory is that each time you create a session/client, we have to go to disk to load the JSON models representing the service, etc. There's so much contention on a single file that it takes ~20-30 seconds to even instantiate all 100 sessions/clients, and considering each session has its own cache, I'm actually not all that surprised by these memory usage numbers.

I would suggest doing something like this:

import boto3  # Boto3Thread is the worker class from the repro script above

def run_pool(size):
    ts = []
    session = boto3.Session()  # one shared Session...
    for x in range(size):
        s3 = session.resource('s3')  # ...and a separate resource per thread, created from it
        t = Boto3Thread(s3)
        t.start()
        ts.append(t)
    return ts

This way you only instantiate one session, and can actually leverage the caching that the session provides to instantiate all 100 resource objects to give to each thread.

joguSD avatar Aug 23 '18 17:08 joguSD

@joguSD Sorry, but that doesn't explain why memory is not being released even with cyclic garbage collection in place.

And the strangest part is that if I do del s3 at the end of the for loop, a really good chunk of memory goes away. No Python code is really expected to "free" resources.

The code you provided is very similar to what I ended up using in my application, but besides the 100 threads with the same lifespan as the program, I occasionally run other things in parallel on other threads, and those add to total memory that is never freed!

After a few days, my 8GB server is out of memory again. This is, at the least, poor memory management on boto3's part. The only solution I am seeing is to revert 2 weeks of porting work and go back to boto 2.49.

jbvsmo avatar Aug 23 '18 18:08 jbvsmo

We also started experiencing this. Here's a quick output of py-spy. Note the excessive thread count (and corresponding memory usage). This code was stable for years and across dozens of boto3/botocore versions. There's clearly something buggy in the transition to urllib3. The app code here is using the s3transfer service to upload a few files.

It's also worth noting that this app isn't using threads where this triggers; the thread usage is entirely from the s3transfer library. Python 2.7.15; version freeze below.

GIL: 0.00%, Active: 100.00%, Threads: 711

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)                                                                                                         
100.00% 100.00%   10.44s    10.44s   ssl_wrap_socket (urllib3/util/ssl_.py:336)
  0.00% 100.00%   0.000s    10.44s   _make_request (urllib3/connectionpool.py:343)
  0.00% 100.00%   0.000s    10.44s   _execute_main (s3transfer/tasks.py:150)
  0.00% 100.00%   0.000s    10.44s   _send (botocore/endpoint.py:215)
  0.00% 100.00%   0.000s    10.44s   __bootstrap (threading.py:774)
  0.00% 100.00%   0.000s    10.44s   send (botocore/httpsession.py:242)
  0.00% 100.00%   0.000s    10.44s   _make_api_call (botocore/client.py:599)
  0.00% 100.00%   0.000s    10.44s   __bootstrap_inner (threading.py:801)
  0.00% 100.00%   0.000s    10.44s   run (concurrent/futures/thread.py:63)
  0.00% 100.00%   0.000s    10.44s   urlopen (urllib3/connectionpool.py:600)
  0.00% 100.00%   0.000s    10.44s   __call__ (s3transfer/tasks.py:126)
  0.00% 100.00%   0.000s    10.44s   _main (s3transfer/upload.py:692)
  0.00% 100.00%   0.000s    10.44s   make_request (botocore/endpoint.py:102)
  0.00% 100.00%   0.000s    10.44s   _send_request (botocore/endpoint.py:146)
  0.00% 100.00%   0.000s    10.44s   _get_response (botocore/endpoint.py:173)
  0.00% 100.00%   0.000s    10.44s   _worker (concurrent/futures/thread.py:75)
  0.00% 100.00%   0.000s    10.44s   run (threading.py:754)
  0.00% 100.00%   0.000s    10.44s   _validate_conn (urllib3/connectionpool.py:849)
  0.00% 100.00%   0.000s    10.44s   connect (urllib3/connection.py:356)
  0.00% 100.00%   0.000s    10.44s   _api_call (botocore/client.py:314)
# pip freeze
argcomplete==1.9.4
boto3==1.8.7
botocore==1.11.7
certifi==2018.8.24
chardet==3.0.4
click==6.7
decorator==4.3.0
docutils==0.14
functools32==3.2.3.post2
futures==3.2.0
idna==2.7
jmespath==0.9.3
jsonpatch==1.23
jsonpointer==2.0
jsonschema==2.6.0
python-dateutil==2.7.3
PyYAML==3.13
requests==2.19.1
s3transfer==0.1.13
simplejson==3.16.0
six==1.11.0
tabulate==0.8.2
urllib3==1.23
virtualenv==16.0.0
websocket-client==0.52.0

kapilt avatar Sep 07 '18 10:09 kapilt

@kapilt Considering the original issue was raised before the urllib3 changes were released, I'm not sure whether what you're experiencing is related. In my original analysis of this issue, whether an API call was actually carried out made no difference; it had everything to do with instantiating 100 different sessions.

joguSD avatar Sep 07 '18 18:09 joguSD

@joguSD That's fair; re-reading, it's not entirely clear it's of the same ilk. I'll file a separate issue after some more analysis, a differential against the urllib3 change, and checking the s3transfer parameters to not use threads. FWIW, we do create a bunch of sessions as well, but all are out of scope here and free to be gc'd.

kapilt avatar Sep 08 '18 11:09 kapilt

@joguSD Same problem here! Using boto3 to upload about 30,000 little files, I used multiprocessing to fork about 30 pools, and the memory increased from 1GB to 6GB immediately.

yangkang55 avatar Sep 11 '18 08:09 yangkang55

Confirmed: just the simple creation of a boto3.Session in threads/async handlers leads to extensive memory usage that is not freed at all (gc.collect() doesn't help either).

maybeshewill avatar Nov 01 '18 15:11 maybeshewill

FWIW, at least for my app, switching s3transfer to not use threads resolved a lot of issues with respect to memory.

kapilt avatar Nov 01 '18 15:11 kapilt

Hi, we also hit the same problem. The memory keeps increasing and doesn't get released. I tried patching some of the AWS code (including the caching decorators, so that they don't cache), manually clearing the loader cache, and adjusting the model loaders not to load the documentation. I noticed as well that the session has a register function but unregister isn't called, so I kept track of the registered objects and called that too (not sure if that makes a difference). That seemed to bring the memory down, at the expense of caching, but I didn't notice any speed difference. Any feedback or ideas from the AWS team about this?
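
For what it's worth, here is a sketch of the kind of cache-clearing being described, leaning entirely on private botocore internals (_session, the 'data_loader' component, and the _cache dict populated by botocore's instance_cache decorator), all of which may change between versions:

import boto3

session = boto3.Session()
s3 = session.client('s3')
# ... make some calls ...

# Reach into botocore and drop the cached JSON service models.
# This trades the model cache away to keep memory from accumulating.
loader = session._session.get_component('data_loader')
loader._cache.clear()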

wdiu avatar Jan 22 '19 21:01 wdiu

I too am experiencing this issue with the S3 boto3 client. Reading bucket objects keeps memory usage pretty stable, but writing them with put_object() causes growing RAM usage.

Gloix avatar Jan 23 '19 17:01 Gloix

We have noticed this problem too. We are using this in the backend of a Flask web application. By nature, the web application is multithreaded, so we cannot instantiate just one session globally in the app. @joguSD I have noticed a few things, correct me if I am wrong:

  1. boto3/botocore loads entire JSON files (services-2.json, resources-1.json, paginators-1.json, endpoints.json, _retry.json, etc.) into memory. Although these files are lazily loaded, is loading the entire file necessary? For example, when a ServiceModel is created for EC2, that file is 23,000+ lines long. If I only want to call one or two EC2 APIs, I don't need a ServiceModel that contains all of EC2's APIs and shapes. Is it possible to create a service model only for the APIs/objects I am interested in? For example, when I create the EC2 client, I could pass in the operation names I am interested in as a list.
  2. The JSON files contain documentation. Whenever these files are cached in memory, the documentation strings are also cached, which is unnecessary if I don't use them. The same applies when client classes are dynamically created and function docstrings are assigned to them: it effectively recreates the source code in memory, yet the docstrings are not needed just to call the methods. I think these should be optional too.

antonbarua avatar Jan 23 '19 22:01 antonbarua

@antonbarua I suppose something like that might be possible, but it might not be all that practical. Stripping the model down isn't as simple as keeping just the operations you want to use: you'd have to figure out which shapes are needed and which are orphaned, and then remove them.

The documentation is there for tools built on top of botocore, like the AWS CLI, but from a pure SDK perspective I can see why you wouldn't want it. If you were really so inclined, you could do a tree-shake of sorts on the model, stripping it down to what you need and placing it in ~/.aws/models to be used instead.
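
A rough sketch of that tree-shake idea, assuming a standard botocore layout; the source path, API version, and kept operations are illustrative, and it deliberately skips the orphan-shape pruning mentioned above (unused shapes would still be loaded):

import json
from pathlib import Path

# Source model from the installed botocore data directory (adjust the path);
# destination under ~/.aws/models, which botocore searches before its own data.
src = Path('/usr/lib/python3/dist-packages/botocore/data/ec2/2016-11-15/service-2.json')
dst = Path.home() / '.aws' / 'models' / 'ec2' / '2016-11-15' / 'service-2.json'

model = json.loads(src.read_text())

# Keep only the operations we actually call.
keep = {'DescribeInstances', 'RunInstances'}
model['operations'] = {name: op for name, op in model['operations'].items() if name in keep}

def strip_docs(node):
    # Recursively remove the 'documentation' strings the SDK doesn't need.
    if isinstance(node, dict):
        node.pop('documentation', None)
        for value in node.values():
            strip_docs(value)
    elif isinstance(node, list):
        for value in node:
            strip_docs(value)

strip_docs(model)
dst.parent.mkdir(parents=True, exist_ok=True)
dst.write_text(json.dumps(model))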

joguSD avatar Feb 13 '19 00:02 joguSD

Hi @joguSD, if trimming the model to keep only the desired operations isn't practical, what do you think about having an option that disables the cache and calls unregister, as described in https://github.com/boto/boto3/issues/1670#issuecomment-456570780? The initial memory consumption may stay the same, but at least it won't keep growing (i.e. it stops the memory leak).

wdiu avatar Feb 13 '19 14:02 wdiu

Is there any workaround while the fix is on the way :( ? Edit: gc.collect() alleviates the problem a bit. Thanks!

johnyoonh avatar Apr 13 '19 05:04 johnyoonh

Hi @Gloix ,

Initializing the S3 client once, as a class-level (static) variable, can fix the memory leak situation.

import threading
import boto3
import os
import base64
import time
import random
import psutil

BUCKET = '' # <--- YOUR BUCKET NAME HERE

MIN_WAIT = 1
MAX_WAIT = 20


class Boto3Thread(threading.Thread):
    daemon = True
    is_running = True
    # Class-level client: created once at class definition time and shared by all threads
    __s3_client = boto3.client('s3', region_name='us-east-1')

    def run(self):
        path = 'test_boto/'
        while self.is_running:
            file_name = path + 'file_' + str(random.randrange(100000))
            content = base64.b64encode(os.urandom(100000)).decode()

            self.__s3_client.put_object(
                Bucket=BUCKET,
                Key=file_name,
                Body=content,
                ContentType='text/plain'
            )
            if not self.is_running:
                # Avoid a useless sleep cycle
                break

            sleep_duration = random.randrange(MIN_WAIT, MAX_WAIT)
            #print('{} will sleep for {} seconds'.format(self.name, sleep_duration))
            time.sleep(sleep_duration)

def check_memory():
    import gc
    gc.collect()
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1024. / 1024.

def run_pool(size):
    ts = []
    for x in range(size):
        t = Boto3Thread()
        t.start()
        ts.append(t)
    return ts

def stop_pool(ts):
    for t in ts:
        t.is_running = False
    for t in ts:
        t.join()

def main():
    ts = run_pool(100)
    try:
        while True:
            print('Process Memory: {:.1f} MB'.format(check_memory()))
            time.sleep(5)
    except KeyboardInterrupt:
        pass
    finally:
        print('Wait for all threads to finish. Should take about {} seconds!'.format(MAX_WAIT))
        stop_pool(ts)

main()

kaochiuan avatar May 22 '19 11:05 kaochiuan

Sorry I'm late to the party, but @joguSD may I ask about the suggestion you made (quoted below)?

I would suggest doing something like this:

import boto3  # Boto3Thread is the worker class from the repro script above

def run_pool(size):
    ts = []
    session = boto3.Session()  # one shared Session...
    for x in range(size):
        s3 = session.resource('s3')  # ...and a separate resource per thread, created from it
        t = Boto3Thread(s3)
        t.start()
        ts.append(t)
    return ts

This way you only instantiate one session, and can actually leverage the caching that the session provides to instantiate all 100 resource objects to give to each thread.

I'm asking because, according to https://boto3.amazonaws.com/v1/documentation/api/latest/guide/resources.html#multithreading-multiprocessing, it is not recommended for multiple threads to share a session. So, if I have ten threads making separate S3 requests, should they share a session or not?

yjhouzz avatar Aug 22 '19 21:08 yjhouzz

@yjhouzz The documentation you linked to states (emphasis mine):

It is recommended to create a resource instance for each thread / process in a multithreaded or multiprocess application

The resource in the code snippet is not shared, just the session is.

irgeek avatar Aug 26 '19 02:08 irgeek

@irgeek read further:

In the example above, each thread would have its own Boto 3 session and its own instance of the S3 resource.

Read issue https://github.com/boto/botocore/issues/1246 for more info.

longbowrocks avatar Oct 16 '19 20:10 longbowrocks

I've started to use boto3 in a Flask application and got the 'cannot allocate memory' error. Is there any update on this issue, and are there best practices for using boto3 with Flask?

lucj avatar May 30 '20 21:05 lucj

Any solution to this? My code is very simple, but I have memory leaks even with max_concurrency set to 1 (the default is 10, by the way). Any help?

I am trying to download a 50GB file.

session = boto3.session.Session(
    aws_access_key_id='abc',  # placeholder credentials
    aws_secret_access_key='def',
)
conn = session.resource("s3")
conn.Bucket('mybucket').download_file(
    Filename=download_path + key.split("/")[-1],
    Key=key,
    Callback=print_status,
    Config=boto3.s3.transfer.TransferConfig(
        max_concurrency=1,
        multipart_chunksize=CHUNK_SIZE,
        io_chunksize=CHUNK_SIZE
    )
)

bsikander avatar Jun 03 '20 22:06 bsikander

I was having the same issue (Flask + boto3 + AWS Elastic Beanstalk) and it crashed the server multiple times due to out-of-memory errors. I tried gc.collect() and other methods, and none seemed to work.

Eventually I figured out that I have to run the function that uses boto3 separately in a different process (a separate Python script), so that when the subprocess terminates it also frees its memory.

import os
import subprocess

# Run the boto3 work in a child process; its memory is reclaimed on exit
cmd_params = ['python3', F'{os.getcwd()}/run_task.py', 'config.json', 'param1', 'param2']
p = subprocess.Popen(cmd_params, stdout=subprocess.PIPE)
out = p.stdout.read()
output = out.decode("utf-8")

The method is not elegant and it's just a workaround, but it works.

bktan81 avatar Jun 29 '20 12:06 bktan81

I do observe the same issue in a slightly different context when downloading larger files (10GB+) in Docker containers with a hard limit on memory, with a single boto3 session and no multithreaded invocation of Object.download_file (the code is very similar to https://github.com/boto/boto3/issues/1670#issuecomment-638486961).

In some cases I can also observe the same error as mentioned in https://github.com/boto/boto3/issues/1670#issuecomment-636389361:

1594222584953   File "/opt/amazon/lib/python3.6/site-packages/s3transfer/utils.py", line 364, in write
1594222584953     self._fileobj.write(data)
1594222584953 OSError: [Errno 12] Cannot allocate memory

It seems that disabling threading in boto3.s3.transfer.TransferConfig (use_threads=False) helps to some extent, but the occasional OSError still pops up.

From what I observed so far, the most reliable mitigation for me was to reduce the multipart chunk size (multipart_chunksize, e.g. to 1MB).
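
Putting both mitigations together, a minimal sketch (the bucket, key, and the 1MB chunk size are placeholders to tune for your workload):

import boto3
from boto3.s3.transfer import TransferConfig

config = TransferConfig(
    use_threads=False,                # disable the s3transfer thread pool
    multipart_chunksize=1024 * 1024,  # 1MB multipart chunks
)

s3 = boto3.Session().resource('s3')
s3.Object('mybucket', 'big-file.bin').download_file(
    '/tmp/big-file.bin',
    Config=config,
)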

pler avatar Jul 10 '20 09:07 pler

Has anyone found a workaround for an application like Flask where one session cannot be instantiated globally?

cschloer avatar Jul 17 '20 13:07 cschloer

@cschloer Cache and reuse sessions. A thread-local cache is fine. That way you won't create way too many sessions.
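
A minimal sketch of that suggestion using threading.local (the helper name cached_client is made up; a fuller production version appears in a later comment below):

import threading

import boto3

_local = threading.local()

def cached_client(service_name):
    # One Session per thread, with clients cached on it, so repeated calls
    # don't re-create sessions or re-load the JSON service models.
    if not hasattr(_local, 'session'):
        _local.session = boto3.Session()
        _local.clients = {}
    if service_name not in _local.clients:
        _local.clients[service_name] = _local.session.client(service_name)
    return _local.clients[service_name]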

longbowrocks avatar Jul 17 '20 14:07 longbowrocks

@cschloer @longbowrocks I created this issue 2 years ago and the situation has not changed since. My solution at the time, which is running today on hundreds of servers I have deployed, is exactly that of a local cache that I add to the current thread object.

Below is the code I use (slightly edited) to replace the boto3 resource and client functions. It is thread safe, does not require explicitly creating sessions, and your code doesn't need to be aware it is running inside a separate thread. You might need to do some cleanup to avoid open-file warnings when terminating threads.

There are limitations to this and I offer no guarantees. Use with caution.

import json
import hashlib
import time
import threading
import boto3.session

DEFAULT_REGION = 'us-east-1'
KEY = None
SECRET = None


class AWSConnection(object):
    def __init__(self, function, name, **kw):
        assert function in ('resource', 'client')
        self._function = function
        self._name = name
        self._params = kw

        if not self._params:
            self._identifier = self._name
        else:
            self._identifier = self._name + hash_dict(self._params)

    def get_connection(self):
        thread = threading.currentThread()

        if not hasattr(thread, '_aws_metadata_'):
            thread._aws_metadata_ = {
                'age': time.time(),
                'session': boto3.session.Session(),
                'resource': {},
                'client': {}
            }

        try:
            connection = thread._aws_metadata_[self._function][self._identifier]
        except KeyError:
            connection = create_connection_object(
                self._function, self._name, session=thread._aws_metadata_['session'], **self._params
            )
            thread._aws_metadata_[self._function][self._identifier] = connection

        return connection

    def __repr__(self):
        return 'AWS {0._function} <{0._name}> {0._params}'.format(self)

    def __getattr__(self, item):
        connection = self.get_connection()
        return getattr(connection, item)


def create_connection_object(function, name, session=None, region=None, **kw):
    assert function in ('resource', 'client')
    if session is None:
        session = boto3.session.Session()

    if region is None:
        region = DEFAULT_REGION

    key, secret = KEY, SECRET

    # Do not set these variables unless they were configured in the parameters file.
    # If they are not present, boto3 will try to load them by other means.
    if key and secret:
        kw['aws_access_key_id'] = key
        kw['aws_secret_access_key'] = secret

    return getattr(session, function)(name, region_name=region, **kw)


def hash_dict(dictionary):
    """ This function will hash a dictionary based on JSON encoding, so changes in
        list order do matter and will affect the result.
        Also, this is a hex output, so it is not size optimized.
    """
    json_string = json.dumps(dictionary, sort_keys=True, indent=None)
    return hashlib.sha1(json_string.encode('utf-8')).hexdigest()


def resource(name, **kw):
    return AWSConnection('resource', name, **kw)


def client(name, **kw):
    return AWSConnection('client', name, **kw)
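
Hypothetical usage, assuming the module above is saved as aws_conn.py: the proxies are created once and shared freely, and each attribute access resolves to the calling thread's own connection via AWSConnection.__getattr__.

import aws_conn

s3 = aws_conn.resource('s3')  # module-level singleton, safe to share across threads
sqs = aws_conn.client('sqs')

# In any thread, this transparently uses that thread's boto3 Session:
# s3.Bucket('mybucket').upload_file('local.txt', 'remote.txt')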

jbvsmo avatar Jul 17 '20 15:07 jbvsmo

Really appreciate the (very) quick and thorough response @jbvsmo

Your solution mostly worked for me. I combined it with simply reducing the number of processes in my uWSGI config; I think I was expecting too much from my tiny (1GB memory) server, so I reduced the number of processes from 10 to 5.

cschloer avatar Jul 17 '20 15:07 cschloer

This is totally crazy. The S3 client session drains all our memory resources.

orShap avatar Dec 29 '21 17:12 orShap

We ran into this problem today. The memory leak was crashing our servers.

foolishhugo avatar Nov 01 '22 16:11 foolishhugo