s4cmd 2.0.1 leaks memory for large uploads
Hi there:
We are using s4cmd to push, in parallel, about 70k files into an S3 bucket. The files vary in size from very small (tens of bytes) to somewhat large (11GB). On a recent run, s4cmd consumed nearly 80GB of memory before being killed by the Linux OOM killer.
To test, I created smaller pools of 2,500 and 1,000 files and modified s4cmd to sleep before exiting so we could get an accurate RSS measurement.
The 2,500 file pool was 6.9GB on disk. The 1,000 file pool was 2.5GB on disk.
After successfully uploading both pools, RSS for the s4cmd process was roughly equal to the total size of files uploaded:
The 2.5GB pool ended up with an RSS of 2.7GB.
The 6.1GB pool ended up with an RSS of 7.3GB.
The size of the files being uploaded, not the total number of files, is the driving factor. Uploading 5,000 2-byte files only drives the RSS to 0.16GB (~167MB).
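For reference, the RSS numbers above were read while the patched s4cmd slept before exit. A minimal sketch of that kind of hook, reading VmRSS from /proc on Linux (this helper is our own measurement addition, not part of s4cmd):

import time

def report_rss_and_sleep(seconds=3600):
    # Print the process's current resident set size, then sleep so RSS
    # can also be inspected externally (e.g. with ps or top).
    with open('/proc/self/status') as status:
        for line in status:
            if line.startswith('VmRSS:'):
                print(line.strip())  # e.g. "VmRSS:  2831234 kB"
                break
    time.sleep(seconds)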
Specifying a very small --max-singlepart-upload-size (for example, 1048576 bytes) does not change the memory usage.
After bisecting the upload() method in s4cmd.py, it appears the leak or cycle is happening inside of the boto3 put_object() call. Commenting out the put_object() keeps memory usage stable.
Importing and periodically calling gc.collect() also keeps memory usage in check.
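Roughly, that mitigation looks like the following; where exactly the call sits inside s4cmd's upload path, and the interval of 100, are our own arbitrary choices:

import gc

GC_EVERY_N_UPLOADS = 100  # arbitrary interval; tune as needed
_uploads_done = 0

def collect_periodically():
    # Force a full garbage collection every N uploads to break any
    # reference cycles that are keeping file data alive.
    global _uploads_done
    _uploads_done += 1
    if _uploads_done % GC_EVERY_N_UPLOADS == 0:
        gc.collect()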
I have not put together a test case using only boto3 to see if the leak occurs there. But I can't be the only person who would like to use s4cmd to upload several hundred GB of files.
Is there already a known workaround for this problem?
The environment is:
python 2.7.12
s4cmd 2.0.1
boto3 1.4.4
botocore 1.5.24
We built two test scripts to try and narrow down the leak and to determine whether the problem was in boto3 or a result of how boto3 was being called by s4cmd.
The first script calls boto3's put_object() directly. The second imports BotoClient from s4cmd.
In both cases, we ran the scripts against the same 1,000 and 2,500 file corpuses that were causing the leak with s4cmd. Neither script leaked, so it appears to be a cycle that's being created within s4cmd somehow, not an outright bug in boto3.
Two scripts follow:
#!/bin/env python

import boto3
import botocore
import os

testdir = '/tmp/testdir'
client = boto3.client('s3')

files = os.listdir(testdir)

for file in files:
    client.put_object(ACL='private',
                      Body=open('%s/%s' % (testdir, file), 'rb'),
                      Bucket="wawd-s4cmd-test-bucket",
                      Key="%s/%s" % ("testdir", file))
#!/bin/env python

from s4cmd import BotoClient
import os

testdir = '/tmp/testdir'
client = BotoClient({})

files = os.listdir(testdir)

for file in files:
    client.put_object(ACL='private',
                      Body=open('%s/%s' % (testdir, file), 'rb'),
                      Bucket="wawd-s4cmd-test-bucket",
                      Key="%s/%s" % ("testdir", file))
I'm experiencing the same issue when trying to sync a 270GB directory to a bucket.
It will run a dry run without any issue but won't actually sync more than a few GB.
The problem is still happening. Is there any fix?
For me, reducing the thread count (-c 2) seemed to stop it. YMMV. This was on a small VM with under 2GB of RAM.
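For example, something like the following (the source directory and bucket name here are placeholders, not from my actual run):

s4cmd put -r -c 2 /data/testdir s3://my-bucket/testdir/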
For me, -c 2 didn't help.
s4cmd was not able to upload a 2.2GB archive with about 500MB of RAM.
But these two options did help (actually only the second one, but I wanted to be sure what piece size would be used):
--multipart-split-size=100000000 --max-singlepart-upload-size=100000000
As I understand it, the default piece size is 50MB - I checked this commit.
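Put together, the working invocation looked roughly like this (the archive name and bucket are placeholders):

s4cmd put --multipart-split-size=100000000 --max-singlepart-upload-size=100000000 archive.tar.gz s3://my-bucket/archive.tar.gz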