Large bucket copies eat up all memory; how to deal with huge buckets?

GeoffreyPlitt opened this issue 11 years ago · 4 comments

Hi,

I'm copying from one bucket to another, hundreds of thousands of little files. I'm running this on a small box (a vagrant VM).

S3cmd (latest version, on Ubuntu) is eating up 3G memory and climbing. I leave it going overnight, and eventually I just get "retrying" errors. I suspect the swap drive runs out of disk space and then things go bad.

I would have thought copying from one bucket to another should consume very few resources on my local machine. Any advice?

GeoffreyPlitt · Oct 27 '14 19:10

It has to get each of the file lists, compare, and generate the list of "diffs" to operate on. We see out-of-memory errors above about 800,000 files frequently. The advice is to either break the job down into smaller pieces, or run on a 64-bit machine with more RAM.
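
For concreteness, one way to break the job into smaller pieces is to sync one top-level prefix at a time, so each run only has to build and diff a small file list. A rough, untested sketch (SRC and DST are placeholder bucket URIs; it assumes the bucket is organized into top-level prefixes, that keys contain no spaces, and that s3cmd ls prints the URI as the last field):

    # Untested sketch: copy one top-level prefix at a time so each sync only
    # builds a small file list. SRC and DST are placeholder buckets.
    SRC=s3://source-bucket
    DST=s3://dest-bucket
    s3cmd ls "$SRC/" | awk '{print $NF}' | while read -r uri; do
        # $uri is e.g. s3://source-bucket/photos/ for a "DIR" entry
        s3cmd sync "$uri" "$DST/${uri#$SRC/}"
    done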

mdomsch · Oct 27 '14 20:10

Why not do it the way rsync does? rsync used to generate one big list of files and compare them all at once, but later (IIRC) it changed to walk each directory, compare the files in that directory, add the files that need transferring to a queue handled by a forked process that performs the actual transfer, and then move on to the next directory.

This way, unless you have a single directory with a huge number of files, you not only get much better memory usage but also better speed, since you are transferring and comparing in parallel. With many thousands of files, this can cut memory usage a lot and shave several hours off the sync time.

A rough way to approximate this could be:

    find $dir/ -type d -exec s3cmd sync {}/* s3://$bucket/{}/ \;

but I didn't test it to see what happens, especially with nested directories (like $dir/dir1/dir2/) and removed files.

Maybe adding an option like --one-level would help. It would sync only the files and directory names at that level (without recursing) to the S3 bucket, allowing something like this:

    find $dir/ -type d -exec s3cmd --one-level sync {}/ s3://$bucket/{}/ \;

danielmotaleite · Jul 14 '15 19:07

The command

    find $dir/ -type d -exec s3cmd sync {}/* s3://$bucket/{}/ \;

fails because sync recurses the tree by default, so it transfers everything. So instead of --one-level, how about --no-recursive? That seems like a more logical option name.
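
Until an option like that exists, a rough workaround (untested sketch; it gives up sync's compare step and deletion handling, and the destination layout mirrors $dir rather than the full local path) is to upload only the regular files directly inside each directory with put:

    # Untested sketch: walk directories one at a time and upload only the
    # files directly inside each one, so nothing recurses. One s3cmd call per
    # file keeps memory flat but is slow, and unlike sync it re-uploads
    # unchanged files and ignores deletions.
    find "$dir" -type d | while read -r d; do
        rel=${d#$dir}                     # path of $d relative to $dir
        find "$d" -maxdepth 1 -type f \
            -exec s3cmd put {} "s3://$bucket$rel/" \;
    done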

danielmotaleite · Jul 14 '15 19:07

Does anyone have any solutions?

ComBin · May 21 '24 16:05