Large bucket copies eat up all memory; how to deal with huge buckets?
Hi,
I'm copying from one bucket to another, hundreds of thousands of little files. I'm running this on a small box (a Vagrant VM).
s3cmd (latest version, on Ubuntu) is eating up 3 GB of memory and climbing. I leave it running overnight, and eventually I just get "retrying" errors. I suspect swap runs out of space and then things go bad.
I would have thought copying from one bucket to another should consume very few resources on my local machine. Any advice?
s3cmd has to fetch the full file listing for each bucket, compare them, and generate the list of "diffs" to operate on. We frequently see out-of-memory errors above roughly 800,000 files. The advice is to either break the job down into smaller pieces (see the sketch below), or run on a 64-bit machine with more RAM.
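For the bucket-to-bucket case, a rough, untested sketch of splitting the job by key prefix (the bucket names and prefix list below are placeholders, not from the original report) could look like:

# Rough sketch: sync one top-level prefix at a time so each s3cmd run
# only has to list and compare a fraction of the bucket's keys.
# Replace the prefix list and bucket names with your own.
for prefix in images logs thumbnails; do
    s3cmd sync "s3://source-bucket/$prefix/" "s3://dest-bucket/$prefix/"
done

Each run then only holds the listings for one prefix in memory instead of the whole bucket.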
Why not do it the way rsync does? In the past, rsync generated one big list of files and compared them, but later (IIRC) it was changed to walk each directory, compare the files in that directory, add the files to transfer to a queue handled by a forked process that performs the actual transfer, and then continue to the next directory.
This way, unless a single directory contains a huge number of files, you not only get much better memory usage but also better speed, since transferring and comparing happen in parallel. With many thousands of files, this can cut memory usage a lot and shave several hours off the sync time.
A rough way to approximate this could be:
find $dir/ -type d -exec s3cmd sync {}/* s3://$bucket/{}/ \;
But I didn't test to see what happens, especially with nested directories (like $dir/dir1/dir2/) and with removed files.
Maybe adding an option like --one-level would help. This option would sync the files and directory names at that level (without recursing) to the S3 bucket, allowing something like:
find $dir/ -type d -exec s3cmd --one-level sync {}/ s3://$bucket/{}/ \;
The command
find $dir/ -type d -exec s3cmd sync {}/* s3://$bucket/{}/ \;
fails because sync recurses the tree by default, so it transfers everything. So instead of --one-level, how about --no-recursive? It seems a more logical option name.
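In the meantime, a rough, untested workaround for local-to-S3 syncs (assuming the same $dir and $bucket placeholders as above, and filenames without whitespace) could be to upload only each directory's own files with put, which never recurses:

# Untested sketch: walk every directory, but only upload the regular
# files directly inside it (no recursion), so each s3cmd invocation
# stays small. $dir and $bucket are the placeholders used above.
find "$dir" -type d | while read -r d; do
    rel="${d#$dir}"; rel="${rel#/}"            # path relative to $dir
    files=$(find "$d" -maxdepth 1 -type f)     # this directory's own files only
    # skip directories that contain no regular files of their own
    [ -n "$files" ] && s3cmd put $files "s3://$bucket/${rel:+$rel/}"
done

Each s3cmd invocation then only sees one directory's files, so memory use stays roughly proportional to the largest directory rather than to the whole tree.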
Does anyone have any solutions?