s3-tar
Unable to tar large files
Hi,
Firstly, thanks! S3-tar has been really useful for archiving some of our buckets.
We have a number of buckets that contain a mix of small files and files that are larger than 50% of available RAM. When a large file is encountered, the process is killed with an out-of-memory error. It'd be really great to resolve this.
From a cursory look at the code, it seems the underlying cause is the use of io.BytesIO() for in-memory processing, both when downloading from S3 and when creating parts of the tar, meaning that processing any file requires RAM greater than twice its size.
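For illustration, here's a rough sketch of that in-memory pattern (this is only my reading of the approach, not the library's actual code; the bucket/key names are placeholders):

```python
import io
import tarfile

import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "path/to/large-object"  # placeholders

# Copy #1: the whole object is downloaded into an in-memory buffer.
source = io.BytesIO()
s3.download_fileobj(BUCKET, KEY, source)
source.seek(0)

# Copy #2: the same bytes are written into a second in-memory buffer
# holding the tar part, so peak RAM is roughly 2x the file size.
part = io.BytesIO()
with tarfile.open(fileobj=part, mode="w") as tar:
    info = tarfile.TarInfo(name=KEY)
    info.size = source.getbuffer().nbytes
    tar.addfile(info, source)
```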
I think multiple algorithms are needed depending on file size. For small files the current process makes sense, as in-memory caching reduces the total time needed.
However, when a large file is encountered, it is probably necessary to pipe directly from the S3 download stream into tar and back up to S3 (see the sketch below). This would mean only one part can be uploading at a time.
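Something along these lines might work, using a multipart upload so that no more than roughly one part is ever held in memory. This is just a sketch under my own assumptions: the function name, bucket/key names and 64 MiB part size are made up, and the tar handling is simplified to a single member.

```python
import tarfile

import boto3

s3 = boto3.client("s3")
PART_SIZE = 64 * 1024 * 1024  # non-final multipart parts must be >= 5 MiB

def stream_object_into_tar(src_bucket, src_key, dest_bucket, dest_key):
    """Hypothetical sketch: stream one S3 object into a tar archive on S3
    without ever holding the full file in memory."""
    size = s3.head_object(Bucket=src_bucket, Key=src_key)["ContentLength"]

    # The tar header is small and can be built up front.
    info = tarfile.TarInfo(name=src_key)
    info.size = size
    header = info.tobuf(format=tarfile.GNU_FORMAT)

    # Tar member data is padded to a multiple of 512 bytes.
    padding = b"\0" * ((512 - size % 512) % 512)

    body = s3.get_object(Bucket=src_bucket, Key=src_key)["Body"]
    mpu = s3.create_multipart_upload(Bucket=dest_bucket, Key=dest_key)
    parts, part_number = [], 1
    buffer = header  # accumulate at most ~one part's worth of bytes

    def flush(data, number):
        resp = s3.upload_part(
            Bucket=dest_bucket, Key=dest_key, UploadId=mpu["UploadId"],
            PartNumber=number, Body=data,
        )
        return {"PartNumber": number, "ETag": resp["ETag"]}

    try:
        while True:
            chunk = body.read(PART_SIZE)
            if not chunk:
                break
            buffer += chunk
            while len(buffer) >= PART_SIZE:
                parts.append(flush(buffer[:PART_SIZE], part_number))
                part_number += 1
                buffer = buffer[PART_SIZE:]
        # Final part: leftover data, padding, and two 512-byte end-of-archive blocks.
        buffer += padding + b"\0" * 1024
        parts.append(flush(buffer, part_number))
        s3.complete_multipart_upload(
            Bucket=dest_bucket, Key=dest_key, UploadId=mpu["UploadId"],
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        s3.abort_multipart_upload(
            Bucket=dest_bucket, Key=dest_key, UploadId=mpu["UploadId"]
        )
        raise
```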
Alternatively, some intelligent spooling to disk could be used, although this has the analogous limitation that the maximum supported file size is bounded by available disk space.
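For the spooling route, Python's tempfile.SpooledTemporaryFile would be one way to keep the fast in-memory path for small files while spilling large ones to disk (again just a sketch; the 256 MiB threshold and the bucket/key names are placeholders):

```python
import tempfile

import boto3

s3 = boto3.client("s3")

# Stays in RAM below max_size, transparently spills to a temp file above it.
with tempfile.SpooledTemporaryFile(max_size=256 * 1024 * 1024) as spool:
    s3.download_fileobj("my-bucket", "path/to/large-object", spool)
    spool.seek(0)
    # ...add `spool` to the tar here in place of an io.BytesIO buffer...
```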