
Unable to tar large files


Hi,

Firstly, thanks! S3-tar has been really useful for archiving some of our buckets.

We have a number of buckets containing a mix of small files and files that are larger than 50% of the available RAM. When a large file is encountered, the process is killed with an out-of-memory error. It'd be really great to resolve this.

From a cursory look at the code, it seems the underlying cause is the use of io.BytesIO() for in-memory buffering, both when downloading from S3 and when assembling parts of the tar. Any file being processed therefore needs available RAM greater than fileSize * 2.
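For illustration, here's a minimal sketch of that pattern (the function is hypothetical, not the actual s3-tar code), showing where the two full copies of the object end up in RAM:

```python
import io

import boto3

s3 = boto3.client("s3")

def add_object_in_memory(bucket, key):
    # First full copy: the whole object body is read into RAM.
    body = io.BytesIO(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
    # Second full copy: the same bytes are duplicated into the in-memory
    # buffer that becomes a part of the tar, so peak usage is roughly
    # twice the object size.
    part = io.BytesIO()
    part.write(body.getvalue())
    part.seek(0)
    return part
```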

I think multiple algorithms are required depending on file size. For small files the current process makes sense, as caching reduces the total time needed.

However, when a large file is encountered it is probably necessary to pipe directly from the S3 download stream through tar and back to S3. This would mean being limited to uploading one part at a time (see the sketch below).
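A rough sketch of that pipeline, assuming boto3 and a fixed chunk size; it omits the tar framing (headers and padding) that a real implementation would interleave between reads, and all names are illustrative:

```python
import boto3

s3 = boto3.client("s3")
PART_SIZE = 64 * 1024 * 1024  # multipart parts must be >= 5 MiB (except the last)

def stream_copy(src_bucket, src_key, dst_bucket, dst_key):
    # Only one PART_SIZE buffer is resident at any moment, so peak RAM
    # is independent of the object size.
    upload = s3.create_multipart_upload(Bucket=dst_bucket, Key=dst_key)
    body = s3.get_object(Bucket=src_bucket, Key=src_key)["Body"]
    parts, part_number = [], 1
    while True:
        chunk = body.read(PART_SIZE)
        if not chunk:
            break
        resp = s3.upload_part(
            Bucket=dst_bucket,
            Key=dst_key,
            UploadId=upload["UploadId"],
            PartNumber=part_number,
            Body=chunk,
        )
        parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
        part_number += 1
    s3.complete_multipart_upload(
        Bucket=dst_bucket,
        Key=dst_key,
        UploadId=upload["UploadId"],
        MultipartUpload={"Parts": parts},
    )
```

Because each upload_part call depends on the preceding read, the upload is inherently serial here; uploading parts concurrently would require buffering several of them at once, which reintroduces the memory cost.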

Alternatively, some intelligent spooling to disk could be used, although this has the analogous problem that the maximum supported file size will be limited by the available disk instead (see the sketch below).
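For the spooling approach, Python's standard library already offers tempfile.SpooledTemporaryFile, which behaves like io.BytesIO below a size threshold and transparently spills to a temp file above it. A minimal sketch, with an arbitrary threshold:

```python
import tempfile

import boto3

s3 = boto3.client("s3")
SPOOL_MAX = 256 * 1024 * 1024  # stay in RAM below this; spill to disk above it

def fetch_spooled(bucket, key):
    # Small files never touch the disk; large ones trade RAM for disk
    # space, inheriting the disk-size limit noted above.
    buf = tempfile.SpooledTemporaryFile(max_size=SPOOL_MAX)
    s3.download_fileobj(bucket, key, buf)
    buf.seek(0)
    return buf
```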

huw0 · Jun 14 '22 17:06