
Research archiving format for datasets prior to uploading to S3

Open advdv opened this issue 7 years ago • 2 comments

During the initial dataset implementation it became clear that zipping the files before uploading them to an object store involves several trade-offs:

  • CON: both upload and download require writing the zip to a temporary file first; on large datasets this can be a problem because there may not be enough disk space for double the dataset size
  • CON: it is hard to merge two datasets or extract single files without downloading the whole zip
  • CON: limited to 5TB zip files (the S3 object size limit)
  • PRO: enables dataset-wide compression
  • PRO: partial dataset uploads are not possible; S3 either writes the whole object or nothing

There are two alternatives. The first is to use tar as the archiving format: tar can be streamed, which solves the temporary-file issue, but it keeps the 5TB limit and doesn't make it any easier to merge or download part of the dataset.

The other alternative is to not archive datasets into a single file at all, and instead upload each file as a separate object to S3. This would change the size limit to 5TB PER FILE and make it trivial to download only a single file. But it would rule out dataset-wide compression, and it would now become possible to end up with partial uploads.

The latter would also allow syncing logic that only moves changes to and from S3; the CON is that we would need to provide access to the ListObjects API on the public bucket.
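The syncing logic could look roughly like this: compare the local file set against the keys a ListObjects call would return for the dataset prefix, and only move the difference. The key sets below are hard-coded stand-ins for real listings, and all names are illustrative:

```go
package main

import "fmt"

// diff computes which keys need uploading and which remote objects are
// stale, given the local file set and the keys returned by a
// (hypothetical) ListObjects call. Only the changes move over the wire.
func diff(local, remote map[string]bool) (upload, remove []string) {
	for k := range local {
		if !remote[k] {
			upload = append(upload, k)
		}
	}
	for k := range remote {
		if !local[k] {
			remove = append(remove, k)
		}
	}
	return
}

func main() {
	up, rm := diff(
		map[string]bool{"a.csv": true, "b.csv": true}, // local files
		map[string]bool{"b.csv": true, "c.csv": true}, // remote listing
	)
	fmt.Println("upload:", up, "remove:", rm)
}
```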

advdv avatar Jan 31 '18 14:01 advdv

It seems that ZIP64 meets most of the requirements. It allows streaming compression and random-access decompression, because file offsets are recorded in the archive. The format itself allows files and archives up to 16 exabytes, and we may be able to use multipart zip files to exceed the 5TB limit. I think this basically just means splitting the zip file into 5TB chunks.

The following Go library, a fork of archive/zip, supports these features:

https://github.com/sourcegraph/lazyzip

The downside is that compression is per file, so there is no dataset-wide compression.

Overv avatar Feb 08 '18 15:02 Overv

I'm proposing that, due to time constraints, we do not implement a multi-object archiving format for now. Functionality has to be settled by the end of this week and this topic requires more research. Instead we'll suffix each object with .tar to indicate its format, so that we can easily switch formats later on. And thanks for the lazyzip link @Overv — it inspired me to put the archiving/compression implementation behind an interface so we can more easily swap formats at the code level.

I'm moving this to the next milestone; refactoring of the current codebase to .tar happens in #284

advdv avatar Feb 12 '18 08:02 advdv