s3committer
Multipart upload of one large file instead of a directory with per-partition files
Just wondering if my interpretation of some relevant docs and the code here is correct:
- It seems like uploading a single large file in parts is possible / supported by S3 (docs), as sketched at the end of this comment:
  "In a distributed development environment, it is possible for your application to initiate several updates on the same object at the same time. Your application might initiate several multipart uploads using the same object key. For each of these uploads, your application can then upload parts and send a complete upload request to Amazon S3 to create the object."
- However, my reading of the code here is that s3committer doesn't support this; it only cares about jobs that output a directory containing one output file per task.
- I assume the same is true of the upstreamed support slated for Hadoop 3.1.0.
Is that all correct?
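For concreteness, here's a minimal sketch of what I mean (my own, in Python with boto3 rather than the Java SDK this repo uses; bucket/key names are placeholders): stitching independently produced parts into one S3 object with the multipart-upload API.

```python
# Sketch: assemble one S3 object from independently produced parts using the
# multipart-upload API (boto3). Bucket/key names are placeholders.
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "output/big-file.bin"

# 1. The "job" initiates a single multipart upload and shares the UploadId
#    with all tasks.
upload = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
upload_id = upload["UploadId"]

# 2. Each "task" uploads its output as one part. Part numbers determine the
#    final byte order; every part except the last must be at least 5 MiB.
def upload_part(part_number: int, data: bytes) -> dict:
    resp = s3.upload_part(
        Bucket=BUCKET,
        Key=KEY,
        UploadId=upload_id,
        PartNumber=part_number,
        Body=data,
    )
    return {"PartNumber": part_number, "ETag": resp["ETag"]}

# Placeholder "task outputs": one 5 MiB part plus a small final part.
task_outputs = [b"a" * (5 * 1024 * 1024), b"b" * 1024]
parts = [upload_part(i + 1, data) for i, data in enumerate(task_outputs)]

# 3. The "committer" completes the upload; the object only appears at this point.
s3.complete_multipart_upload(
    Bucket=BUCKET,
    Key=KEY,
    UploadId=upload_id,
    MultipartUpload={"Parts": parts},
)
```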
Are you asking whether this can stitch together task outputs into a single file?
yea that is a good summary, thanks
Also wanted to leave google-breadcrumbs, since I've been in discussions recently about parallel-writing large single files (e.g. HDF5) in cloud stores using multipart-upload APIs.
This repo is the most sophisticated use of multipart uploads in that general direction that I've seen, so I'm also just wondering whether that use case has come up / been discussed or vetted here or anywhere else.
I think most formats would be difficult to write with a multi-part API. You'd have to use structures within the file format to support this.
An Avro file, for example, could be created using this method if each task used the same schema and produced self-contained Avro blocks (1MB chunks of encoded data starting with a sync marker). You wouldn't be able to do this with Parquet though because the file footer needs to encode where each row group starts in the overall file.
Simple cases like CSV and JSON would work when uncompressed, but if you want to add compression (and you certainly do) then you'd have to support concatenated compression blocks at any point in the file, which ruins splittability.
I'm not sure this approach would work well enough in general to pursue. Maybe if you designed a format around it or had a special writer for a format like Parquet that could handle this.
I'm not sure this approach would work well enough in general to pursue
I agree, just wanted to do diligence on the tradeoffs and prior art.
Also, worth noting that the value of ["hacks" like you sketched above] is dependent on how locked-in a given community is to working with single-large-file formats.
I've been researching this through the lens of the single-cell-sequencing world, where HDF5 is very prominent today. But there is a fair amount of momentum toward cloud-friendlier formats (the HDF5 lock-in doesn't seem that severe, and Zarr is an emerging front-runner), so my expectation is that no one will feel compelled to (mis)use multipart uploads in this way.
FWIW, here are some notes I took a few months ago about the various constraints on a single-file multipart-upload approach on GCP, AWS, and Azure.
A notable one is that GCP only supports 32-wide multipart uploads (via its compose API), but you can do two layers of 32, for a total of 1024. That's a weird middle ground: a lot of use cases could be fine for a while with ≤1024 partitions (stitched together awkwardly in two layers of 32), if a library that did that existed.
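In case it's useful, here's a rough sketch of that two-layer idea using the google-cloud-storage Python client; all bucket/object names are made up, and it assumes at most 32 * 32 = 1024 parts.

```python
# Sketch: stitch up to 32*32 = 1024 partition objects into one GCS object by
# composing in two layers of 32. Names are placeholders.
from google.cloud import storage

COMPOSE_LIMIT = 32  # GCS compose accepts at most 32 source objects per call

def compose_two_layers(bucket, part_names, dest_name):
    assert len(part_names) <= COMPOSE_LIMIT ** 2, "needs a third layer beyond 1024 parts"
    parts = [bucket.blob(name) for name in part_names]

    # Layer 1: compose each group of <=32 parts into an intermediate object.
    intermediates = []
    for i in range(0, len(parts), COMPOSE_LIMIT):
        inter = bucket.blob(f"{dest_name}.intermediate-{i // COMPOSE_LIMIT}")
        inter.compose(parts[i:i + COMPOSE_LIMIT])
        intermediates.append(inter)

    # Layer 2: compose the (<=32) intermediates into the final object.
    dest = bucket.blob(dest_name)
    dest.compose(intermediates)

    # Clean up the intermediates (and, typically, the original parts too).
    for inter in intermediates:
        inter.delete()
    return dest

# Usage (hypothetical names):
# client = storage.Client()
# bucket = client.bucket("my-bucket")
# compose_two_layers(bucket,
#                    [f"parts/part-{i:05d}" for i in range(1024)],
#                    "output/big-file.bin")
```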
A couple other inlines, just for grins:
You wouldn't be able to do this with Parquet though because the file footer needs to encode where each row group starts in the overall file.
If you had to, it seems like you could write all the blocks, then do a scanLeft / cumulative-sum over them to compute the necessary offsets, then write the footer; 3 parallel jobs instead of 1 non-parallel job could still be a big improvement (the "3" is O(1) in this case 😄)
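To make that cumulative-sum step concrete, a toy sketch in plain Python with made-up block sizes; the resulting offsets (plus the total size) are what the footer writer would need.

```python
# Toy sketch of the "scanLeft / cumulative-sum" step: compute each block's
# absolute offset in the final stitched file from the per-task block sizes.
from itertools import accumulate

# Hypothetical byte sizes of the blocks produced by each task, in order.
block_sizes = [128_000_000, 96_500_000, 110_250_000]

# Exclusive prefix sum: offset of block i = sum of sizes of blocks 0..i-1.
offsets = [0] + list(accumulate(block_sizes))[:-1]
total_size = sum(block_sizes)

print(offsets)     # [0, 128000000, 224500000]
print(total_size)  # 334750000 -> footer records these offsets and the total
```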
Simple cases like CSV and JSON would work when uncompressed, but if you want to add compression (and you certainly do) then you'd have to support concatenated compression blocks at any point in the file, which ruins splittability.
AFAIK, block-gzipping would let you pull this off here (concatenating a bunch of gzips is a valid gzip), unless I'm misunderstanding.
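A quick way to check that claim with nothing but the standard library (the CSV content here is just made up):

```python
# Demonstrate that concatenated gzip members decompress as one valid stream.
import gzip

chunk_a = b"id,value\n1,foo\n2,bar\n"   # e.g. output of task 0
chunk_b = b"3,baz\n4,qux\n"             # e.g. output of task 1

# Each task compresses its own output independently...
member_a = gzip.compress(chunk_a)
member_b = gzip.compress(chunk_b)

# ...and the byte-concatenation of the members is itself a valid gzip file.
combined = member_a + member_b
assert gzip.decompress(combined) == chunk_a + chunk_b
```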
But anyway, agreed that I don't think anyone actually wants to or should do this atm.
(Feel free to close, thanks!)
If you had to, it seems like you could write all the blocks, then do . . .
Yeah, you could do it with special writers; it just requires support at the file-format level, not the committer level. You could do something like this in a system like Iceberg, but I wouldn't want to do it in committers.
AFAIK, block-gzipping would let you pull this off here (concatenating a bunch of gzips is a valid gzip), unless I'm misunderstanding.
It would be a valid file, but it's non-trivial to jump to an offset and find the next gzip block to start scanning. It's doable, but ugly, like splitting quoted CSV. So essentially this is not reliably splittable, which is critical for any large-scale use.