Cap number of S3 upload parts at 10000
The maximum number of parts for an S3 multipart upload is 10,000: https://docs.aws.amazon.com/AmazonS3/latest/dev/qfacts.html
For large VFS write operations it's possible we will generate more than 10,000 parts. We should either add a clamp here: https://github.com/TileDB-Inc/TileDB/blob/279d3ca98c734bca7855f0f3dbd783d133145979/tiledb/sm/filesystem/s3.cc#L783 or reduce the number of parts by iteratively increasing the part size for the upload from its default of 5MB (from config param vfs.s3.multipart_part_size) until the total number is < 10,000.
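For illustration, here is a minimal standalone sketch of the second option (grow the part size until the upload fits within the part limit). The function adjusted_part_size and the constant kS3MaxParts are hypothetical names, not TileDB code, and the sketch assumes the total upload size is known up front:

```cpp
#include <cstdint>
#include <iostream>

constexpr uint64_t kS3MaxParts = 10000;  // hard S3 limit on parts per upload

// Smallest multiple of the configured part size (vfs.s3.multipart_part_size,
// default 5MB) that keeps the part count at or below the 10,000-part limit.
uint64_t adjusted_part_size(uint64_t total_bytes, uint64_t configured_part_size) {
  uint64_t part_size = configured_part_size;
  // Iteratively grow the part size until the upload fits in <= 10,000 parts.
  while ((total_bytes + part_size - 1) / part_size > kS3MaxParts)
    part_size += configured_part_size;
  return part_size;
}

int main() {
  const uint64_t five_mb = 5ULL * 1024 * 1024;
  const uint64_t hundred_gb = 100ULL * 1024 * 1024 * 1024;
  // 100GB at 5MB parts is 20,480 parts; growing the part size to 15MB
  // brings the count down to 6,827, within the 10,000-part limit.
  std::cout << adjusted_part_size(hundred_gb, five_mb) << " bytes\n";
}
```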
I first considered clamping the 'num_ops' at the line referenced in the issue description. This will not work because a single S3::write_multipart() does not necessarily represent the entirety of a single multipart upload transaction. Consider the scenario where S3::write() invokes S3::write_multipart() twice. The first invocation may do all 10k writes, and the next invocation will completely fail. This will be detected as a failed transaction and the upload will abort.
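To make that concrete, here is a hypothetical sketch (MultipartState and clamped_num_ops are illustrative, not the actual s3.cc code) of why a clamp computed from a single call's num_ops cannot help: the 10,000-part limit applies to the whole upload transaction, and part numbers accumulate across write_multipart() calls.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>

constexpr uint64_t kS3MaxParts = 10000;

// Parts already issued for one multipart upload transaction. This counter
// is what a per-call clamp cannot see.
struct MultipartState {
  uint64_t parts_issued = 0;
};

// A per-call clamp on num_ops, computed only from the current buffer.
uint64_t clamped_num_ops(uint64_t buffer_size, uint64_t part_size) {
  const uint64_t num_ops = (buffer_size + part_size - 1) / part_size;
  return std::min(num_ops, kS3MaxParts);
}

int main() {
  const uint64_t part_size = 5ULL * 1024 * 1024;  // 5MB
  MultipartState state;

  // First write_multipart(): a buffer of 10,000 * 5MB uses exactly
  // 10,000 parts, so the per-call clamp never triggers.
  state.parts_issued += clamped_num_ops(10000 * part_size, part_size);

  // Second write_multipart() on the same upload: its own clamp would still
  // allow up to 10,000 parts, but the transaction has none left, so the
  // part uploads fail and the whole upload is aborted.
  const uint64_t second = clamped_num_ops(part_size, part_size);
  std::cout << "transaction exceeds limit: " << std::boolalpha
            << (state.parts_issued + second > kS3MaxParts) << "\n";
}
```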
I next considered moving up the stack into S3::write() and adjusting the multipart part size based on the total length of the write. For example, if the multipart size was 1B and the total write size was 20KB, we would increase the multipart size to 2B. This also will not work, because a single S3::write() may not represent the entirety of a single upload transaction. The S3::write() would need to know the cumulative size of all writes that the user intends to perform before closing the "file" (and subsequently flushing and finishing the upload transaction).
I next considered forcefully finishing the upload transaction and starting a new one when S3::write() will hit the 10k part limit. I don't like this because we may persist some of the S3::write()s without persisting others. We should adhere to our contract of only persisting the file once we close (aka flush) the file, otherwise the application developer consuming our interface cannot make reliable assumptions about crash consistency.
I believe the correct behavior would be to fail the S3::write() if it will exceed the S3 limits. I think this is reasonable because 1) we're probably dealing with a really large file, and AWS has a 5TB object size limit anyway, and 2) it gives the user the flexibility to either increase the part size or break their data into multiple files.
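For concreteness, a minimal sketch of what that failure check might look like, assuming the multipart state tracks how many parts have already been issued; the names (MultipartState, check_part_limit) and the exception-based error reporting are illustrative stand-ins rather than the actual Status-based API:

```cpp
#include <cstdint>
#include <iostream>
#include <stdexcept>
#include <string>

constexpr uint64_t kS3MaxParts = 10000;

// Parts already uploaded in the current multipart transaction.
struct MultipartState {
  uint64_t parts_issued = 0;
};

// Rejects the write up front (a stand-in for returning an error Status) if
// appending `length` bytes at `part_size` would push the transaction past
// the 10,000-part limit. No new parts are issued, so nothing is persisted
// until the file is closed/flushed successfully.
void check_part_limit(const MultipartState& state, uint64_t length,
                      uint64_t part_size) {
  const uint64_t new_parts = (length + part_size - 1) / part_size;
  if (state.parts_issued + new_parts > kS3MaxParts) {
    throw std::runtime_error(
        "S3 multipart upload would exceed " + std::to_string(kS3MaxParts) +
        " parts; increase vfs.s3.multipart_part_size or split the data "
        "across multiple files");
  }
}

int main() {
  const uint64_t five_mb = 5ULL * 1024 * 1024;
  MultipartState state;
  state.parts_issued = 10000;  // transaction already at the limit
  try {
    check_part_limit(state, five_mb, five_mb);  // would be part 10,001
  } catch (const std::runtime_error& e) {
    std::cerr << e.what() << "\n";  // fails before any part is uploaded
  }
}
```

Because the check runs before any new part is uploaded, the existing contract is preserved: nothing is persisted unless the file is later closed and flushed successfully.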
Let me know what you think, @tdenniston.