TileDB
TileDB copied to clipboard
Various improvements to the GCS VFS.
See each commit message for more details. This PR includes a fix for SC-26982
TYPE: IMPROVEMENT DESC: Various improvements to the Google Cloud Storage VFS
This pull request has been linked to Shortcut Story #26982: Remove polling for bucket/object propagation/deletion from the GCS VFS..
CI fails because because of composing failures. Will investigate.
Are we sure we want Remove polling for bucket/object propagation/deletion.? For multi-part uploads we've learned with GCS that they don't guarantee read-after-write, as such I think we need the wait for object to propagate?
Are you referring to this sentence of the docs? We are not using this API, but good question, it its not explicitly mentioned if object composition is strongly consistent. I submitted feedback on the docs.
a fix for SC-26982
That issue does not report a defect. What's there to fix? A putative performance improvement is not a defect.
The GCS docs are incomplete and ambiguous about the relationship between multipart uploads and operations that are strongly consistent. The documentation about strong consistency refers only to a subset of operations, and it does not address issues of mutual consistency between these classes of operations. Therefore, we must assume that multipart uploads and (say) listing are not mutual consistent. In absence of a clear and definitive statement that states the relationship, we must not assume that the system has the properties we might wish it to have.
The mere existence of CompleteMultipartUpload as a GCS request is a strong signal that multipart uploads do not have the strong consistency required. So removing polling without replacing it with a different mechanism for determining completion is a Very Bad Idea. It's going to lead to occasional loss of data. Integrity of data is paramount. We may not jeopardize it.
Recommend closing this PR and not returning to until we can measure the performance improvement that might be gained for what looks to be a rather large risk.
The mere existence of CompleteMultipartUpload as a GCS request is a strong signal that multipart uploads do not have the strong consistency required.
CompleteMultipartUpload is necessary because the multipart API does not require pre-specifying the total size of the final assembled object (it allows appending up to 10,000 parts before finalizing). See: https://docs.aws.amazon.com/AmazonS3/latest/API/API_CreateMultipartUpload.html
Let's keep the changes separate, so we can ship the first two:
- update GCS SDK version in #4031 after 2.16 1a. or 2. use of the new InsertObject signature (can be part of 1) after bumping to GCS 2.9
... 3. Possibly remove polling as in this PR, if we can fully substantiate a safe performance improvement
multipart uploads do not have the strong consistency required.
@eric-hughes-tiledb we are not using the S3-compatible multi-part uploads in GCS. In fact they are not even available in the C++ SDK. What we use is the Compose JSON API, which is officially confirmed to be strongly consistent.
Marking as draft again, let's do #4031 first.
Closing this for now. The change is a good one but we don't have time to properly test this at the moment. We also have a tracking story for when we are ready to do this work.