TileDB icon indicating copy to clipboard operation
TileDB copied to clipboard

Various improvements to the GCS VFS.

Open teo-tsirpanis opened this issue 2 years ago • 9 comments

See each commit message for more details. This PR includes a fix for SC-26982


TYPE: IMPROVEMENT DESC: Various improvements to the Google Cloud Storage VFS

teo-tsirpanis avatar Mar 30 '23 12:03 teo-tsirpanis

Are we sure we want Remove polling for bucket/object propagation/deletion.? For multi-part uploads we've learned with GCS that they don't guarantee read-after-write, as such I think we need the wait for object to propagate?

Shelnutt2 avatar Apr 19 '23 10:04 Shelnutt2

Are you referring to this sentence of the docs? We are not using this API, but good question, it its not explicitly mentioned if object composition is strongly consistent. I submitted feedback on the docs.

teo-tsirpanis avatar Apr 19 '23 11:04 teo-tsirpanis

a fix for SC-26982

That issue does not report a defect. What's there to fix? A putative performance improvement is not a defect.

The GCS docs are incomplete and ambiguous about the relationship between multipart uploads and operations that are strongly consistent. The documentation about strong consistency refers only to a subset of operations, and it does not address issues of mutual consistency between these classes of operations. Therefore, we must assume that multipart uploads and (say) listing are not mutual consistent. In absence of a clear and definitive statement that states the relationship, we must not assume that the system has the properties we might wish it to have.

The mere existence of CompleteMultipartUpload as a GCS request is a strong signal that multipart uploads do not have the strong consistency required. So removing polling without replacing it with a different mechanism for determining completion is a Very Bad Idea. It's going to lead to occasional loss of data. Integrity of data is paramount. We may not jeopardize it.

Recommend closing this PR and not returning to until we can measure the performance improvement that might be gained for what looks to be a rather large risk.

eric-hughes-tiledb avatar Apr 19 '23 11:04 eric-hughes-tiledb

The mere existence of CompleteMultipartUpload as a GCS request is a strong signal that multipart uploads do not have the strong consistency required.

CompleteMultipartUpload is necessary because the multipart API does not require pre-specifying the total size of the final assembled object (it allows appending up to 10,000 parts before finalizing). See: https://docs.aws.amazon.com/AmazonS3/latest/API/API_CreateMultipartUpload.html

ihnorton avatar Apr 19 '23 13:04 ihnorton

Let's keep the changes separate, so we can ship the first two:

  1. update GCS SDK version in #4031 after 2.16 1a. or 2. use of the new InsertObject signature (can be part of 1) after bumping to GCS 2.9

... 3. Possibly remove polling as in this PR, if we can fully substantiate a safe performance improvement

ihnorton avatar Apr 19 '23 13:04 ihnorton

multipart uploads do not have the strong consistency required.

@eric-hughes-tiledb we are not using the S3-compatible multi-part uploads in GCS. In fact they are not even available in the C++ SDK. What we use is the Compose JSON API, which is officially confirmed to be strongly consistent.

teo-tsirpanis avatar Apr 20 '23 12:04 teo-tsirpanis

Marking as draft again, let's do #4031 first.

teo-tsirpanis avatar Apr 20 '23 12:04 teo-tsirpanis

Closing this for now. The change is a good one but we don't have time to properly test this at the moment. We also have a tracking story for when we are ready to do this work.

KiterLuc avatar Feb 27 '24 07:02 KiterLuc