GH-40557: [C++] Use `PutObject` request for S3 in OutputStream when only uploading small data
Rationale for this change
See #40557. The previous implementation would always issue multipart uploads, which cost 3x RTT to S3, instead of just 1x RTT with a single PutObject request.
What changes are included in this PR?
Implement logic in the S3 OutputStream to use a PutObject request if the data is below a certain threshold (5 MB) when the output stream is closed. If more data is written, a multipart upload is triggered. Note: previously, opening the output stream was already expensive because the CreateMultipartUpload request was issued at that point. With this change, opening the output stream becomes cheap, as we instead wait until some data has been written before deciding which upload method to use. This required some additional state-keeping in the output stream class.
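To make the decision concrete, here is a rough, self-contained sketch of the intended behavior (not the actual implementation; the class and constant names such as `SketchOutputStream` and `kPartUploadThreshold` are made up for illustration, and the S3 calls are stubbed out):

```cpp
#include <cstdint>
#include <vector>

// 5 MB threshold, matching S3's minimum part size for multipart uploads.
constexpr int64_t kPartUploadThreshold = 5 * 1024 * 1024;

class SketchOutputStream {
 public:
  void Write(const uint8_t* data, int64_t nbytes) {
    buffer_.insert(buffer_.end(), data, data + nbytes);
    if (static_cast<int64_t>(buffer_.size()) > kPartUploadThreshold) {
      // Too much data for a single PutObject: switch to a multipart upload.
      if (!multipart_started_) StartMultipartUpload();  // CreateMultipartUpload
      UploadBufferedPart();                             // UploadPart
    }
  }

  void Close() {
    if (!multipart_started_) {
      PutObject();  // small data: one PutObject request, 1x RTT in total
    } else {
      UploadBufferedPart();       // upload whatever is still buffered
      CompleteMultipartUpload();  // finalize the multipart upload
    }
  }

 private:
  // Stubs standing in for the actual S3 client requests.
  void StartMultipartUpload() { multipart_started_ = true; }
  void UploadBufferedPart() { buffer_.clear(); }
  void CompleteMultipartUpload() {}
  void PutObject() { buffer_.clear(); }

  std::vector<uint8_t> buffer_;
  bool multipart_started_ = false;
};
```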
Are these changes tested?
No new tests were added, as there are already tests for very small writes and very large writes, which will trigger both ways of uploading. Everything should therefore be covered by existing tests.
Are there any user-facing changes?
- Previously, we would fail when opening the output stream if the bucket doesn't exist. We inferred that from the response to the CreateMultipartUpload request, which we no longer send upon opening the stream. We now fail at closing, or at writing (once >5 MB have accumulated), instead. Replicating the old behavior is not possible without sending another request, which defeats the purpose of this performance optimization. I hope this is fine.
- GitHub Issue: #40557
:warning: GitHub issue #40557 has been automatically assigned in GitHub to PR creator.
From the log output, it seems like the failing CI jobs are not related to this change. Correct me if I am wrong though. Should I rebase (in case the flaky tests are already fixed on main)?
I think it’s worth rebasing to see?
Rebased to current main, now waiting for the CI approval again :)
Previously, we would fail when opening the output stream if the bucket doesn't exist. We inferred that from the response to the CreateMultipartUpload request, which we no longer send upon opening the stream. We now fail at closing, or at writing (once >5 MB have accumulated), instead.
Hmm, I'm not sure that is ok. Usually, when opening a file for writing, you expect the initial open to fail if the path cannot be written to. I have no idea how much code relies on that, but that's a common expectation due to how filesystems usually work (e.g. when accessing local storage).
Previously, we would fail when opening the output stream if the bucket doesn't exist. We inferred that from the response to the CreateMultipartUpload request, which we no longer send upon opening the stream. We now fail at closing, or at writing (once >5 MB have accumulated), instead.
Hmm, I'm not sure that is ok. Usually, when opening a file for writing, you expect the initial open to fail if the path cannot be written to. I have no idea how much code relies on that, but that's a common expectation due to how filesystems usually work (e.g. when accessing local storage).
This isn’t guaranteed with the current implementation though? Putting a part, or completing a multipart upload, can fail in various ways? An obvious one would be a checksum failure.
My point is that if the path cannot be written to, the error happens when opening the file, not later on.
My point is that if the path cannot be written to, the error happens when opening the file, not later on.
That is true. I guess the question is whether Arrow's OutputStream API makes an explicit guarantee that Open should throw if the target does not exist. My guess would be that you shouldn't build code on this assumption if it isn't explicitly stated in Arrow's API/docs (which it is not), but of course real-world usage deviates from that (Hyrum's Law).
But checking whether the bucket exists would come with at least another 1x RTT to S3, and the goal of this PR was to reduce the number of blocking calls to S3 to reduce overall latency. If we add another check here, we'll have a total of 2x RTT to S3 for small uploads, which is better than the initial 3x RTT without this change, but still not optimal from a performance point of view (and we would probably have 4x RTT for multipart uploads).
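For anyone counting along, the round trips discussed above roughly correspond to the following S3 requests (assuming one request per round trip):
- multipart upload of a small object: CreateMultipartUpload + UploadPart + CompleteMultipartUpload = 3 requests (the previous behavior);
- single PutObject at close: 1 request (this PR, for small writes);
- an up-front existence check such as HeadBucket adds one more request to either path (2 for small writes, around 4 for multipart uploads).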
That is true. I guess the question is whether Arrow's OutputStream API makes an explicit guarantee that Open should throw if the target does not exist. My guess would be that you shouldn't build code on this assumption if it isn't explicitly stated in Arrow's API/docs (which it is not), but of course real-world usage deviates from that (Hyrum's Law).
The API docs generally do not go into that level of detail. However, it is a general assumption that a filesystem "looks like" a local filesystem API-wise.
It is also much more convenient to get an error early, than after you have already "written" multiple megabytes of data to the file.
A compromise would be to add a dedicated option in S3Options, but of course the optimization would only benefit those users that enable the option.
A compromise would be to add a dedicated option in S3Options, but of course the optimization would only benefit those users that enable the option.
We can do that. I would propose that if the optimization is disabled, we directly use multipart uploads (basically replicating the old behavior). I don't think it makes sense to explicitly issue a HeadBucket request, because that would lead to a minimum of 4 requests with multipart uploads (although we would only have 2 requests for small writes without the optimization, compared to current main).
What do you think?
We can do that. I would propose that if the optimization is disabled, we directly use multipart uploads (basically replicating the old behavior).
That sounds reasonable to me.
Just to note, issuing HeadBucket doesn't guarantee that a write will succeed - there isn't really a good way to check without actually writing. A HeadObject on the key and failing on any 403 is probably ok though? However, there are valid cases where you'd want to write to a key that your principal is not able to read from. HeadBucket also requires full s3:ListBucket permissions; policies that restrict listing to specific prefixes would need to be updated.
I think an optimization flag is appropriate as the behaviour is technically changing. Does it make sense to make the flag a somewhat generic one, rather than specific to this case?
There are a few other optimizations that might also fall into the "more performant, slightly different semantics" category. If I was to contribute a PR to improve one of the linked areas, would we want to add a new specific flag for this case or bundle it under a single "optimized" flag?
The upside would be that it becomes more configurable, whereas the downside is that the testing and support matrix explodes. Perhaps it's better to just have a single optimized=True flag, vs receiving bug reports when specifically optimize_put_object=True, optimize_delete_dir=False, optimize_move=True, optimize_delete=False, optimize_ensure_parents_exist=True, optimize_foobar=True are set?
Edit: I guess this is only relevant for the higher-level Python bindings; we'd still want internal flags for individual features.
I added a sanitize_bucket_on_open_ flag to the S3Options, adjusted the logic and also instantiated tests with this flag enabled.
I guess the Python bindings can be tackled in a separate PR, right?
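For illustration, enabling the new behavior from user code could look roughly like the sketch below. This is a hypothetical example: it assumes the option is exposed on `S3Options` as a public field named `sanitize_bucket_on_open` (the exact name and default may differ in the merged change); the rest only uses existing Arrow filesystem APIs.

```cpp
#include <arrow/filesystem/s3fs.h>
#include <arrow/result.h>
#include <arrow/status.h>

// Assumes arrow::fs::InitializeS3(...) has already been called by the application.
arrow::Status WriteSmallObject() {
  arrow::fs::S3Options options = arrow::fs::S3Options::Defaults();
  // Hypothetical option from this PR; uncomment if/once it exists under this name.
  // options.sanitize_bucket_on_open = true;  // check the bucket when opening the stream

  ARROW_ASSIGN_OR_RAISE(auto fs, arrow::fs::S3FileSystem::Make(options));
  ARROW_ASSIGN_OR_RAISE(auto stream, fs->OpenOutputStream("my-bucket/small-object.bin"));
  ARROW_RETURN_NOT_OK(stream->Write("hello", 5));
  // With the optimization enabled, closing a small stream issues a single PutObject request.
  return stream->Close();
}
```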
A minor question: would the single-part upload only be sent during closing?
Yes, we only issue the PutObject request if we close the stream.
Sorry for delaying the review! I'll merge after other committers approve.
I also merged main into this branch due to a conflict. Should be free of conflicts now.
Hi @pitrou , I merged main again due to a conflict. Maybe you could give it another review if you have time.
Gentle ping @pitrou. This could go into 18.0.0 (the next release).
@github-actions crossbow submit -g cpp -g python
Revision: 57ff9921779451b3b92762e35711c9ca82b9f24d
Submitted crossbow builds: ursacomputing/crossbow @ actions-95140b88e7
@github-actions crossbow submit -g cpp -g python
Revision: 670232b77c711b0e124e8d2ad02c2c7f62d16a80
Submitted crossbow builds: ursacomputing/crossbow @ actions-86d9d85a0e
@OliLay thank you for your work on this ❤️
After merging your PR, Conbench analyzed the 0 benchmarking runs that have been run so far on merge-commit 9e6acbe08a0ff5b569763d377aecc824362dd593.
None of the specified runs were found on the Conbench server.
The full Conbench report has more details.