
Retry on HTTP remote cache fetch failure

Open aherrmann opened this issue 2 years ago • 10 comments

Bazel's previous behavior was to rebuild an artifact locally if fetching it from an HTTP remote cache failed. This differs from the gRPC remote cache case, where Bazel will retry the fetch.

The lack of retry is an issue for two reasons: on one hand, rebuilding locally can be slower than fetching from the remote cache; on the other hand, if a build action is not bit-reproducible, as is the case with some compilers, the local rebuild will trigger cache misses on further build actions that depend on the current artifact.

This change aims to avoid these issues by retrying the fetch in the HTTP cache case, similar to how the gRPC cache client does it.

Some care needs to be taken due to the design of Bazel's internal remote cache client API. For a fetch, the client is given an OutputStream object that it is expected to write the fetched data to. This may be a temporary file on disk that will be moved to the final location after the fetch has completed. On retry, we need to be careful not to duplicate previously written data when writing into this OutputStream. Due to the generality of the OutputStream interface, we cannot reset the file handle or write pointer to start fresh. Instead, this change follows the same pattern used in the gRPC cache client: keep track of the data previously written and continue from that offset on retry.
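
The offset-tracking pattern described above can be sketched as follows. This is a simplified, hypothetical illustration, not code from the PR (the names `CountingOutputStream`, `flakyFetch`, and `download` are made up): a counting wrapper records how many bytes actually reached the caller-supplied stream, and each retry resumes from that count instead of rewriting the prefix.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class OffsetRetrySketch {
    /** Counts bytes that actually reached the underlying stream. */
    static final class CountingOutputStream extends FilterOutputStream {
        long count;
        CountingOutputStream(OutputStream out) { super(out); }
        @Override public void write(int b) throws IOException {
            out.write(b);
            count++;
        }
    }

    /** Simulated flaky fetch: streams blob[offset..] but dies mid-transfer when fail is set. */
    static void flakyFetch(byte[] blob, long offset, OutputStream out, boolean fail)
            throws IOException {
        int stop = fail ? blob.length / 2 : blob.length;
        for (int i = (int) offset; i < stop; i++) {
            out.write(blob[i]);
        }
        if (fail) throw new IOException("connection reset");
    }

    /** Retry loop: resume from the number of bytes already committed downstream. */
    static byte[] download(byte[] blob, int maxAttempts) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        CountingOutputStream counting = new CountingOutputStream(sink);
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                // Only the first attempt fails in this simulation.
                flakyFetch(blob, counting.count, counting, attempt == 0);
                return sink.toByteArray();
            } catch (IOException e) {
                last = e; // keep counting.count; the next attempt continues from that offset
            }
        }
        throw last; // assumes maxAttempts >= 1
    }
}
```

Because the prefix already written to the stream is never touched again, the caller's stream contains each byte exactly once even across retries.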

With this change, the HTTP cache client will attempt to fetch the data from the remote cache via an HTTP range request, so that the server only needs to send the data that is still missing. If the server replies with a 206 Partial Content response, we write the received data directly into the output stream; if the server does not support range requests and instead replies with the full data, we drop the duplicate prefix and only write into the output stream from the required offset.
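
The two response cases can be sketched as below with a hypothetical helper (not the PR's actual code) that decides, based on the status code, whether the body already starts at the resume offset or still contains the duplicate prefix:

```java
import java.io.IOException;
import java.io.OutputStream;

public class RangeResponseSketch {
    /**
     * Writes the body of a resumed fetch into out, given that `offset` bytes
     * were already written before the retry.
     */
    static void writeResumedBody(int status, byte[] body, long offset, OutputStream out)
            throws IOException {
        if (status == 206) {
            // 206 Partial Content: the server honored the Range header,
            // so the body already starts at the resume offset.
            out.write(body);
        } else if (status == 200) {
            // 200 OK: the server ignored the Range header and sent the full
            // blob; skip the duplicate prefix we already have.
            out.write(body, (int) offset, body.length - (int) offset);
        } else {
            throw new IOException("unexpected status " + status);
        }
    }
}
```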

This patch has been running successfully in production here.

cc @cocreature

TODO

  • [ ] Testing: this PR does not include any tests yet. If anyone has any pointers on how to best test this in the Bazel code base please let me know. One option I briefly explored was to add a test case to src/test/java/com/google/devtools/build/lib/remote/http/HttpCacheClientTest.java. But that seemed to require implementing support for HTTP range requests in src/tools/remote/src/main/java/com/google/devtools/build/remote/worker/http/HttpCacheServerHandler.java. Perhaps there is a better way to test this change.

aherrmann avatar Nov 11 '21 11:11 aherrmann

Do you know the reason for the fetch failures / closed connections? Is there high packet loss due to network congestion? An overloaded server? Something else?

ulrfa avatar Nov 15 '21 07:11 ulrfa

Do you know the reason for the fetch failures / closed connections? Is there high packet loss due to network congestion? An overloaded server? Something else?

Unfortunately, I don't know what exactly caused these issues. Congestion may have been a factor in some instances, but even with a low degree of parallelism to restrict the number of parallel fetches, we saw sporadic failures. For a bit more context: the remote cache is on GCS with a CDN. Fetches happen from CI nodes in GCP as well as from developer machines located all over the world.

aherrmann avatar Feb 09 '22 14:02 aherrmann

Hello @aherrmann, the above PR has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you!

sgowroji avatar Jul 26 '22 06:07 sgowroji

Hi @sgowroji,

is there anything we should do / change with this PR to get it merged?

avdv avatar Jul 26 '22 09:07 avdv

Hello @avdv, it's been stale for a long time. Could you update the PR and send it for code review? Thank you!

sgowroji avatar Jul 26 '22 10:07 sgowroji

@brentleyjones @coeuvre Would one of you be available to review or suggest a reviewer? It would be really unfortunate if this effort stalled.

fmeum avatar Jul 26 '22 10:07 fmeum

@coeuvre We spoke at BazelCon, I mentioned this PR to you, and you asked me to ping you on it. What would you say is missing and what's the best way to get this merged?

aherrmann avatar Nov 28 '22 10:11 aherrmann

It seems that this PR tries to do two things:

1. Retry the download/upload if certain errors happen.

2. Upon retry of a download, continue the previous download at a certain offset.

IMO, it's risky to combine these two within one PR. Can you split 2) into another PR?

Correct, it does these two things. Unfortunately, it doesn't seem to be possible to separate the two without much larger changes to the Bazel code-base as described in the PR description above:

Some care needs to be taken due to the design of Bazel's internal remote cache client API. For a fetch, the client is given an OutputStream object that it is expected to write the fetched data to. This may be a temporary file on disk that will be moved to the final location after the fetch has completed. On retry, we need to be careful not to duplicate previously written data when writing into this OutputStream. Due to the generality of the OutputStream interface, we cannot reset the file handle or write pointer to start fresh. Instead, this change follows the same pattern used in the gRPC cache client: keep track of the data previously written and continue from that offset on retry.

Please let me know if you see a good way to separate the two concerns.


Also, can you please add some tests for 1)?

Without being able to separate 1) and 2) (as described above), could you give some pointers toward this question from the original PR message:

If anyone has any pointers on how to best test this in the Bazel code base please let me know. One option I briefly explored was to add a test case to src/test/java/com/google/devtools/build/lib/remote/http/HttpCacheClientTest.java. But that seemed to require implementing support for HTTP range requests in src/tools/remote/src/main/java/com/google/devtools/build/remote/worker/http/HttpCacheServerHandler.java. Perhaps there is a better way to test this change.

aherrmann avatar Dec 14 '22 15:12 aherrmann

Unfortunately, it doesn't seem to be possible to separate the two without much larger changes to the Bazel code-base as described in the PR description above:

Sorry, I missed that part. Yes, you are right. Then I am ok to merge them at once if we have enough test coverage.

Without being able to separate 1) and 2) (as described above), could you give some pointers toward this question from the original PR message:

I think the best place for the tests are in src/test/java/com/google/devtools/build/lib/remote/http/HttpCacheClientTest.java. It's not necessary to implement range requests for HttpCacheServerHandler. Using mocks is fine as long as we are able to cover the edge cases added in the client code.

coeuvre avatar Dec 14 '22 16:12 coeuvre

Instead, this change follows the same pattern used in the gRPC cache client: keep track of the data previously written and continue from that offset on retry.

The gRPC cache client seems to do that for CAS downloads, but not when retrieving the AC, right? And this PR seems to do it for the AC as well, right?

I think it is safe to do for the CAS, where the key represents specific content, but a retried download of an AC key could result in a different ActionResult, and concatenating chunks from different ActionResults is not reliable. Do you have ideas about how to address that?

ulrfa avatar Dec 15 '22 08:12 ulrfa

I think it is safe to do for the CAS, where the key represents specific content, but a retried download of an AC key could result in a different ActionResult, and concatenating chunks from different ActionResults is not reliable. Do you have ideas about how to address that?

@ulrfa I think that's possible. The blob download takes an external output stream, so that's where we don't have control and need to continue from the offset. But the AC download controls the output stream, so if we perform the retry at a higher level, we'd be able to do a fresh download.
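
The higher-level AC retry could look roughly like this sketch (`Fetcher` and `downloadActionResult` are hypothetical names, not Bazel's API): each attempt writes into a fresh buffer it owns, so partial data from a failed attempt is never concatenated with bytes from a possibly different ActionResult.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class AcRetrySketch {
    /** A fetch attempt that either returns the full serialized ActionResult or fails. */
    interface Fetcher {
        byte[] fetch() throws IOException;
    }

    /** Retries from scratch: a new buffer per attempt, partial data is discarded. */
    static byte[] downloadActionResult(Fetcher fetcher, int maxAttempts) throws IOException {
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            ByteArrayOutputStream fresh = new ByteArrayOutputStream(); // fresh buffer
            try {
                fresh.write(fetcher.fetch());
                return fresh.toByteArray(); // only a fully fetched result is used
            } catch (IOException e) {
                last = e; // drop the buffer; the next attempt starts over
            }
        }
        throw last; // assumes maxAttempts >= 1
    }
}
```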

aherrmann avatar Jan 10 '23 09:01 aherrmann

@coeuvre @ulrfa @avdv Thank you for the review!

I’ve changed the retry logic such that retries of ActionResult fetches start a new download from scratch and write into a fresh buffer, whereas CAS downloads skip the previously downloaded prefix and write to the existing output stream from the offset.

I’ve removed RANGE requests. As pointed out above, the implementation would not have handled compressed streams correctly. RANGE requests could still be introduced later on; they can be seen as an optimization over this PR to avoid sending redundant bytes over the wire, and as an improvement in that fewer bytes to send should imply a lower likelihood of repeated intermittent failure.

I’ve factored the logic deciding whether to retry out into RETRIABLE_HTTP_ERRORS and changed it to only retry on certain HTTP error codes instead of on all HttpExceptions.
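
As a rough illustration, such a predicate might look like the following; the status codes listed here are illustrative guesses, not necessarily the exact set used in the PR:

```java
import java.util.Set;
import java.util.function.IntPredicate;

public class RetriablePredicateSketch {
    // Retry only on codes that plausibly indicate a transient condition
    // (timeouts, throttling, server-side failures); never on client errors
    // such as 404, where a retry cannot succeed.
    static final IntPredicate RETRIABLE_HTTP_ERRORS =
        status -> Set.of(408, 429, 500, 502, 503, 504).contains(status);
}
```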

I’ve added tests for the retry mechanism on HttpDownloadHandler and HttpCacheClient.

I’ve also tested the changes against a test-setup and observed retries occurring as expected.

aherrmann avatar Jan 13 '23 13:01 aherrmann

Thanks for the update @aherrmann! I think the logic is fine, and the test cases are good! Now I only have the question about the offset field, but that is a minor implementation detail.

ulrfa avatar Jan 16 '23 15:01 ulrfa

@coeuvre I see this is labeled as awaiting-PR-merge but not yet merged. Is it stuck on anything internally? Anything I can help with to get this merged?

aherrmann avatar Feb 09 '23 12:02 aherrmann

Hi @aherrmann, the merge of the above PR is in progress; @coeuvre is merging it.

sgowroji avatar Feb 10 '23 08:02 sgowroji

The changes in this PR have been included in Bazel 6.4.0 RC1. Please test out the release candidate and report any issues as soon as possible. If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=last_rc. Thanks!

iancha1992 avatar Sep 21 '23 21:09 iancha1992