google-cloud-cpp

RPC timeouts and retry loops across services

Open coryan opened this issue 5 years ago • 13 comments

The google::cloud::bigtable::RPCRetryPolicy classes have a dedicated member function to set up a timeout:

https://github.com/googleapis/google-cloud-cpp/blob/81adbdabca2da794875fd567591fddc2cf0c8394/google/cloud/bigtable/rpc_retry_policy.h#L98-L101

https://github.com/googleapis/google-cloud-cpp/blob/81adbdabca2da794875fd567591fddc2cf0c8394/google/cloud/bigtable/rpc_retry_policy.cc#L53-L57
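For readers without the links handy, a rough sketch of the relevant interface (paraphrased from the linked headers, which remain authoritative):

#include <grpcpp/client_context.h>

// Paraphrased sketch of the linked interface, not a verbatim copy.
class RPCRetryPolicy {
 public:
  virtual ~RPCRetryPolicy() = default;
  // Give the policy a chance to configure the context, e.g. set a deadline.
  virtual void Setup(grpc::ClientContext& context) const = 0;
  // ... plus OnFailure(), clone(), and so on ...
};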

This is used in the synchronous retry loop, e.g.:

https://github.com/googleapis/google-cloud-cpp/blob/81adbdabca2da794875fd567591fddc2cf0c8394/google/cloud/bigtable/table.cc#L90-L95

But not in the asynchronous retry loop:

https://github.com/googleapis/google-cloud-cpp/blob/81adbdabca2da794875fd567591fddc2cf0c8394/google/cloud/internal/async_retry_unary_rpc.h#L145-L149
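Roughly, the difference between the two loops looks like this (paraphrased from the linked code, not a verbatim copy; stub.SomeRpc is a placeholder):

// Synchronous loop (paraphrased): the retry policy configures each attempt.
grpc::ClientContext client_context;
rpc_policy->Setup(client_context);  // e.g. sets the per-attempt deadline
grpc::Status status = stub.SomeRpc(&client_context, request, &response);

// Asynchronous loop (paraphrased): a fresh context is created without
// consulting the retry policy, so no per-attempt deadline is applied.
auto context = absl::make_unique<grpc::ClientContext>();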

That asynchronous retry loop is also used in spanner, and the RetryPolicy for spanner does not have a Setup() member function. Furthermore, the same retry policy class is used in storage, where we wouldn't want to introduce such a function, because that would create an unwanted dependency on gRPC.

We need a cleaner way to set up per-RPC timeouts across all services, with good support for REST, and we need to use it in both the synchronous and asynchronous retry loops.

coryan avatar Aug 17 '20 18:08 coryan

Carlos and I just chatted about this a bit offline. I'm going to try to summarize what I think we want/need here, and this will help expose my own misunderstandings.

  • We need to make sure our retry policies all properly set timeouts for both sync and async APIs. This should work for all services, whether they use gRPC or REST. As described in the comment above, Bigtable uses its own RPCRetryPolicy (not the common one), and Bigtable's policy does set timeouts for synchronous RPCs, but not for asynchronous ones. Ideally, this would be handled by the common retry policy and Bigtable would not need to provide its own.

  • It's difficult to unit test with any retry policy because they sleep. It would be nice to provide a hook where tests can mock or fake out the sleeping so that unit tests run faster and are less flaky (see the sketch after this list). This should work for both sync and async retries.

  • It may not be possible to make the above changes without breaking the retry policy API, which is exposed publicly to users (spanner example). While we want to avoid API breakages, it may be necessary in this case.

  • Any change to how our retries work and how they are designed needs a design doc.
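On the second bullet, a minimal sketch of one way to make the sleeping injectable (all names here are hypothetical, not existing API):

#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical: the retry loop calls an injectable sleeper instead of
// invoking std::this_thread::sleep_for() directly.
using Sleeper = std::function<void(std::chrono::milliseconds)>;

// Production code sleeps for real.
Sleeper RealSleeper() {
  return [](std::chrono::milliseconds d) { std::this_thread::sleep_for(d); };
}

// Tests inject a fake that just records the requested delays, so
// retry-loop tests run fast and deterministically.
Sleeper FakeSleeper(std::vector<std::chrono::milliseconds>& log) {
  return [&log](std::chrono::milliseconds d) { log.push_back(d); };
}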

How urgent is this? It's not work-on-the-weekends urgent, but it's pretty important. We currently have shipping products that don't have proper timeouts. We're about to ship Pub/Sub, which will have the same issues (because it uses the common retry policies), and it would be nice not to break Pub/Sub's API right after it GAs (let's try to make the change before then). So I think the urgency is that someone should own this bug, and it should probably be their top priority until it's finished.

devjgm avatar Aug 26 '20 14:08 devjgm

nit: the retry policies do not sleep today, the retry loop does, and the sleep time comes from the backoff policy. Your bigger point about this making retry loops hard to test stands.
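That is, the loop structure is roughly (paraphrased; the exact signature of OnCompletion() varies by library):

// Paraphrased: the retry *loop* sleeps; the backoff policy only
// computes how long the next delay should be.
auto delay = backoff_policy->OnCompletion();
std::this_thread::sleep_for(delay);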

coryan avatar Aug 26 '20 14:08 coryan

@devbww any update on this?

devjgm avatar Sep 21 '20 21:09 devjgm

any update on this?

No. Indeed, I am only starting to work on it directly today.

devbww avatar Sep 22 '20 14:09 devbww

Turns out #5959 might be directly related to this.

coryan avatar Mar 09 '21 21:03 coryan

FWIW, after #6014 the deadline is being set for asynchronous operations in the Cloud Bigtable client library, and the retry loops can handle any retry policy that has a Setup(grpc::ClientContext&) member function.
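A hedged sketch of how "any retry policy that has a Setup(grpc::ClientContext&) member function" can be detected at compile time (the repository's actual implementation may differ):

#include <grpcpp/client_context.h>
#include <type_traits>
#include <utility>

// Hedged sketch: detect at compile time whether a policy type provides
// Setup(grpc::ClientContext&); the real code in the repository may differ.
template <typename Policy, typename = void>
struct HasSetup : std::false_type {};

template <typename Policy>
struct HasSetup<Policy, std::void_t<decltype(std::declval<Policy&>().Setup(
                            std::declval<grpc::ClientContext&>()))>>
    : std::true_type {};

// The retry loop can then call policy.Setup(context) only when it exists.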

Note that the general problem of how to rationalize our retry and backoff policies remains open.

coryan avatar Mar 10 '21 13:03 coryan

(internal only) Brainstorm doc

We need to make sure our retry policies all properly set timeouts for both sync and async APIs.

One of the ideas in the above brainstorming document is not to use the retry policy for RPC deadlines. The amount of time you're willing to keep trying to start an RPC (the retry time limit) is distinct from the amount of time you're happy to let a single attempt run (the deadline).

Instead, a functor option could be added that lets the user configure the grpc::ClientContext (deadline, etc.) directly.
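A hedged sketch of that idea (the option name here is illustrative only):

#include <grpcpp/client_context.h>
#include <chrono>
#include <functional>

// Illustrative only: an option whose value is a callable that the library
// applies to each grpc::ClientContext before issuing the RPC.
struct SetupContextOption {
  using Type = std::function<void(grpc::ClientContext&)>;
};

// A user could then set a per-attempt deadline (or metadata, etc.):
auto setup = [](grpc::ClientContext& context) {
  context.set_deadline(std::chrono::system_clock::now() +
                       std::chrono::seconds(30));
};

The comments below note that #7529 added google::cloud::internal::GrpcSetupOption in this spirit.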

devbww avatar Nov 17 '21 20:11 devbww

Still need this. @dbolduc has some thoughts here too.

devjgm avatar May 12 '22 19:05 devjgm

As @devbww mentioned, we have decided to separate the policies from the grpc::ClientContext configuration. We now want to capture all of the functionality listed here via Options (which may be per-call / per-client / per-connection). At the moment, not all libraries can do this.

We are happy with the common options for retry/backoff/polling policies as they are. In #7529 we handled the discrepancy between bigtable policies and common policies by using the generic g::c::internal::GrpcSetupOption to act on the context (under the hood).

I think we will likely want to release that option in our public API, and we will definitely want to offer a

// lives in google_cloud_cpp_common. Can apply to REST and gRPC.
struct DeadlineOption {
  using Type = std::chrono::milliseconds;
};

... and a

// lives in google_cloud_cpp_common. Can apply to REST and gRPC.
struct MetadataOption {
  using Type = std::map<std::string, std::string>;
};

... for convenience. These options would take effect here for gRPC; I don't know where for REST yet.
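For illustration, per-call usage might look roughly like this (hypothetical; DeadlineOption and MetadataOption are the proposals above, not released API, and SomeRpc is a placeholder):

#include <chrono>
#include <map>
#include <string>
#include "google/cloud/options.h"

// Hypothetical usage of the proposed options on a single call.
auto options = google::cloud::Options{}
                   .set<DeadlineOption>(std::chrono::milliseconds(500))
                   .set<MetadataOption>({{"x-goog-custom-header", "value"}});
// client.SomeRpc(request, std::move(options));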

dbolduc avatar May 12 '22 19:05 dbolduc

We still want this, but not scheduled yet.

coryan avatar Sep 21 '22 18:09 coryan

Let's consider this for 2023/Q4.

coryan avatar Jun 14 '23 18:06 coryan

Included in the 2024 roadmap document.

coryan avatar Dec 06 '23 20:12 coryan

We still want to do this in order to increase feature parity with other Cloud SDK libraries.

scotthart avatar May 15 '24 19:05 scotthart