google-cloud-cpp icon indicating copy to clipboard operation
google-cloud-cpp copied to clipboard

Expose request parameters to Retry and Backoff policies

Open coryan opened this issue 4 years ago • 5 comments

Some applications want to tune (maybe dynamically?) their retry policies based on the bucket and/or object involved.

It may not be necessary to expose all the parameters in the request, maybe just the bucket and object name (if applicable).

Maybe we should accept the retry / backoff policies (or their prototypes) as parameters in each Client::Foo() function.

coryan avatar Aug 18 '21 18:08 coryan

Feedback increases the number of requested parameters to the following:

Bucket Path within a bucket The key Retry count (so far) Time taken for last attempt User account Preferably able to add application metadata (on per C++ client instance?). HTTP/gRPC return value (e.g. 429, 5xx)

jpinkerton1 avatar Aug 19 '21 18:08 jpinkerton1

Background

As described above, we have customers that need additional information to have finer grained control of their retry loops. They also wish to capture information from the retry loop for additional logging, troubleshooting, and maybe even monitoring.

The C++ client libraries use a class to control their retry loops. Application developers can provide their own class, or the existing classes with parameters other than the defaults. The interface for this class is fairly narrow with the main APIs in:

https://github.com/googleapis/google-cloud-cpp/blob/0b2f1e5f5d3c82747259ea7ba3c3b47b060fc7fc/google/cloud/internal/retry_policy.h#L64-L66

and

https://github.com/googleapis/google-cloud-cpp/blob/0b2f1e5f5d3c82747259ea7ba3c3b47b060fc7fc/google/cloud/internal/retry_policy.h#L93

Changing the signature of these functions in any way would be a breaking change. We generally avoid such changes, as they impose work on customers that just want to update the client library. Even adding a pure virtual function (one that has = 0) is a breaking change.

Note that a new instance of the RetryPolicy class is created at the beginning of each retry loop. That is, the clone() API provides the application a notification that the retry loop started.

It is also important to know that downloads aren't just retried, they are resumed. A large object download may be interrupted in the middle, in which case the client library will resume from the last successfully received byte. A new retry policy object is used to restart the download, that is, clone() is called again if the download is interrupted.

Overview (Usage)

We propose expanding the RetryPolicy API to receive a set of labels when a new retry loop starts. The application developers can use this to create custom retry policies that are more adaptive, for example:

namespace gcs = google::cloud::storage;
using gcs::RequestLabels = std::map<std::string, std::string>;

class MyRetryPolicy : public gcs::RetryPolicy {
  gcs::RequestLabels labels_;
  int attempts_ = 0;
 public:
   MyRetryPolicy(gcs::RequestLabels labels) : labels_(std::move(labels)) {}

   bool OnFailure(Status const& s) override {
     ++attempts_;
     if (IsPermanentFailure(s)) return false; // cannot retry permanent failures
     if (labels["bucket"] == "super-important-bucket") return attempts_ < 1000;
     return attempts_ < 5;
   }

  std::unique_ptr<gcs::RetryPolicy> labeled_clone(gcs::RequestLabels labels) override {
    return std::make_unique<MyRetryPolicy>(std::move(labels));
  }
  std::unique_ptr<gcs::RetryPolicy> clone() {
    return std::make_unique<MyRetryPolicy>({}); // empty label set
  }

 protected:
    void OnFailureImpl(google::cloud::Status const&) override {}
};

The application would provide this customized retry policy to the library using the existing mechanisms, and indicate that they want the additional labels provided to their retry policy. Because creating the labels consumes CPU cycles, this would be disabled by default, but can be enabled with an additional option:

void Application() {
  auto client = gcs::Client(google::cloud::Options{}
      .set<gcs::RetryPolicyOption>(MyRetryPolicy({}).clone())
      .set<gcs::EnableRequestLabels>());
  // use `client` as before ...
}

What would be in the labels?

The set of labels provided to each retry policy object would vary by request, but would include:

  • The name of the operation, e.g., storage.objects.get or storage.objects.insert or storage.buckets.list. We would always use the JSON name, even when the library was using gRPC under the hood.
    • storage.objects.get should include an additional label to indicate if it is a metadata get or a full download
    • storage.objects.insert should include an additional label to indicate if this a simple, JSON multipart, or resumable upload
    • resumable uploads should include a label to indicate what offset are we inserting at.
  • For operations acting on objects, the object name and the object generation (if provided)
  • For operations acting on ACLs, the entity
  • In general, the most important attributes of the request, but not every single detail.
  • The full payload for the request will NOT be included as a label.

Application Defined Labels

The application can add additional labels in their policies:

class MyRetryPolicy : public gcs::RetryPolicy {
  gcs::RequestLabels labels_;
  gcs::RequestLabels custom_;
  int attempts_ = 0;
 public:
   MyRetryPolicy(gcs::RequestLabels labels, gcs::RequestLabels custom) : labels_(std::move(labels)), custom_(std::move(labels)) {}

  std::unique_ptr<gcs::RetryPolicy> labeled_clone(gcs::RequestLabels labels) override {
    return std::make_unique<MyRetryPolicy>(std::move(labels), custom_); // also copy custom labels
  }
  std::unique_ptr<gcs::RetryPolicy> clone() {
    return std::make_unique<MyRetryPolicy>({}, custom_); // empty label set
  }

 protected:
    void OnFailureImpl(google::cloud::Status const&) override {}
};

Comparison against "requirements"

These are either included in this proposal, or trivial to compute:

  • Bucket: available in the labels.
  • Path within a bucket: available in the labels.
  • Retry count (so far): this is trivial to compute because each retry policy object is created when the retry loop starts.
  • Time taken for last attempt: this is trivial to compute comparing to the previous OnFailure() call.
  • HTTP/gRPC return value (e.g. 429, 5xx): not available, only the google::cloud::StatusCode is available and we have no plans to expose HTTP status codes or gRPC status codes (which are in different namespaces!) to the application.
  • Preferably able to add application metadata: the application can add these to their retry policy prototype.

These are not included, or make no sense:

  • The key: I do not know what this is.
  • User account: this information is not available to the client library, so not available.

Overview (Implementation)

We propose adding a new virtual function to RetryPolicy which clones with additional labels. To avoid complications around overloads and virtual functions, this will have a new name.

namespace google::cloud::internal {
class TraitsBasedRetryPolicy {
 public:
   // ... ... ... as before ... ... ...
   virtual std::unique_ptr<TraitsBasedRetryPolicy> clone_labels(RequestLabels /*unused*/) {
      return clone(); // ignore the labels by default
  }
};
}

This is not a breaking change because the new virtual function has default implementation, and the existing APIs are unchanged.

We will also need to change about 50 classes in the storage library to return their labels, for example:

namespace google::cloud::storage::internal {
class ObjectPatchRequest : ... {
 public:
   virtual RequestLabels labels() override {
      RequestLabels labels{{"bucket", bucket_name()}, {"object", object_name()}};
      if (HasOption<Generation>()) labels["generation"] = std::to_string(GetOption<Generation>());
      return labels;
  }
};
}

Finally, we will need to change the retry loop to call clone_labels() instead of clone(), and to fetch the labels to the request if gcs::EnableRequestLabels is set.

coryan avatar Aug 20 '21 16:08 coryan

That's a good proposal. I'd be happy if we do that, but I want to write down one small alternative to the labeled_clone() part of your proposal that we could consider and even dismiss...

Rather than adding a std::unique_ptr<gcs::RetryPolicy> labeled_clone(gcs::RequestLabels labels) override method to the gcs::RetryPolicy interface and subclasses, we could move away from "clone" functions and use factory functions instead. This may look like

struct RetyPolicyOption {
  using Type = std::function<gcs::RetryPolicy()>;
};
struct RetryPolicyLabeledOption {
  using Type = std::function<gcs::RetryPolicy(gcs::RequestLabels);
};
...
void Application() {
  auto factory = [](gcs::RequestLabels x) { return MyRetryPolicy(std::move(x)); };
  auto client = gcs::Client(google::cloud::Options{}
      .set<gcs::RetryPolicyLabeledOption>(std::move(factory)));
  // use `client` as before ...
}

This would let us avoid adding a new method to a policy interface (labeled_clone()), instead relying on a user-provided factory function to create the retry policy of their choice. We could also avoid adding the EnableRequestLabels tag-like option, and instead rely on type-safe RetryPolicyOption or RetryPolicyLabeledOption classes. Note: A user passing EnableRequestLabels with a policy instance that doesn't properly override labeled_clone() could result in surprising runtime behavior because it'll all Just Work but maybe not as the user expected.

A downside of this approach is that it requires a bit more internal refactoring on our side to change our code to use retry policy factory functions rather than prototype clone methods.

devjgm avatar Aug 20 '21 17:08 devjgm

Still on the roadmap. Maybe easier to implement via the per-call options.

coryan avatar Apr 28 '22 18:04 coryan

Probably something for me in Q4

coryan avatar Sep 14 '22 18:09 coryan

I think the problems that this feature request wanted to solve are better addressed by the per-operation parameters in storage::Client. Applications can create any retry or backoff policy they want before making a request. And since the application knows exactly what parameters will use for the request, they can create the retry policy that goes along with those parameters.

coryan avatar Oct 05 '22 15:10 coryan