httr icon indicating copy to clipboard operation
httr copied to clipboard

Implement optional "circuit-breaking" for RETRY()

Open atheriel opened this issue 4 years ago • 0 comments

Retrying requests is a good way to make code robust against transient failures in the upstream API. However, even with exponential backoff, it still faces a fundamental challenge when the service at the other end is truly unavailable for more than a few seconds. In those cases, retrying solves nothing and actually creates two more problems:

  1. Users have to wait an artificially long time for their requests to fail; and
  2. We generate additional load for an already-struggling service.

(For public APIs, the latter might not be our problem, but it's not responsible behaviour, either.)

One can also end up in a "cascading failure" scenario, where a buildup of retries will DDoS the underlying service and prevent it from recovering, triggering retries and more latency further and further downstream until the whole system is unresponsive. (Which was the inspiration for this PR.)

The generally-accepted way to avoid these problems is by implementing the so-called "circuit breaker pattern" (popularised by Michael Nygard, Martin Fowler, and Neflix's Hystrix library): once a given failure threshold has been met, a circuit is "tripped" and additional calls will fail immediately for some period of time.

A circuit breaker solves both problems above: clients calling RETRY() suddenly become responsive again (since they no longer wait at all), and we avoid adding further upstream load.

This PR implements this pattern by tracking failures at the handle level, which is both a natural choice and also highly convenient given httr's internals. Failures are currently defined as curl-level errors, HTTP 5xx status codes, and also HTTP 408, which is explicitly for client-side request timeouts. Once the circuit is tripped, all subsequent calls to RETRY() issue an error until a timeout has elapsed. Meanwhile, any successful response will reset the failure counter.

At present it's very hard to implement circuit-breaking around RETRY(), because the function doesn't expose information about retried requests (or whether they occurred at all). Moreover, RETRY() seems explicitly designed to help R users do the right thing by implementing backoff (correctly) for them. This makes me think that a feature like this belongs in-tree as well.

Outstanding questions:

  • Should the error use a custom class?

atheriel avatar Sep 04 '20 20:09 atheriel