
Extend reconnect to support a backoff strategy for connection errors

hawkw opened this issue 7 years ago · 8 comments

Currently, the tower-reconnect middleware tries to reconnect immediately if an error occurs while connected, but propagates all errors if an error occurs while connecting. We should modify this middleware to do the following (a rough sketch of the intended control flow follows the list):

  • [ ] take a stream of durations to use as backoffs,
  • [ ] when an error occurs while connecting, try to reconnect with a wait time between attempts taken from each backoff value
  • [ ] only propagate the error if all the backoffs are exhausted
  • [ ] reset the backoff stream when a connection has been successfully established
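
For concreteness, here is a rough, synchronous sketch of that control flow. The `connect_with_backoff` name, the closure-based connector, and the blocking sleep are stand-ins for illustration only, not tower's actual (future-based) API:

```rust
use std::time::Duration;

// Hypothetical sketch of the connect loop described above: wait between
// attempts using the backoff values, and only propagate the error once the
// backoffs are exhausted.
fn connect_with_backoff<C, T, E, B>(mut connect: C, backoff: B) -> Result<T, E>
where
    C: FnMut() -> Result<T, E>,
    B: IntoIterator<Item = Duration>,
{
    let mut delays = backoff.into_iter();
    loop {
        match connect() {
            // On success the caller would start over with a fresh backoff
            // iterator, which is the "reset on success" behaviour above.
            Ok(conn) => return Ok(conn),
            Err(err) => match delays.next() {
                // Wait for the next backoff value, then try again.
                Some(delay) => std::thread::sleep(delay),
                // All backoffs exhausted: propagate the last error.
                None => return Err(err),
            },
        }
    }
}
```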

hawkw · Mar 01 '18

I'm happy to work on implementing this, but I suspect @carllerche will have some opinions on how we ought to do it.

hawkw · Mar 01 '18

I think that this is a great idea.

My original thought was to not do backoff in the reconnect middleware, but instead keep backoff as part of Retry (which doesn't exist yet), so that Retry<Reconnect<...>> would provide the backoff-reconnect behavior.

However, after thinking some more about it, I don't think that this is necessarily ideal, because Retry will incur some additional overhead in order to keep a handle to the original request and perform the retry. This overhead probably isn't necessary in the reconnect-with-backoff case.

Given this, implementing the logic directly in tower-reconnect is probably the right first (and maybe final) step. Once we tackle Retry we can then think about what, if anything, can be shared between the two.

@hawkw if you want to try to take a stab at this, go for it. The one nit with your original list is:

take a stream of durations to use as backoffs

This probably won't be a stream in the futures sense. Instead, the type (Backoff?) will probably implement IntoIterator such that the Item = Duration.
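
For illustration, a minimal sketch of a type with that shape; `FixedBackoff` and its fixed schedule are hypothetical names, not an existing tower type:

```rust
use std::time::Duration;

// Hypothetical backoff with a fixed schedule of wait durations.
#[derive(Clone)]
pub struct FixedBackoff {
    delays: Vec<Duration>,
}

impl FixedBackoff {
    pub fn new(delays: Vec<Duration>) -> Self {
        FixedBackoff { delays }
    }
}

impl IntoIterator for FixedBackoff {
    type Item = Duration;
    type IntoIter = std::vec::IntoIter<Duration>;

    fn into_iter(self) -> Self::IntoIter {
        self.delays.into_iter()
    }
}

// The middleware could hold a clone of the backoff and call `into_iter()`
// again after each successful connection, giving the "reset the backoff
// stream on success" behaviour from the checklist.
```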

Maybe @olix0r or @danburkert have additional thoughts on the matter.

This relates a little to https://github.com/tower-rs/tower/issues/14.

carllerche · Mar 01 '18

My original thought was to not do backoff in the reconnect middleware, but instead keep backoff as part of Retry (which doesn't exist yet), so that Retry<Reconnect<...>> would provide the backoff-reconnect behavior.

However, after thinking some more about it, I don't think that this is necessarily ideal, because Retry will incur some additional overhead in order to keep a handle to the original request and perform the retry. This overhead probably isn't necessary in the reconnect-with-backoff case.

My thought is that we provide a "unit" or "empty" backoff type, and a Retry with the empty/unit backoffs would just do the current Reconnect behaviour of immediately failing the request on connect errors, because the backoffs are always exhausted.
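
A minimal sketch of that idea, assuming the `IntoIterator<Item = Duration>` shape discussed above (the `EmptyBackoff` name is hypothetical):

```rust
use std::iter;
use std::time::Duration;

// Hypothetical "empty" backoff: its iterator yields no durations, so the
// very first connect error exhausts the backoffs and is propagated
// immediately, matching the current behaviour of failing right away.
#[derive(Clone, Copy, Debug, Default)]
pub struct EmptyBackoff;

impl IntoIterator for EmptyBackoff {
    type Item = Duration;
    type IntoIter = iter::Empty<Duration>;

    fn into_iter(self) -> Self::IntoIter {
        iter::empty()
    }
}
```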

Instead, the type (Backoff?) will probably implement IntoIterator such that the Item = Duration.

Yeah, that seems right.

hawkw · Mar 01 '18

A question that came up while working on this is: what timer should be used for the backoffs?

In Conduit, I've been working on an abstraction over timer implementations (runconduit/conduit#480) that allows a mock timer to be injected for testing, and we'd expect that if Conduit is configured to use a mock timer, backoffs also wait on the mock timer rather than the default one. Furthermore, we'd ideally want users who are already using tokio-timer to be able to use that timer for backoffs, without requiring a tokio-timer dependency. This implies to me that we might want to move the timer facade work I've been doing from Conduit into tower. Does that seem reasonable?
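
For a sense of what such a facade might look like, a purely illustrative sketch; the trait and its shape are hypothetical, not the actual Conduit or tower code:

```rust
use std::time::Duration;

// Hypothetical timer facade: middleware generic over `T: Timer` can have a
// mock clock injected in tests, while production code plugs in a
// tokio-timer-backed implementation.
pub trait Timer {
    /// A future (or other handle) that completes after the requested delay.
    type Sleep;

    fn sleep(&self, duration: Duration) -> Self::Sleep;
}
```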

hawkw · Mar 02 '18

I would prefer to not introduce a Timer trait yet. Traits come with ergonomic overhead. The next iteration of tokio-timer can handle the requirements of being able to "mock" out timers.

carllerche · Mar 02 '18

I do have some thoughts on retry strategies. I think the strategy @hawkw outlined in the first comment is valid, but it does have the downside that every error becomes a timeout error. This can be mitigated with some careful error management, but it's pretty tricky to do in an ergonomic way and without throwing away the intermediate errors.

The retry strategy also needs to be designed with speculative / hedged connections in mind. E.g. if the service has three replicas which you can choose from, you may not want to spend your full timeout attempting to connect to one of them. Perhaps this is better solved at a higher-level, though.

danburkert · Mar 03 '18

@danburkert I think that request-level retry strategies should be handled at a higher level. Reconnect should really only be addressing the layer 4 concern of establishing a transport. I'd expect this to be wrapped with a connection pool, the connection pool to be wrapped with a balancer, and then retries to wrap the balancer. I'm in the process of writing up some general plans for https://github.com/runconduit/conduit/issues/475 -- would love your input once that's up.
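
To make the layering concrete, a purely illustrative sketch; only Reconnect exists in tower today, and the other names are placeholders for middleware that is still being planned:

```rust
// Placeholder marker types standing in for the middleware described above.
struct Reconnect<T>(T);
struct Pool<S>(S);
struct Balance<S>(S);
struct Retry<S>(S);

// Reading inside out: Reconnect handles re-establishing the layer 4
// transport, the connection pool and balancer sit above it, and
// request-level retries wrap the balancer rather than living in Reconnect.
type Stack<T> = Retry<Balance<Pool<Reconnect<T>>>>;
```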

olix0r · Mar 03 '18

Ah ok, seems I misunderstood the requirements here. If you are reconnecting just the layer 4 transport (say, TCP), how do you handle application-level negotiations that need to take place, like TLS, SASL, and custom handshakes?

In general, doing anything more complex than layer 4 means that the reconnect logic needs to be able to distinguish between retriable and non-retriable errors.

danburkert · Mar 03 '18