
Connection timeouts

ThomWright opened this issue 2 years ago • 6 comments

Bug description

Symptoms

  1. gRPC requests take far too long to time out when Kubernetes network policies are misconfigured.
  2. The connection_timeout_is_not_fatal test takes ~75 seconds to finish.

Well-configured timeouts are important for system stability. Requests which take too long can hog resources and block other work from happening.

Causes

I can see two separate timeout problems:

  1. DNS resolution – when ResolutionStrategy::Lazy is used, there is currently no way to apply a timeout just for DNS resolution. If DNS never resolves, requests never complete.
  2. TCP connection – there is currently no way to set a connection timeout. On my machine, the socket seems to time out after ~75s. Even when we do set a connection timeout, tonic doesn't use it!

Even though we're setting our own fairly short timeouts around the overall request, I've seen some strange behaviour where requests are hanging for a long time. I think there's still something else going on that I don't understand, but I expect addressing the two points above will be generally helpful anyway.
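To illustrate the kind of overall request timeout mentioned above, here is a stdlib-only sketch (not ginepro or tonic code; the `with_deadline` helper is hypothetical): a blocking operation runs on a worker thread and the caller waits at most `deadline` via `mpsc::Receiver::recv_timeout`, so a hung operation can no longer hog the caller.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run a blocking operation on a worker thread and give up after `deadline`.
/// (Hypothetical helper for illustration only.)
fn with_deadline<T: Send + 'static>(
    deadline: Duration,
    op: impl FnOnce() -> T + Send + 'static,
) -> Result<T, mpsc::RecvTimeoutError> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Ignore send errors: the receiver may have timed out and been dropped.
        let _ = tx.send(op());
    });
    rx.recv_timeout(deadline)
}

fn main() {
    // A fast operation completes within the deadline.
    let ok = with_deadline(Duration::from_secs(1), || 42);
    assert_eq!(ok, Ok(42));

    // A slow operation is cut off after 50ms instead of hanging indefinitely.
    let slow = with_deadline(Duration::from_millis(50), || {
        thread::sleep(Duration::from_millis(500));
        0
    });
    assert!(slow.is_err());
}
```

As the issue notes, this kind of blanket timeout is a blunt instrument: it works, but it papers over the missing per-stage (DNS, TCP connect) timeouts described above.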

To Reproduce

For the TCP connection timeout, just run the tests. I'll supply a test for lazy DNS resolution timeouts in a separate PR.

Expected behavior

Ability to control timeouts for TCP connections and DNS resolution.

Environment

  • OS: macOS
  • Rust version: rustc 1.65.0 (897e37553 2022-11-02)

Additional context

Solutions

The TCP connection timeout is simpler to solve (though I will admit it took me a long time to find): we just need to set connect_timeout in the right places. First, tonic doesn't respect connect_timeout, which will be fixed by hyperium/tonic#1215. When that is merged, we can create our own connect_timeout option on top of it in #38.
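To see what a bounded connect buys us, here is a stdlib-only sketch (deliberately not tonic or ginepro code) using `std::net::TcpStream::connect_timeout`. A SYN to a non-routable address is silently dropped, so without an explicit bound the connect would block until the OS default fires (~75s on my machine); with one, it fails fast:

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::{Duration, Instant};

fn main() {
    // 10.255.255.1 is conventionally non-routable: SYN packets are dropped,
    // so connecting blocks until some timeout fires.
    let addr: SocketAddr = "10.255.255.1:80".parse().unwrap();

    let start = Instant::now();
    // Bound the connect at 200ms instead of the ~75s OS default.
    let result = TcpStream::connect_timeout(&addr, Duration::from_millis(200));
    let elapsed = start.elapsed();

    assert!(result.is_err());
    assert!(elapsed < Duration::from_secs(5));
}
```

tonic's `Endpoint` exposes the analogous knob as `connect_timeout`, which is the value hyperium/tonic#1215 makes it actually honour.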

DNS resolution is harder. There are currently two options:

  1. Lazy resolution seems to be the default, but it also seems very hard to put a timeout around (at least, I can't think of a way!). It's possible to put an overall timeout around every request in the application, but this is (a) overly broad and (b) error-prone. It's very easy to forget or simply not know that it's needed ("why don't the other timeouts take care of it?").
  2. Eager resolution has a timeout option. In practice, using this will probably mean doing DNS resolution on service startup, when the LoadBalancedChannel is created. This might be a good thing, preventing services from successfully starting when DNS would never resolve.

Of the two, I wonder if we should favour Eager resolution, and consider changing the default to this.
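To make the eager option concrete, here is a stdlib-only sketch of resolving DNS at startup with a hard deadline (the `resolve_eagerly` helper is hypothetical, not ginepro's API). `to_socket_addrs` blocks on the system resolver, so it runs on a worker thread and the wait is bounded with `recv_timeout`; a service would call this before accepting traffic and refuse to start on `Err`:

```rust
use std::net::{SocketAddr, ToSocketAddrs};
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Resolve `host_port` with a hard deadline; return Err to abort startup.
/// (Hypothetical helper for illustration only.)
fn resolve_eagerly(host_port: &str, deadline: Duration) -> Result<Vec<SocketAddr>, String> {
    let host_port = host_port.to_owned();
    let (tx, rx) = mpsc::channel();
    // The blocking resolver call runs on a worker thread so we can bound it.
    thread::spawn(move || {
        let _ = tx.send(host_port.to_socket_addrs().map(|a| a.collect::<Vec<_>>()));
    });
    match rx.recv_timeout(deadline) {
        Ok(Ok(addrs)) if !addrs.is_empty() => Ok(addrs),
        Ok(Ok(_)) => Err("DNS returned no addresses".into()),
        Ok(Err(e)) => Err(format!("DNS error: {e}")),
        Err(_) => Err("DNS resolution timed out".into()),
    }
}

fn main() {
    // localhost should resolve immediately. On Err, a real service would
    // log the failure and exit non-zero instead of starting half-broken.
    let addrs = resolve_eagerly("localhost:80", Duration::from_secs(2)).unwrap();
    assert!(!addrs.is_empty());
}
```

This is the fail-fast behaviour described above: a misconfigured network policy surfaces as a startup failure rather than as requests that hang ~75s in production.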

~~However, we might want a third option: Active lazy resolution (for want of a better name). Lazy resolution is currently passive, in that it happens in the background on a schedule. It is never actively called in the request flow, which is why it's hard to put a timeout around. Instead, could we implement something which actively calls probe_once() (with a timeout!) as part of the first request (or alternatively when GrpcServiceProbe.endpoints is empty)? This could give us lazy DNS resolution, but with timeouts.~~

Scratch that, I took a different approach: tower-rs/tower#715. EDIT: Nope, that hasn't worked out. Back to the drawing board.

ThomWright avatar Dec 28 '22 11:12 ThomWright