Connection timeouts
Bug description
Symptoms
- gRPC requests are taking too long to time out when Kubernetes network policies are misconfigured.
- The `connection_timeout_is_not_fatal` test takes ~75 seconds to finish.
Well-configured timeouts are important for system stability. Requests which take too long can hog resources and block other work from happening.
Causes
I can see two separate timeout problems:
- DNS resolution – when `ResolutionStrategy::Lazy` is used, there is currently no way to apply a timeout just for DNS resolution. If DNS never resolves, requests never complete.
- TCP connection – there is currently no way to set a connection timeout. On my machine, the socket seems to time out after ~75s. Even when we do set a connection timeout, `tonic` doesn't use it! (A standalone sketch of the hang is below.)
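For illustration, here is a minimal, standalone sketch of the second problem using plain `tonic` and `tokio` (this is not the project's test suite, and the address is just a placeholder non-routable IP): with no connect timeout configured anywhere, `connect()` sits waiting for the OS-level socket timeout.

```rust
use std::time::Instant;
use tonic::transport::Endpoint;

#[tokio::main]
async fn main() {
    // Placeholder address that (on most networks) never completes a TCP handshake,
    // so `connect()` is left waiting for the OS-level socket timeout.
    let endpoint = Endpoint::from_static("http://10.255.255.1:50051");

    let start = Instant::now();
    let result = endpoint.connect().await;

    // Expect an error only after the OS gives up (~75s on my machine), because no
    // connect timeout has been configured anywhere.
    println!("connect returned {:?} after {:?}", result.map(|_| ()), start.elapsed());
}
```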
Even though we're setting our own fairly short timeouts around the overall request, I've seen some strange behaviour where requests are hanging for a long time. I think there's still something else going on that I don't understand, but I expect addressing the two points above will be generally helpful anyway.
To Reproduce
For the TCP connection timeout, just run the tests. I'll supply a test for lazy DNS resolution timeouts in a separate PR.
Expected behavior
Ability to control timeouts for TCP connections and DNS resolution.
Environment
- OS: macOS
- Rust version: rustc 1.65.0 (897e37553 2022-11-02)
Additional context
Solutions
The TCP connection timeout is simpler to solve (though I will admit it took me a long time to find): we just need to set `connect_timeout` in the right places. First, `tonic` doesn't respect `connect_timeout`, which will be fixed by hyperium/tonic#1215. When that is merged, we can create our own `connect_timeout` option on top of it in #38.
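For reference, this is roughly where the knob lives on the underlying `tonic` `Endpoint` (shown with plain `tonic` rather than ginepro's builder; the address and durations are placeholders). Per the above, the connect timeout is currently ignored until hyperium/tonic#1215 lands.

```rust
use std::time::Duration;
use tonic::transport::Endpoint;

#[tokio::main]
async fn main() -> Result<(), tonic::transport::Error> {
    let endpoint = Endpoint::from_static("http://10.255.255.1:50051")
        // Bounds the TCP handshake itself. This is the setting tonic currently
        // ignores; hyperium/tonic#1215 should make it effective.
        .connect_timeout(Duration::from_secs(5))
        // Bounds each request *after* a connection exists – it does not help
        // with a hanging connect, which is why both knobs are needed.
        .timeout(Duration::from_secs(5));

    let _channel = endpoint.connect().await?;
    Ok(())
}
```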
DNS resolution is harder. There are currently two options:
- Lazy resolution seems to be the default, but it also seems very hard to put a timeout around (at least, I can't think of a way!). It's possible to put an overall timeout around every request in the application (sketched after this list), but this is (1) overly broad and (2) error-prone. It's very easy to forget or simply not know that it's needed ("why don't the other timeouts take care of it?").
- Eager resolution has a timeout option. In practice, using this will probably mean doing DNS resolution on service startup, when the `LoadBalancedChannel` is created. This might be a good thing, preventing services from successfully starting when DNS would never resolve.
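To make the first option concrete, this is the kind of blanket per-call deadline I mean (`fake_grpc_call` just stands in for a real generated client call; it's not ginepro code). The outer timeout does bound DNS resolution and connection establishment as well, but every call site has to remember to apply it.

```rust
use std::time::Duration;
use tokio::time::timeout;

// Stand-in for a gRPC call on a lazily-resolved channel; in real code this
// would be something like `client.say_hello(request)` on a generated client.
async fn fake_grpc_call() -> Result<&'static str, &'static str> {
    // Simulate a request that never completes because DNS never resolves.
    tokio::time::sleep(Duration::from_secs(3600)).await;
    Ok("response")
}

#[tokio::main]
async fn main() {
    // The blunt workaround: an outer deadline around the whole call. It covers
    // DNS resolution and connection establishment as well as the request itself,
    // but it has to be repeated (and remembered) at every call site.
    match timeout(Duration::from_secs(5), fake_grpc_call()).await {
        Ok(result) => println!("call finished: {:?}", result),
        Err(_elapsed) => println!("call timed out after 5s"),
    }
}
```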
Of the two, I wonder if we should favour Eager resolution, and consider changing the default to this.
~However, we might want a third option: Active lazy resolution (for want of a better name). Lazy resolution is currently passive, as in it happens in the background on a schedule. It is never actively called in the request flow, which is why it's hard to put a timeout around. Instead, could we implement something which actively calls `probe_once()` (with a timeout!) as part of the first request (or alternatively when `GrpcServiceProbe.endpoints` is empty)? This could give us lazy DNS resolution, but with timeouts.~
Scratch that, I took a different approach: tower-rs/tower#715. EDIT: Nope, that hasn't worked out. Back to the drawing board.