Configurable Call Timeout
I would like to be able to configure timeouts on a per-request basis, similar to what OkHttp calls Call Timeout (link).
Detailed Description
My current understanding is that the following request-related timeouts can be configured:
- connections.connect-timeout for starting the TCP connection
- connections.socket-timeout for data flow / packets
- timeouts.global wrapping everything (retries, queuing, backup requests, fallbacks, etc.)
From my perspective, what is missing is a way to define a time limit for a complete HTTP call (only), including resolving DNS, connecting, writing the request body, server processing, as well as reading the response body. OkHttp calls this Call Timeout.
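For reference, a minimal sketch of what I mean, using OkHttp (the 50 ms value is just an example):

import java.util.concurrent.TimeUnit;
import okhttp3.OkHttpClient;

// OkHttp's call timeout spans the whole call: resolving DNS, connecting, writing the
// request, server processing, and reading the response body.
OkHttpClient client = new OkHttpClient.Builder()
        .callTimeout(50, TimeUnit.MILLISECONDS)
        .build();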
Context
Currently it appears to me that there are only very low-level or very high-level timeout configuration options, which makes it hard to configure SLO-based timeouts. As upstream services often define latency objectives on a per-request basis, I would also like to configure my request timeouts on a per-request basis.
To be more specific: I am interested in neither a particular connections.connect-timeout nor a specific connections.socket-timeout, as long as the entire HTTP request finishes within a given time.
The timeouts.global, on the other hand, is unfortunately too global because it spans multiple requests, which means that configuring something like this...
clients:
  example:
    connections:
      connect-timeout: 25 milliseconds
      socket-timeout: 25 milliseconds
    retry:
      enabled: true
      max-retries: 2
    timeouts:
      enabled: true
      global: 150 milliseconds
... could still result in the global timeout kicking in before any retry happens, because a single slow response that keeps delivering packets never trips the 25 millisecond socket timeout and can consume the entire 150 millisecond budget on its own.
Possible Implementation
- a per-call timeout (maybe via a plugin)
- possibilities to configure the policy composition via the auto configuration
Your Environment
- Version used: 3.0.0-RC.15
- Pretty standard Spring Boot app (2.3.7.RELEASE)
Currently it appears to me that there are only very low-level or very high-level timeout configuration options, which makes it hard to configure SLO-based timeouts. As upstream services often define latency objectives on a per-request basis, I would also like to configure my request timeouts on a per-request basis.
The way it was supposed to be used is to split your client up into multiple clients if there is a need to separate configuration within a client:
riptide.clients:
  example-login:
    connections:
      connect-timeout: 25 milliseconds
      socket-timeout: 25 milliseconds
    retry:
      enabled: true
      max-retries: 2
    timeouts:
      enabled: true
      global: 150 milliseconds
  example-download:
    connections:
      connect-timeout: 25 milliseconds
      socket-timeout: 500 milliseconds
    retry:
      enabled: true
      max-retries: 2
    timeouts:
      enabled: true
      global: 30 seconds
From my perspective, what is missing is a way to define a time limit for a complete HTTP call (only), including resolving DNS, connecting, writing the request body, server processing, as well as reading the response body.
That's what connect-timeout (resolving DNS, connecting) and socket-timeout (server processing, as well as reading the response body) are for. Writing the request body is not subject to either of those two timeouts, because it's the client's responsibility, not the server's.
Global timeouts allow you to manage your own SLAs (what you promise your own clients) while connect- and socket-timeouts (and to some degree backup-requests) allow you to deal with SLAs of your downstream dependencies.
The way it was supposed to be used is to split your client up into multiple clients
That's understood and also what we do.
That's what connect-timeout (resolving DNS, connecting) and socket-timeout (server processing, as well as reading the response body) are for.
Please correct me if I am wrong, but setting connect-timeout: 25 milliseconds and socket-timeout: 25 milliseconds will not result in a "(call) timeout" of 50ms, as the socket timeout (SO_TIMEOUT) is defined as the "maximum period of inactivity between two consecutive data packets" (link). It's a common misconception that the socket timeout is the timeout to receive the full response. This IMO means that with riptide in its current form you can't easily (if at all) "fail fast and retry", because your initial request might eat up your entire global timeout budget without any retry ever kicking in.
What I want my application to do is not wait longer than e.g. 50ms per request for a specific client, and if this 50ms "call timeout" budget is used up, trigger one of my retries (which then has another guaranteed 50ms "call timeout" budget, but not more)... Similar to what you would achieve by changing the policy composition from the current default Timeout(RetryPolicy(CircuitBreaker(Supplier))) (simplified) to RetryPolicy(CircuitBreaker(Timeout(Supplier))) (simplified).
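To make it concrete, here is a rough sketch of that composition using Failsafe directly (simplified; executeHttpCall() is a placeholder, and this is not how riptide currently wires things up):

import java.time.Duration;
import net.jodah.failsafe.CircuitBreaker;
import net.jodah.failsafe.Failsafe;
import net.jodah.failsafe.RetryPolicy;
import net.jodah.failsafe.Timeout;

// Policies passed to Failsafe.with(...) compose outer-to-inner, i.e. this is
// RetryPolicy(CircuitBreaker(Timeout(Supplier))): every attempt gets a fresh 50ms budget.
RetryPolicy<String> retry = new RetryPolicy<String>().withMaxRetries(2);
CircuitBreaker<String> breaker = new CircuitBreaker<>();
Timeout<String> callTimeout = Timeout.of(Duration.ofMillis(50)); // per-attempt "call timeout"

String response = Failsafe.with(retry, breaker, callTimeout)
        .get(() -> executeHttpCall()); // executeHttpCall() stands in for the actual request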
Please correct me if I am wrong, but setting connect-timeout: 25 milliseconds and socket-timeout: 25 milliseconds will not result in a "(call) timeout" of 50ms, as the socket timeout (SO_TIMEOUT) is defined as the "maximum period of inactivity between two consecutive data packets" (link).
Yes, correct.
This IMO means that with riptide in its current form you can't easily (if at all) "fail fast and retry", because your initial request might eat up your entire global timeout budget without any retry ever kicking in.
You can with a backup request. The following will issue a second, concurrent request if the first one takes longer than 50 milliseconds. Backup requests are essentially latency-based retries, which sounds like what you want in this case (because your strategy to counter slow requests is retrying).
riptide.clients:
  example:
    connections:
      connect-timeout: 25 milliseconds
      socket-timeout: 25 milliseconds
    backup-request:
      enabled: true
      delay: 50 milliseconds
    timeouts:
      enabled: true
      global: 150 milliseconds
Similar to what you would achieve by changing the policy composition from the current default Timeout(RetryPolicy(CircuitBreaker(Supplier))) (simplified) to RetryPolicy(CircuitBreaker(Timeout(Supplier))) (simplified).
Correct, that would give you what you want. There is actually no need to remove the outermost Timeout. Both serve a purpose.
So you'd like to have this, then?
riptide.clients:
  example:
    connections:
      connect-timeout: 25 milliseconds
      socket-timeout: 25 milliseconds
    retry:
      enabled: true
      max-retries: 2
    timeouts:
      enabled: true
      call: 50 milliseconds
      global: 160 milliseconds # 3 calls, each 50ms + some time to issue a new request
This would only really be useful for scenarios where there is no retry desired, i.e. really fail fast.
Personally, I don't see a big value in failing because something took too long only to immediately retry it. Your client will have to wait either way.
Yes, we also just discussed the usage of backup-request, but that is limited to one additional request, right?
Correct, that would give you what you want. There is actually no need to remove the outermost Timeout. Both serve a purpose.
Correct, I simplified to make my point clear.
I don't see a big value in failing because something took too long only to immediately retry it. Your client will have to wait either way.
I think the assumption here is: if a call has already taken longer than the expected latency (e.g. p999 + 10% = 25ms), the probability that it will finish any time soon is rather low. So better to stop waiting and issue a new request, reusing the open connection. But you are right, that sounds like a nice use case for a backup call.
Edit: Allowing "local (call) timeouts" to be configured in addition to "global timeouts" seems like a good addition nevertheless.
Yes, we also just discussed the usage of backup-request, but that is limited to one additional request, right?
Correct.
I think the assumption here is: if a call has already taken longer than the expected latency (e.g. p999 + 10% = 25ms), the probability that it will finish any time soon is rather low. So better to stop waiting and issue a new request, reusing the open connection. But you are right, that sounds like a nice use case for a backup call.
Backup requests would run concurrently and give the original request a chance to still finish. The first request to finish will be used. Might as well be the original request.
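Conceptually it is just a race between the original call and a delayed second call. A minimal sketch (performCall() is a made-up placeholder, not riptide's actual implementation):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// The backup call only starts after 50 ms; whichever response arrives first is used,
// the slower one is simply ignored.
CompletableFuture<String> original = performCall();
CompletableFuture<String> backup = CompletableFuture
        .supplyAsync(() -> performCall(), CompletableFuture.delayedExecutor(50, TimeUnit.MILLISECONDS))
        .thenCompose(call -> call);

Object winner = CompletableFuture.anyOf(original, backup).join();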
Allowing "local (call) timeouts" to be configured in addition to "global timeouts" seems like a good addition nevertheless.
Are you interested in contributing this? The way I see it, it's essentially just an addition to the auto configuration, because all the building blocks are already there (FailsafePlugin + Timeout policy).
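The essential new building block would be a Failsafe Timeout policy created from a new timeouts.call property, roughly like this (sketch only; the wiring into the client's FailsafePlugin and the auto configuration is omitted):

import java.time.Duration;
import net.jodah.failsafe.Timeout;

// One Timeout policy per configured call timeout; placed inside the retry so that
// each attempt gets its own budget.
Timeout<Object> callTimeout = Timeout.of(Duration.ofMillis(50));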
@whiskeysierra Yes, I believe we have a couple of guys who would like to contribute.
@jbspeakr I saw you made a commit to your fork. Dare to open a PR?