Configurable Call Timeout
I would like to be able to configure timeouts on a per-request basis, similar to what OkHttp calls Call Timeout (link).
Detailed Description
My current understanding is that the following request-related timeouts can be configured:
- connections.connect-timeout for starting the TCP connection
- connections.socket-timeout for data flow / packets
- timeouts.global wrapping everything (retries, queuing, backup requests, fallbacks, etc.)
From my perspective, what is missing is a way to define a time limit for a complete HTTP call (only), including resolving DNS, connecting, writing the request body, server processing, as well as reading the response body. OkHttp calls this Call Timeout.
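For reference, a minimal sketch of what I mean, using OkHttp (the 50 ms value is just an example):

import java.util.concurrent.TimeUnit;
import okhttp3.OkHttpClient;

// OkHttp's call timeout spans the whole call: resolving DNS, connecting, writing the
// request, server processing, and reading the response body.
OkHttpClient client = new OkHttpClient.Builder()
        .callTimeout(50, TimeUnit.MILLISECONDS)
        .build();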
Context
Currently it appears to me that there are only very low-level or very high-level timeout configuration options, which makes it hard to configure SLO-based timeouts. As upstream services often define latency objectives on a per-request basis, I would also like to configure my request timeouts on a per-request basis.
To be more specific: I am interested in neither a particular connections.connect-timeout nor a specific connections.socket-timeout, as long as the entire HTTP request finishes within a given time.
The timeouts.global, on the other hand, is unfortunately too global because it spans multiple requests, which means that configuring something like this...
clients:
  example:
    connections:
      connect-timeout: 25 milliseconds
      socket-timeout: 25 milliseconds
    retry:
      enabled: true
      max-retries: 2
    timeouts:
      enabled: true
      global: 150 milliseconds
... could still result in the global timeout kicking in before any retry happens, because a single slow response that keeps delivering packets never trips the 25 millisecond socket timeout and can consume the entire 150 millisecond budget on its own.
Possible Implementation
- a per-call timeout (maybe via a plugin)
- possibilities to configure the policy composition via the auto configuration
Your Environment
- Version used: 3.0.0-RC.15
- Pretty standard Spring Boot app (2.3.7.RELEASE)
Currently it appears to me that there are only very low-level or very high-level timeout configuration options, which makes it hard to configure SLO-based timeouts. As upstream services often define latency objectives on a per-request basis, I would also like to configure my request timeouts on a per-request basis.
The way it was supposed to be used is to split your client up into multiple clients if there is a need to separate configuration within a client:
riptide.clients:
  example-login:
    connections:
      connect-timeout: 25 milliseconds
      socket-timeout: 25 milliseconds
    retry:
      enabled: true
      max-retries: 2
    timeouts:
      enabled: true
      global: 150 milliseconds
  example-download:
    connections:
      connect-timeout: 25 milliseconds
      socket-timeout: 500 milliseconds
    retry:
      enabled: true
      max-retries: 2
    timeouts:
      enabled: true
      global: 30 seconds
From my perspective, what is missing is a way to define a time limit for a complete HTTP call (only), including resolving DNS, connecting, writing the request body, server processing, as well as reading the response body.
That's what connect-timeout (resolving DNS, connecting) and socket-timeout (server processing, as well as reading the response body) are for. Writing the request body is not subject to either of those two timeouts, because it's the client's responsibility, not the server's.
Global timeouts allow you to manage your own SLAs (what you promise your own clients) while connect- and socket-timeouts (and to some degree backup-requests) allow you to deal with SLAs of your downstream dependencies.
The way it was supposed to be used is to split your client up into multiple clients
That's understood and also what we do.
That's what connect-timeout (resolving DNS, connecting) and socket-timeout (server processing, as well as reading the response body) are for.
Please correct me if I am wrong, but setting connect-timeout: 25 milliseconds and socket-timeout: 25 milliseconds will not result in a "(call) timeout" of 50ms, as the socket timeout (SO_TIMEOUT) is defined as the "maximum period of inactivity between two consecutive data packets" (link). It's a common misconception that the socket timeout is the timeout to receive the full response. This IMO means that with riptide in its current form you can't easily (if at all) "fail fast and retry", because your initial request might eat up your entire global timeout budget without any retry ever kicking in.
What I want my application to do is not wait longer than e.g. 50ms per request for a specific client, and if this 50ms "call timeout" budget is used up, trigger one of my retries (which then has another guaranteed 50ms "call timeout" budget, but not more)... Similar to what you would achieve by changing the policy composition from the current default Timeout(RetryPolicy(CircuitBreaker(Supplier))) (simplified) to RetryPolicy(CircuitBreaker(Timeout(Supplier))) (simplified).
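To make it concrete, here is a rough sketch of that composition using Failsafe directly (simplified; executeHttpCall() is a placeholder, and this is not how riptide currently wires things up):

import java.time.Duration;
import net.jodah.failsafe.CircuitBreaker;
import net.jodah.failsafe.Failsafe;
import net.jodah.failsafe.RetryPolicy;
import net.jodah.failsafe.Timeout;

// Policies passed to Failsafe.with(...) compose outer-to-inner, i.e. this is
// RetryPolicy(CircuitBreaker(Timeout(Supplier))): every attempt gets a fresh 50ms budget.
RetryPolicy<String> retry = new RetryPolicy<String>().withMaxRetries(2);
CircuitBreaker<String> breaker = new CircuitBreaker<>();
Timeout<String> callTimeout = Timeout.of(Duration.ofMillis(50)); // per-attempt "call timeout"

String response = Failsafe.with(retry, breaker, callTimeout)
        .get(() -> executeHttpCall()); // executeHttpCall() stands in for the actual request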
Please correct me if I am wrong, but setting connect-timeout: 25 milliseconds and socket-timeout: 25 milliseconds will not result in a "(call) timeout" of 50ms, as the socket timeout (SO_TIMEOUT) is defined as the "maximum period of inactivity between two consecutive data packets" (link).
Yes, correct.
This IMO means that with riptide in its current form you can't easily (if at all) "fail fast and retry", because your initial request might eat up your entire global timeout budget without any retry ever kicking in.
You can with a backup request. The following will issue a second, concurrent request if the first one takes longer than 50 milliseconds. Backup requests are essentially latency-based retries, which sounds like what you want in this case (because your strategy to counter slow requests is retrying).
riptide.clients:
  example:
    connections:
      connect-timeout: 25 milliseconds
      socket-timeout: 25 milliseconds
    backup-request:
      enabled: true
      delay: 50 milliseconds
    timeouts:
      enabled: true
      global: 150 milliseconds
Similar to what you would achieve by changing the policy composition from the current default Timeout(RetryPolicy(CircuitBreaker(Supplier))) (simplified) to RetryPolicy(CircuitBreaker(Timeout(Supplier))) (simplified).
Correct, that would give you what you want. There is actually no need to remove the outermost Timeout. Both serve a purpose.
So you'd like to have this, then?
riptide.clients:
  example:
    connections:
      connect-timeout: 25 milliseconds
      socket-timeout: 25 milliseconds
    retry:
      enabled: true
      max-retries: 2
    timeouts:
      enabled: true
      call: 50 milliseconds
      global: 160 milliseconds # 3 calls, each 50ms + some time to issue a new request
This would only really be useful for scenarios where there is no retry desired, i.e. really fail fast.
Personally, I don't see a big value in failing because something took too long only to immediately retry it. Your client will have to wait either way.
Yes, we also just discussed the usage of backup-request, but that is limited to one additional request, right?
Correct, that would give you what you want. There is actually no need to remove the outermost Timeout. Both serve a purpose.
Correct, I simplified to make my point clear.
I don't see a big value in failing because something took too long only to immediately retry it. Your client will have to wait either way.
I think the assumption here is: if a call has already taken longer than the expected latency (e.g. p999 + 10% = 25ms), the probability that it will finish any time soon is rather low. So better to stop waiting and issue a new request, reusing the open connection. But you are right, that sounds like a nice use case for a backup call.
Edit: Allowing "local (call) timeouts" to be configured in addition to "global timeouts" seems like a good addition nevertheless.
Yes, we also just discussed the usage of backup-request, but that is limited to one additional request, right?
Correct.
I think the assumption here is: if a call has already taken longer than the expected latency (e.g. p999 + 10% = 25ms), the probability that it will finish any time soon is rather low. So better to stop waiting and issue a new request, reusing the open connection. But you are right, that sounds like a nice use case for a backup call.
Backup requests would run concurrently and give the original request a chance to still finish. The first request to finish will be used. Might as well be the original request.
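Conceptually it is just a race between the original call and a delayed second call. A minimal sketch (performCall() is a made-up placeholder, not riptide's actual implementation):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// The backup call only starts after 50 ms; whichever response arrives first is used,
// the slower one is simply ignored.
CompletableFuture<String> original = performCall();
CompletableFuture<String> backup = CompletableFuture
        .supplyAsync(() -> performCall(), CompletableFuture.delayedExecutor(50, TimeUnit.MILLISECONDS))
        .thenCompose(call -> call);

Object winner = CompletableFuture.anyOf(original, backup).join();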
Allowing "local (call) timeouts" to be configured in addition to "global timeouts" seems like a good addition nevertheless.
Are you interested in contributing this? The way I see it, it's essentially just an addition to the auto configuration, because all the building blocks are already there (FailsafePlugin + Timeout policy).
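The essential new building block would be a Failsafe Timeout policy created from a new timeouts.call property, roughly like this (sketch only; the wiring into the client's FailsafePlugin and the auto configuration is omitted):

import java.time.Duration;
import net.jodah.failsafe.Timeout;

// One Timeout policy per configured call timeout; placed inside the retry so that
// each attempt gets its own budget.
Timeout<Object> callTimeout = Timeout.of(Duration.ofMillis(50));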
@whiskeysierra Yes, I believe we have a couple of guys who would like to contribute.
@jbspeakr I saw you made a commit to your fork. Dare to open a PR?