
Standard HTTP headers for request timeout propagation

Open jpraet opened this issue 9 months ago • 5 comments

Managing request timeouts in a distributed microservice architecture can be quite challenging. Configuring fixed timeout values between each component is often not an ideal approach. It would be better to propagate a dynamically decreasing timeout value, or an absolute deadline timestamp, throughout the call chain. This enables a downstream component to stop processing a request when it determines that the upstream component has already timed out. It avoids spending resources on a request whose response would never reach the client anyway, because the client has already timed out.

Propagating an absolute deadline timestamp has some advantages:

  • properly accounts for network or request queueing delays
  • no recalculation needed: once calculated at entrypoint, the deadline can be passed along unchanged
  • simple to check: compare deadline against current timestamp

But the caveat is that it relies on synchronized clocks throughout the system.

Therefore, I would like to propose two HTTP headers:

  • BelGov-Request-Timeout: a timeout value in seconds (e.g. "30"), suitable between system boundaries for which clock synchronization cannot be guaranteed
  • BelGov-Request-Deadline: an absolute timestamp (e.g. "Thu, 06 Mar 2025 16:56:32 GMT"), for propagation within a clock-synchronized system

The typical usage would be that external clients can optionally add the BelGov-Request-Timeout HTTP header to their API request, to indicate how many seconds they are prepared to wait for a response. An inbound API gateway could translate that to a BelGov-Request-Deadline header (current timestamp + BelGov-Request-Timeout), which will be propagated to components within the system. An outbound API gateway could translate the BelGov-Request-Deadline back into a BelGov-Request-Timeout (calculating the number of seconds remaining until the deadline) when sending requests to other external systems, to inform them how long the API gateway is prepared to wait for a response.
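The two gateway translations could look roughly like the Python sketch below. It is only an illustration under assumptions: the function names are hypothetical, the timeout is taken to be a whole number of seconds, and the remaining time is rounded down so the gateway never advertises more time than is really left.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime
import math


def timeout_to_deadline(timeout_header: str, now: datetime) -> str:
    """Inbound gateway: BelGov-Request-Timeout -> BelGov-Request-Deadline.

    `now` must be a timezone-aware datetime.
    """
    deadline = now + timedelta(seconds=int(timeout_header))
    # HTTP dates have second precision and are always expressed in GMT.
    deadline = deadline.astimezone(timezone.utc).replace(microsecond=0)
    return format_datetime(deadline, usegmt=True)


def deadline_to_timeout(deadline_header: str, now: datetime) -> str:
    """Outbound gateway: BelGov-Request-Deadline -> BelGov-Request-Timeout."""
    deadline = parsedate_to_datetime(deadline_header)
    if deadline.tzinfo is None:
        # Assumption: a timestamp without zone information is treated as UTC.
        deadline = deadline.replace(tzinfo=timezone.utc)
    remaining = (deadline - now).total_seconds()
    # Round down and never send a negative timeout.
    return str(max(0, math.floor(remaining)))


received_at = datetime(2025, 3, 6, 16, 56, 2, tzinfo=timezone.utc)
print(timeout_to_deadline("30", received_at))
print(deadline_to_timeout("Thu, 06 Mar 2025 16:56:32 GMT",
                          received_at + timedelta(seconds=12.5)))
```

A remaining value of "0" would mean the deadline has already passed, in which case the outbound gateway would probably not send the request at all.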

When a component determines there is not enough time available to complete the request within the allotted timeout budget, it should return an HTTP 504 Gateway Timeout response. Maybe a urn:problem-type:belgif:gatewayTimeout problem type could be defined as well.
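A component-side check could be sketched as follows in Python. Everything here is an assumption for illustration: the `handle` and `budget_exceeded` helpers are hypothetical, headers are a plain dict with exact-case keys, a malformed deadline header is simply ignored, and the problem type is the one tentatively suggested above, not an existing definition.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

DEADLINE_HEADER = "BelGov-Request-Deadline"


def budget_exceeded(deadline_header, now, estimated_seconds=0.0):
    """Return True when the remaining time budget cannot cover the work."""
    if deadline_header is None:
        return False  # no deadline propagated: process as usual
    try:
        deadline = parsedate_to_datetime(deadline_header)
    except (TypeError, ValueError):
        return False  # assumption: an unparseable header is ignored
    if deadline.tzinfo is None:
        deadline = deadline.replace(tzinfo=timezone.utc)
    remaining = (deadline - now).total_seconds()
    return remaining <= 0 or remaining < estimated_seconds


def handle(headers, now, estimated_seconds, process):
    """Return (status, body); reject early when the deadline cannot be met."""
    if budget_exceeded(headers.get(DEADLINE_HEADER), now, estimated_seconds):
        problem = {
            # Hypothetical problem type, only proposed in this issue.
            "type": "urn:problem-type:belgif:gatewayTimeout",
            "status": 504,
            "title": "Gateway Timeout",
        }
        return 504, problem
    return 200, process()
```

The `estimated_seconds` parameter lets a component reject a request before starting work it expects not to finish in time; passing 0 reduces the check to a plain "deadline already passed" comparison.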

I don't think the guidelines should force institutions to implement these timeout-related HTTP headers. It's rather a standardization in the sense of: "IF you want to implement propagating timeouts, please use these standard headers instead of inventing your own".

Some related resources:

  • https://www.ietf.org/archive/id/draft-thomson-hybi-http-timeout-00.html
  • https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/504
  • https://sre.google/sre-book/addressing-cascading-failures/#xref_cascading-failure_latency-and-deadlines
  • https://grpc.io/docs/guides/deadlines/
  • https://userver.tech/d6/d64/md_en_2userver_2deadline__propagation.html

[!NOTE] The proposed headers currently only have second-precision, for 2 reasons:

  • The Request-Timeout header in the RFC draft is also expressed in seconds
  • RFC 2616 prescribes "Thu, 06 Mar 2025 16:56:32 GMT" as the preferred format for timestamps in HTTP headers

Is second-precision sufficient, or do we require millisecond-precision?

jpraet avatar Mar 17 '25 14:03 jpraet

I'm wondering if it's worthwhile to standardize such a pattern. Measures to guarantee server stability can be implemented by user quotas, circuit breakers or bulkheads that enforce SLAs. Client response times are mostly a design-time issue: if a system isn't designed to respect response time expectations, users won't be happy either way. Checking deadlines at runtime is an optimization during server instability, but at quite a high cost. Do you see any other possible benefits?

Some challenges to implement such a system:

  • propagating and respecting the deadlines throughout all infra and applications is invasive and costly: there doesn't seem to be widespread implementation of this pattern for HTTP in servers or frameworks (only in gRPC AFAIK)
  • when a request is waiting to be processed (e.g. waiting for an available thread from the thread pool, outgoing TCP ports exhausted, ...), it is difficult to measure the waiting delay, or to remove the request from the waiting queue based on the header value
  • the network delay to transfer a request or response can't be taken into account in case of duration-based deadline/timeout
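The last point can be illustrated with a small Python sketch of duration-based tracking. It assumes each component starts a local countdown when it receives the request, using a monotonic clock; the `TimeoutBudget` class and its method names are hypothetical. Time spent on the network before the countdown starts is invisible to the receiver.

```python
import math
import time


class TimeoutBudget:
    """Tracks the time left from a received BelGov-Request-Timeout value."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._clock = clock
        # The countdown starts at receipt; transit time is not accounted for.
        self._expires_at = clock() + timeout_seconds

    def remaining(self):
        """Seconds left in the budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def expired(self):
        return self.remaining() <= 0

    def outgoing_header(self):
        """Value to send downstream, rounded down to whole seconds."""
        return str(math.floor(self.remaining()))
```

Each hop has to recompute the outgoing value just before calling the next component, which is the recalculation an absolute deadline would avoid.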

Even if the client won't handle a response due to deadline exceeded, it can still be desirable to continue processing:

  • if a (POST) action is still performed but the response exceeds the client's deadline, the client can check the result with a GET later
  • the server can cache a response so next retry will be faster

Taking into account these nuances when determining deadlines is hard.

pvdbosch avatar Mar 19 '25 11:03 pvdbosch

All valid points. At CBSS we are currently still on the fence about whether or not we will implement such a dynamic timeout propagation solution, as it is indeed quite complex. It's definitely not the goal to enforce this pattern in the REST guide, but rather to propose a standard header to make things easier for organizations that do wish to implement such a solution (avoiding organization-specific headers).

Even when configuring a static request timeout of let's say 10 seconds, doesn't it have some added value to communicate this in the request to the supplier, allowing them to act on it if they wish?

jpraet avatar Mar 21 '25 14:03 jpraet

We can ask around in the WG if organizations already have experience with such solutions. I'd prefer not to standardize until the usefulness has been proven in practice, to avoid bloating the REST guide too much.

pvdbosch avatar Mar 28 '25 11:03 pvdbosch

@jpraet, could you add links to other implementations of this pattern?

Other WG members only have experience with statically configured timeouts on a service call.

pvdbosch avatar Mar 28 '25 13:03 pvdbosch

Some frameworks that support timeout / deadline propagation:

Some articles on the subject:

And a YouTube video mentioning deadline propagation in the context of microservices architecture at Netflix.

jpraet avatar Apr 11 '25 11:04 jpraet