We should allow setting an Idle timeout
/area API
Describe the feature
see also #10851 (issue for max duration timeout).
We changed the TimeoutSeconds field at some point to only be a timeout to first byte. This was to avoid killing idle websocket connections. However an actual idle timeout would be useful to guard against infinite-looping or deadlocked requests costing money forever, or even just to ensure that an old revision with an active websocket doesn't run forever and never scale down (see https://github.com/knative/serving/issues/10835).
To avoid problems with websockets this would default to 0 = infinite timeout.
/triage accepted
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
/remove-lifecycle stale
Hey @julz I'd love to investigate into this issue. Assigning myself, but feel free to let me know if there is another plan already.
/assign
We may want to wait for https://github.com/knative/serving/issues/10851 to land before picking this one up to avoid potentially solving some of the same/similar problems twice
(Unless @skonto thinks they're independent enough to progress separately)
@julz @nealhu my plan is to add a case to the select statement here (as a side note, case order is not deterministic if more than one channel is ready) so I can check for the max duration and, if it is exceeded, return a timeout using the same logic as the first-byte one. In any case we need to think about defaults for both features.
Now, idle-timeout counting could be done by resetting a timer each time there is a write and failing if the timer expires without a write. Another option is to have a dedicated handler for each scenario, which has the benefit that PR work stays isolated. Any scenario wfm btw, as I can always rebase. However, let's choose a design so each solution follows the same pattern. Are there any other timeouts missing? WDYT?
Some background info (@julz correct me if I am wrong):
Right now the timeout for the first-byte timer is defined as `timeout := time.Duration(env.RevisionTimeoutSeconds) * time.Second`. `RevisionTimeoutSeconds` is always smaller than `MaxRevisionTimeoutSeconds`. Both are also used at the ingress side. Btw, one thing that never happened is QP timeout = Activator timeout; for more, check here.
Yes, I agree we should sync on this. I did some investigation and here are some findings & thoughts:
- The concept of "idle" varies across protocols.
- For HTTP/1.x, there is the keep-alive setting, i.e. the idle time between two HTTP requests. The connection gets closed/recycled if that time exceeds some threshold.
- For HTTP/2, there is a health-check mechanism that sends periodic ping frames; if there is no response within a certain time, the connection is closed.
- For WebSocket, there is a similar ping/pong mechanism to keep the connection open.
- For TCP, the foundation of all of the above, there is an independent timeout mechanism: it sends periodic keep-alive packets and closes the connection after a certain number of keep-alive packets go unanswered.
- As @skonto pointed out, we have the ability to add timers to the wrapper of `http.ResponseWriter`. This would be a new type of timeout, somewhere in between TCP and the higher-level protocols. The definition would be something like: the max interval allowed between the TCP layer surfacing raw bytes to the HTTP layer.
- I don't think we can have full control over a TCP idle timeout; it might also depend on the underlying OS.
- We have some ability to tweak idle timeouts in the application layer. For example, queue-proxy is built on top of this `http.Server` (https://github.com/knative/pkg/blob/main/network/h2c.go#L30); we could set `IdleTimeout` for both HTTP/1.x and HTTP/2 there. I haven't found where we can change the websocket configs yet.
The advantage of the method mentioned in 2 is that the change seems to apply to all protocols, even websockets. If we do this, I'd vote for making separate handlers for each scenario, if that doesn't add too much perf overhead, because the code will be cleaner and easier to test and reuse.
The advantage of the method mentioned in 3 is that the meaning of the timeouts will align with the protocols and we are not inventing our own concepts.
I might be overthinking this :D let me know what you guys think.
Hey @julz @skonto , I think we should settle on a plan. Based on the discussion above, I think the pending questions are:
- What would be the defaults for these timeouts?
- Do we want to add the timeout to `http.Server`, or use logic similar to the first-byte timeout handler?
- Do we want to add more timeout logic on top of the first-byte timeout handler, or make each type of timeout a separate handler?
My personal preference is that:
- The default should be no timeout, as suggested in the issue description; maybe use negative values to represent that. This would minimize breaking backward compatibility.
- Use a handler to implement timeouts, because it applies to all sorts of protocols.
- Make different handlers for different kinds of timeouts, for reusability and testability.
Thoughts?
Agree on 1,2. For 3, regarding reusability I would refactor it if we need to reuse any of these handlers elsewhere (afaik no immediate need).
Regarding testability, I see these two timeouts as fixes/enhancements for `timeToFirstByteTimeoutHandler`, and it should be straightforward to expand the tests in `timeout_test.go`. It also makes sense to test combinations of the timeout settings in one place imho. If we use different handlers, I suspect we will need to abstract away some handler details in `timeToFirstByteTimeoutHandler` and test by mixing the handlers as well.
Anyway, on Monday I will be on PTO for a couple of weeks so @nealhu pls feel free to start the work here. I will rebase on top of what you add.
Hey @julz I made a WIP PR and added some questions there, let's move the discussion there https://github.com/knative/serving/pull/11791
Hey @julz @skonto , just made a PR to expose idle timeout to revision and higher levels, please take a look https://github.com/knative/serving/pull/11988 As mentioned in that PR description and in slack, we will need to discuss more on validation but I think that can be a different PR.
/unassign @nealhu
/assign @nader-ziada who's working on this
done as part of https://github.com/knative/serving/pull/12970
/close
@nader-ziada: Closing this issue.
In response to this:
done as part of https://github.com/knative/serving/pull/12970
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.