serving icon indicating copy to clipboard operation
serving copied to clipboard

We should allow setting an Idle timeout

Open julz opened this issue 3 years ago • 12 comments

/area API

Describe the feature

see also #10851 (issue for max duration timeout).

We changed the TimeoutSeconds field at some point to only be a timeout to first byte. This was to avoid killing idle websocket connections. However an actual idle timeout would be useful to guard against infinite-looping or deadlocked requests costing money forever, or even just to ensure that an old revision with an active websocket doesn't run forever and never scale down (see https://github.com/knative/serving/issues/10835).

To avoid problems with websockets this would default to 0 = infinite timeout.

julz avatar Feb 26 '21 09:02 julz

/triage accepted

evankanderson avatar Mar 15 '21 23:03 evankanderson

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Jun 14 '21 01:06 github-actions[bot]

/remove-lifecycle stale

julz avatar Jun 14 '21 07:06 julz

Hey @julz I'd love to investigate into this issue. Assigning myself, but feel free to let me know if there is another plan already.

/assign

nealhu avatar Aug 03 '21 02:08 nealhu

We may want to wait for https://github.com/knative/serving/issues/10851 to land before picking this one up to avoid potentially solving some of the same/similar problems twice

(Unless @skonto things they're independent enough to progress separately)

julz avatar Aug 03 '21 06:08 julz

@julz @nealhu my plan is to add a case, select statement here (as a side note case order is not deterministic if more than one is ready) so I can check for the max duration and if so return a timeout using the same logic used for the first byte one. In any case we need to think about defaults as well for both features. Now idle timeout counting could be done by resetting a timer each time there is write and failing if the timer expires without a write. Another option is to have a dedicated handler for each scenario which has the benefit that PR work is isolated. Any scenario wfm btw as I can always rebase. However lets choose a design so each solution follows the same pattern. Are there any other timeouts missing? WDYTH?

Some background info (@julz correct me if I am wrong): Right now the timeout for the first byte timer is defined as timeout := time.Duration(env.RevisionTimeoutSeconds) * time.Second. RevisionTimeoutSeconds is always smaller than MaxRevisionTimeoutSeconds. Both are also used at the ingress side . Btw one thing that never happened is QP timeout = Activator timeout, for more check here.

skonto avatar Aug 03 '21 20:08 skonto

Yes I agree we should sync on this, I did some investigation on this and here is some findings & thoughts:

  1. The concept of idle varies on different protocols.
    • For HTTP 1.x, there is the keep alive setting. I.e. the idle time between two HTTP requests. The connection gets closed/recycled if the time exceeds some threshold.
    • For HTTP 2, there is some healthcheck mechanism that sends periodical ping frames, if there is no response within a certain time, the connection will be closed.
    • For Websocket, there is a similar ping/pong mechanism to keep the connection open.
    • For TCP, it's the foundation of all above and it has an independent timeout mechanism. It will send periodical keep-alive packet and it closes the connection after a certain numbers of keep-alive packets.
  2. As @skonto pointed out, we have the ability to add timers to the wrapper of http.ResponseWriter. This will be a new type of timeout, somewhere in between TCP and higher-level protocols. The definition of the timeout is like: the max interval allowed between the TCP layer surfaces raw bytes to HTTP layer.
  3. I don't think we can have full control on TCP idle timeout, it might also depends on the underlying OS.
  4. We have some abilities to tweak idle timeouts in application layer. For example, queue proxy is built on top of this http.Server https://github.com/knative/pkg/blob/main/network/h2c.go#L30 , we could set IdleTimeout for both HTTP 1.x and HTTP 2 there. I haven't found where we can change the websocket configs yet.

The advantage of the method mentioned in 2 is: the change seems to apply to all protocols, even for websockets. If we do this, I'd vote for making separate handlers for each scenario if that doesn't add too much perf overhead. Because the code will be cleaner and easier to test and reuse.

The advantage of the method mentioned in 3 i: the meaning of timeouts will align with the protocols and we are not inventing our own concepts.

I might be overthinking this :D let me know what you guys think.

nealhu avatar Aug 03 '21 22:08 nealhu

Hey @julz @skonto , I think we should settle on a plan. Based on the discussion above, I think the pending questions are:

  1. What would be the defaults for these timeouts
  2. Do we want to add timeout at to http.Server or use a similar logic as the first byte timeout handler
  3. Do we want to add more timeout logic on top of the first byte timeout handler, or make each type of timeout a separate handler

My personal preference is that:

  1. The default should be no timeout as suggested in the issue description, maybe use negative values to represent that. This would minimize breaking backward compatibility.
  2. Use handler to implement timeouts, because it applies to all sort of protocols.
  3. Make different handlers for different kind of timeouts for reusability and testability.

Thoughts?

nealhu avatar Aug 06 '21 01:08 nealhu

Agree on 1,2. For 3, regarding reusability I would refactor it if we need to reuse any of these handlers elsewhere (afaik no immediate need). Regarding testability I see these two timeouts as fixes/enhancements for timeToFirstByteTimeoutHandler and should be straightforward to expand the tests in timeout_test.go. Also makes sense to test combinations of the timeout settings at one place imho. If we use different handlers I suspect we will need to abstract away some handler details in timeToFirstByteTimeoutHandler and test by mixing the handlers as well. Anyway, on Monday I will be on PTO for a couple of weeks so @nealhu pls feel free to start the work here. I will rebase on top of what you add.

skonto avatar Aug 06 '21 08:08 skonto

Hey @julz I made a WIP PR and added some questions there, let's move the discussion there https://github.com/knative/serving/pull/11791

nealhu avatar Aug 09 '21 02:08 nealhu

Hey @julz @skonto , just made a PR to expose idle timeout to revision and higher levels, please take a look https://github.com/knative/serving/pull/11988 As mentioned in that PR description and in slack, we will need to discuss more on validation but I think that can be a different PR.

nealhu avatar Sep 14 '21 03:09 nealhu

/unassign @nealhu /assign @nader-ziada who's working on this

dprotaso avatar Aug 06 '22 19:08 dprotaso

done as part of https://github.com/knative/serving/pull/12970

/close

nader-ziada avatar Sep 16 '22 15:09 nader-ziada

@nader-ziada: Closing this issue.

In response to this:

done as part of https://github.com/knative/serving/pull/12970

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

knative-prow[bot] avatar Sep 16 '22 15:09 knative-prow[bot]