serving icon indicating copy to clipboard operation
serving copied to clipboard

RevisionSpec.Timeout has deviated from the original specification

Open dprotaso opened this issue 3 years ago • 11 comments

/area API

I dropped the ball regarding how we handled RevisionSpec.Timeout in the OSS implementation. We deviated from what was originally in the spec and opt'd to change the spec (https://github.com/knative/serving/pull/10806) rather than fix the behaviour.

We prioritized backwards compatibility over remaining conformant.

In hindsight this was a mistake since:

  1. This led us to introduced a new field to cover the functional gap - (#10851)
  2. Timeout and MaxDuration being so similar is confusing and bad UX for end-users
  3. Workloads running on the OSS implementation will behave differently on other compliant Knative offerings.

New plan is here: https://github.com/knative/serving/issues/12634#issuecomment-1137765904

~I think the best path forward here is:~

  • [x] ~https://github.com/knative/serving/pull/12635~
    • ~Yes this is poor form but it's only been out for two weeks in the latest release at the time of writing and hasn't been documented~
  • [ ] ~create a feature flag that enables the first byte timeout logic (on by default)~
  • [ ] ~notify users about this default and be explicit they should set it for compatibility~
  • [ ] ~after N releases we swap the default to be conformant to the spec's original intention~
  • [ ] ~For those who opt into the first-byte timeout we provide a maxDuration annotation so they still have that ability~
  • [ ] ~Allow a per-revision override via annotations so existing revisions can work independently of the operator settings~

dprotaso avatar Feb 16 '22 02:02 dprotaso

/assign @dprotaso

dprotaso avatar Feb 16 '22 02:02 dprotaso

It feels like "time to first response byte" is a valid but niche use case. Rather than introducing a flag to change the meaning of timeoutSeconds, have we considered introducing a firstByteMaxTime or the like?

Thinking about it, it seems like the main use case for fBMT is for server or bidirectional streaming scenarios, which are definitely not the default.

evankanderson avatar Feb 16 '22 18:02 evankanderson

(I don't like firstByteMaxTime as a name, for the record.)

It still seems useful to be able to specify a first but and max stream duration at the Revision level, to enable safe lame-duck operation and bound times for proxy-level drain and restart without complicated TCP handoffs.

evankanderson avatar Feb 16 '22 18:02 evankanderson

Rather than introducing a flag to change the meaning of timeoutSeconds, have we considered introducing a firstByteMaxTime or the like?

I had not. Though now I wonder if adding support for idleTimeout (https://github.com/knative/serving/issues/10852) would cover the niche case of first byte timeout

dprotaso avatar Feb 16 '22 20:02 dprotaso

Timeout and MaxDuration being so similar is confusing and bad UX for end-users

It is confusing inded. IMHO MaxDuration is a self-explanatory name and seems better that Timeout to me.

Should we capture this in a design doc and describe how previous cases can be handled with the "new" design? It would help also documenting some of the scenarios eg. "if I use websockets then..." (diagrams would be nice too to explain options).

skonto avatar Feb 17 '22 09:02 skonto

create a feature flag that enables the first byte timeout logic (on by default)

@dprotaso enable it to be taken into account by the timeout handler? what happens if its disabled, the behaviour uses the MaxDuration only and ignore first byte response?

nader-ziada avatar May 24 '22 19:05 nader-ziada

I'm also wondering about the plan -- do we mean a feature flag that makes timeoutSeconds mean timeoutOnResponseStart, or a feature-flag for a new parameter (idleTimeoutSeconds or startResponseTimeout)?

evankanderson avatar May 24 '22 21:05 evankanderson

so here is my proposal:

Based on the user choice, the timeout handler would use the FirstByte or MaxDuration.

and I think since the MaxDuration seems to be the more common use case, we can make it the default, but at the same time allow the user to choose FirstByte if needed.

This new behaviour can start behind a feature gate for a release or two, then be on by default.

We can allow the user to choose using an annotation on the revision?

nader-ziada avatar May 25 '22 19:05 nader-ziada

Circling back - our timeout to first byte logic is currently broken for web sockets. When using web sockets the connection will remain open for Timeout duration - which conforms to what the spec originally had.

user reported issue: https://knative.slack.com/archives/CA9RHBGJX/p1645566907713149

This simplifies the original plan since the earlier concern we had was about breaking compatibility with long running web sockets connections isn't an issue.

Thus all I think we need to do is:

  • [ ] https://github.com/knative/specs/pull/100
  • [ ] 1.5 release - Mention the upcoming change in the 1.5 release notes
  • [ ] 1.6 release - Fix our timeout handlers to max duration instead of time to first byte.
  • [ ] 1.6 release - Update our conformance tests to assert max duration behaviour of Timeout
  • [ ] 1.6 release - Add an e2e test to handle timeouts for different protocols

but at the same time allow the user to choose FirstByte if needed.

I think this would be covered as a new property - IdleTimeout - https://github.com/knative/serving/issues/10852

dprotaso avatar May 25 '22 19:05 dprotaso

Nader is going to drive this for the 1.6 release

/assign @nader-ziada /unassign @dprotaso

dprotaso avatar May 25 '22 20:05 dprotaso

With https://github.com/knative/serving/issues/11980 we should now be able to make the necessary changes for 1.7

dprotaso avatar Jul 11 '22 21:07 dprotaso