serving icon indicating copy to clipboard operation
serving copied to clipboard

Would it be terrible if: we default to no readiness probe if the user doesnt ask for one

Open julz opened this issue 4 years ago • 19 comments

/area API /area autoscale

Describe the feature

Today, if a user doesnt specify a readiness probe we introduce a default (tcp) probe for them. Since #10741 this is run via both a startup probe (using an intermediate exec probe in queue proxy), and as a readiness probe which runs after the startup probe succeeds. The reason we do this is to avoid flakes when scaling up, ensuring the pod is at least ready to accept tcp connections before routing to it.

Before StartupProbe was available, running this as a ReadinessProbe was really the only way to accomplish this (readiness was the only game in town even tho we only really, I think, cared about startup flakes). However, now that we have StartupProbe it might be possible to only run the tcp probe for the startupprobe, and then not run any readiness probe at all. This is technically a behaviour change since we would currently detect the pod going unready in this way after startup, but it (a) is a closer match to what the user actually asked for -- they didnt specify a probe at all, and wouldnt get one in raw k8s (b) would be a partial workaround for the issue in https://github.com/knative/serving/issues/10973 since we would avoid potentially missing the window to run the readiness probe after the startup probe completes (c) is simpler and more efficient (especially since, due to (b) we need the probe period to be as short as possible..). We also (d) might be able to do even better and do the startup probe in a PostStart lifecycle hook rather than an actual StartupProbe which might fix https://github.com/knative/serving/issues/10973 entirely (while still avoiding flakes on scale-up).

Note: this is a rough thought for discussion. It may (well) have a fatal flaw I haven't spotted yet.

julz avatar Mar 18 '21 13:03 julz

FWIW, I toyed a bit with a PostStart hook executing our binary and that seems to perform in the same ballpark as containers without probes at all.

markusthoemmes avatar Mar 18 '21 14:03 markusthoemmes

This has some great content on the tradeoffs we'd buy into when using lifecycle hooks: https://livebook.manning.com/book/kubernetes-in-action/chapter-17/64.

In a nutshell: Error reporting seems worse, so we prolly need to adjust the way we report errors from pods and the insight into why stuff fails seems to not be great. It'd also be interesting to doublecheck if the pod's IP is appearing as NonReadyAddress before the lifecycle runs, since it's going into a Pending state rather a NotReady state. I haven't checked on these pieces yet at all.

markusthoemmes avatar Mar 18 '21 14:03 markusthoemmes

Tripleposting :joy:

I did investigate the behavior of the postStart hook briefly, and sadly, I was right above: The pod isn't getting any IP while the post-start hook is running, so that'd destroy other optimizations that we have in the activator to completely bypass the readiness signals and just probe the container on our own and send traffic to it ASAP (without waiting for signals to propagate through the API for instance).

markusthoemmes avatar Mar 18 '21 15:03 markusthoemmes

It sounds like this doesn't work for right now?

/close

(Feel free to reopen and add more notes if this should stay open)

evankanderson avatar Mar 19 '21 17:03 evankanderson

@evankanderson: Closing this issue.

In response to this:

It sounds like this doesn't work for right now?

/close

(Feel free to reopen and add more notes if this should stay open)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

knative-prow-robot avatar Mar 19 '21 17:03 knative-prow-robot

/reopen

The bit that doesn't work is the postStart idea, which was a little side comment, not the main point of the issue (the thread's a bit confusing I guess since we side tracked ourselves a bit on a side issue). Tldr I still think it might make sense to avoid the unnecessary readiness check (every second x every pod) that the user didn't ask for since we only really need a startup check to do what we want (avoid flakes on scale up).

julz avatar Mar 19 '21 17:03 julz

@julz: Reopened this issue.

In response to this:

/reopen

The bit that doesn't work is the postStart idea, which was a little side comment, not the main point of the issue (the thread's a bit confusing I guess since we side tracked ourselves a bit on a side issue). Tldr I still think it might make sense to avoid the unnecessary readiness check (every second x every pod) that the user didn't ask for since we only really need a startup check to do what we want (avoid flakes on scale up).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

knative-prow-robot avatar Mar 19 '21 17:03 knative-prow-robot

/kind enhancement /remove-kind feature

Does this need a proposal before implementation?

/triage assigned

evankanderson avatar Mar 22 '21 00:03 evankanderson

@evankanderson: The label(s) triage/assigned cannot be applied, because the repository doesn't have them.

In response to this:

/kind enhancement /remove-kind feature

Does this need a proposal before implementation?

/triage assigned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

knative-prow-robot avatar Mar 22 '21 00:03 knative-prow-robot

/triage accepted

evankanderson avatar Mar 22 '21 00:03 evankanderson

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Jun 20 '21 01:06 github-actions[bot]

Not defaulting in a readiness probe would be useful for the container-freezer, as a readiness probe on a frozen container isn't likely to be very useful (and I'm pretty sure it doesn't make sense to wake up a frozen container every second for a probe :) )

psschwei avatar Nov 01 '21 19:11 psschwei

I don't have an intrinsic concern with this on second thought (my first thought was "oop, don't do that"). It does seem like the container-freezer might want to run the startup probe on un-freeze, and would still need to be able to handle containers with an explicit readiness check.

evankanderson avatar Nov 02 '21 13:11 evankanderson

Copying @julz 's comment on potentially swapping out the readiness probe for a startup probe: https://github.com/knative/serving/pull/12281#discussion_r750459804

psschwei avatar Nov 16 '21 20:11 psschwei

/assign

psschwei avatar Feb 04 '22 19:02 psschwei

... optimizations that we have in the activator to completely bypass the readiness signals and just probe the container on our own and send traffic to it ASAP (without waiting for signals to propagate through the API for instance).

Is there any documentation on this behavior or reasons behind it? I found it quite surprising and only figured it out after debugging why a Pod revision was not switching over to ready, despite the service container reporting as ready. Our service needs several minutes during startup to load a large data model, and the readinessProbe parameters we had chosen were causing funny behavior.

For the use case I have, the extra latency of checking the container status via API would not be a problem. It seems that asking the API for ready state would allow you to use exec probes too, and not change the behavior of the container probe config also.

xtaje avatar Jul 14 '22 12:07 xtaje

It's a bit of a long story (most of which was before my time, but there's a number of issues from early 2021 (?) that provide more details), but the short version is: containers are often "ready" before the API shows they are Ready, so checking the container directly ends up being faster.

psschwei avatar Jul 14 '22 13:07 psschwei

Not defaulting in a readiness probe would be useful for the container-freezer,

FWIW - If I was a user (using container-freezer) and I configure a readiness probe I would expect it to probe while the container is active. But I would expect it to be paused when the container is paused.

I'm unsure if that's possible in K8s right now

dprotaso avatar Aug 02 '22 19:08 dprotaso

Yeah, the case where a user specifically added a probe would be.... tricky, to say the least. I think the order of doing probe updates would go something like:

  • add default startup probe (#10037 )
  • remove default readiness probe (this one)
  • figure out how the heck to handle user readiness probes if freezer is enabled (no issue yet)

psschwei avatar Aug 02 '22 19:08 psschwei