serving Curling a pod with failed readinessProbe should not hang

trafficstars

What version of Knative?

all recent versions

Expected Behavior

Curling a pod with a readiness probe set (with periodSeconds=0 to avoid the behaviour in https://github.com/knative/serving/issues/10764), should either error immediately or succeed due to us scaling things up / terminating the pod (needs discussion about what the right behaviour is!).

Actual Behavior

Curling a pod with a failed readiness check hangs (unless there are other scaled-up pods for the request). The pod is never killed, because there is no liveness probe, but the set of routable pods is empty. In upstream k8s you would get a "no healthy upstream" error, in knative your curl hangs 😢.

Steps to Reproduce the Problem

ko apply -f https://github.com/julz/readybutnotsteady/blob/main/service-no-period.yaml
Wait for startup to succeed
curl readynotsteady.default.$domain/start-failing
Wait a few seconds for the pod to go unready
curl readynotsteady.default.$domain (hangs).

Feb 12 '21 17:02 julz

/area networking /area autoscale /triage accepted

It sounds like we have a test case for this, so it just needs hands to work it?

Mar 15 '21 23:03 evankanderson

I think we need to settle on what the right behaviour is here: hanging seems pretty wrong and unexpectedly different from k8s, but should we scale up (so, ignore non-ready pods for purposes of desired, at least eventually), terminate the pod (so effectively introduce an implicit liveness check), error out after timeout if nothing goes ready again, or just document that readiness probes are a real bad idea for knative :/ ?.

Mar 22 '21 11:03 julz

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

Jun 21 '21 01:06 github-actions[bot]

/remove-lifecycle stale. /reopen /lifecycle frozen

Jul 08 '21 20:07 julz

To clarify, the behavior is to hang if the activator is in the request path.

I just ran a setup with target-burst-capacity: 0 configured globally, and was startled that timing out (the behavior with TBC=200) was a conformance requirement. 🤔

Given the following comment, I am inclined to relax things to "requests fail", so that conformance doesn't have a hard requirement that things match the upstream semantics bug-for-bug: https://github.com/knative/serving/blob/80be89b59e2dd2ed5c351cec9d729fa1b4e450a3/test/conformance/runtime/readiness_probe_test.go#L163-L165

Jun 12 '22 18:06 mattmoor

I'm slightly confused by conformance having anything to say about "curl the pod", since conformance shouldn't talk about Pods at all (and some implementations don't have them).

In fact, I don't see why this is in the conformance tests at all, and it seems entirely conformant to start a second pod if the first pod is reporting not ready.

Jun 14 '22 20:06 evankanderson

serving serving copied to clipboard

Curling a pod with failed readinessProbe should not hang

What version of Knative?

Expected Behavior

Actual Behavior

Steps to Reproduce the Problem

serving
serving copied to clipboard