serving
serving copied to clipboard
Curling a pod with failed readinessProbe should not hang
What version of Knative?
all recent versions
Expected Behavior
Curling a pod with a readiness probe set (with periodSeconds=0 to avoid the behaviour in https://github.com/knative/serving/issues/10764), should either error immediately or succeed due to us scaling things up / terminating the pod (needs discussion about what the right behaviour is!).
Actual Behavior
Curling a pod with a failed readiness check hangs (unless there are other scaled-up pods for the request). The pod is never killed, because there is no liveness probe, but the set of routable pods is empty. In upstream k8s you would get a "no healthy upstream" error, in knative your curl hangs 😢.
Steps to Reproduce the Problem
ko apply -f https://github.com/julz/readybutnotsteady/blob/main/service-no-period.yaml- Wait for startup to succeed
curl readynotsteady.default.$domain/start-failing- Wait a few seconds for the pod to go unready
curl readynotsteady.default.$domain(hangs).
/area networking /area autoscale /triage accepted
It sounds like we have a test case for this, so it just needs hands to work it?
I think we need to settle on what the right behaviour is here: hanging seems pretty wrong and unexpectedly different from k8s, but should we scale up (so, ignore non-ready pods for purposes of desired, at least eventually), terminate the pod (so effectively introduce an implicit liveness check), error out after timeout if nothing goes ready again, or just document that readiness probes are a real bad idea for knative :/ ?.
This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.
/remove-lifecycle stale. /reopen /lifecycle frozen
To clarify, the behavior is to hang if the activator is in the request path.
I just ran a setup with target-burst-capacity: 0 configured globally, and was startled that timing out (the behavior with TBC=200) was a conformance requirement. 🤔
Given the following comment, I am inclined to relax things to "requests fail", so that conformance doesn't have a hard requirement that things match the upstream semantics bug-for-bug: https://github.com/knative/serving/blob/80be89b59e2dd2ed5c351cec9d729fa1b4e450a3/test/conformance/runtime/readiness_probe_test.go#L163-L165
I'm slightly confused by conformance having anything to say about "curl the pod", since conformance shouldn't talk about Pods at all (and some implementations don't have them).
In fact, I don't see why this is in the conformance tests at all, and it seems entirely conformant to start a second pod if the first pod is reporting not ready.