Sidecar container's status isn't propagated to KService

seongpyoHong opened this issue 2 years ago • 5 comments

In what area(s)?

/area API

What version of Knative?

0.11.x

Expected Behavior

When I use multiple containers and the container with no ports (the sidecar) is not ready, the KService should also become Ready=False.

Actual Behavior

I use multiple containers, and the one with no ports is not ready:

$ kgp
NAME                                                       READY   STATUS             RESTARTS          AGE
failed-multi-container-00001-deployment-75675ff7fd-lswwl   2/3     CrashLoopBackOff   299 (4m54s ago)   25h

However, the KService's Ready is True:

$  k get ksvc
NAME                     URL                         LATESTCREATED                  LATESTREADY                    READY   REASON
failed-multi-container   https://<internal-domain>   failed-multi-container-00001   failed-multi-container-00001   True

Steps to Reproduce the Problem

Use this template:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: failed-multi-container
spec:
  template:
    spec:
      containerConcurrency: 0
      containers:
      - image: <always-failed-container>
        name: failed-container
      - image: <success-container>
        name: success-container
        ports:
        - containerPort: 17123
          protocol: TCP
        readinessProbe:
          successThreshold: 1
          tcpSocket:
            port: 0
      timeoutSeconds: 300
  traffic:
  - latestRevision: true
    percent: 100

Question

Is there a reason to consider only containers with ports when calculating the Revision's ContainerHealthy condition?

https://github.com/knative/serving/blob/40ebfb69d55aff2e5fc60da77063599d4a3ea61c/pkg/reconciler/revision/reconcile_resources.go#L102-L117

I think checking the status of all containers other than queue-proxy would solve this problem; a sketch of that check follows.
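
A minimal sketch of what that broader check could look like, assuming hypothetical package and helper names (only the corev1 types are real; this is not the actual Knative code):

package health

import (
	corev1 "k8s.io/api/core/v1"
)

// queueProxyName is the sidecar container Knative injects into every pod.
const queueProxyName = "queue-proxy"

// firstUnhealthyUserContainer returns the status of the first container
// other than queue-proxy that is stuck waiting (e.g. CrashLoopBackOff) or
// has terminated, or nil if every user container looks healthy.
func firstUnhealthyUserContainer(pod *corev1.Pod) *corev1.ContainerStatus {
	for i := range pod.Status.ContainerStatuses {
		cs := &pod.Status.ContainerStatuses[i]
		if cs.Name == queueProxyName {
			continue
		}
		if cs.State.Waiting != nil || cs.State.Terminated != nil {
			return cs
		}
	}
	return nil
}

The Revision reconciler could then mark ContainerHealthy=False whenever this returns a non-nil status, instead of only inspecting the container that exposes a port.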

seongpyoHong avatar Oct 26 '23 10:10 seongpyoHong

/triage accepted

dprotaso avatar Oct 26 '23 15:10 dprotaso

@dprotaso Can I work on it if you don't mind?

seongpyoHong avatar Oct 27 '23 00:10 seongpyoHong

@dprotaso I've found something strange.

When I reproduce the situation above, the KService is Ready=Unknown for a while (not sure how long, but maybe a few minutes) after it is created. After that, the KService changes to Ready=True.

When we check the status of the child resources, we see the following differences:

When KService Ready=Unknown

  • Revision's ResourcesAvailable and ContainerHealthy conditions are Unknown (reason: Deploying)
  • PA's ScaleTargetInitialized is Unknown

When KService Ready=True

  • Revision's ResourcesAvailable and ContainerHealthy conditions are True
  • PA's ScaleTargetInitialized is True

Question

In code, https://github.com/knative/serving/blob/7c92928e39eac33564b0bbc65ce53fd0e9cdcbb8/pkg/apis/serving/v1/revision_lifecycle.go#L190-L198

It seems that the PA's ScaleTargetInitialized must be True before the Revision's ResourcesAvailable and ContainerHealthy conditions can become True, which in turn makes the KService Ready=True. (Please let me know if I'm wrong.)
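
As I read the linked block, the gate works roughly like this (a simplified paraphrase with toy types, not the actual Knative API):

package lifecycle

type podAutoscalerStatus struct {
	ScaleTargetInitialized bool
}

type revisionStatus struct {
	ResourcesAvailable bool
	ContainerHealthy   bool
}

// propagateAutoscalerStatus only flips the Revision's conditions to True
// once the PA reports an initialized scale target; until then they keep
// their previous (Unknown) values.
func (rs *revisionStatus) propagateAutoscalerStatus(ps podAutoscalerStatus) {
	if ps.ScaleTargetInitialized {
		rs.ResourcesAvailable = true
		rs.ContainerHealthy = true
	}
}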

The KService I'm having issues with (which I forgot to mention above) uses the HPA autoscaler class. Checking the HPA reconciler code, it appears that the PA's ScaleTargetInitialized is only set to True if the SKS is Ready=True.

https://github.com/knative/serving/blob/7c92928e39eac33564b0bbc65ce53fd0e9cdcbb8/pkg/reconciler/autoscaling/hpa/hpa.go#L93-L105
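
As I read it, that gate reduces to something like this (again a simplified paraphrase with toy types, not the actual Knative API):

package hpa

type serverlessService struct {
	Ready bool
}

type paStatus struct {
	SKSReady               bool
	ScaleTargetInitialized bool
}

// reconcileSKSGate mirrors the shape of the linked block: the PA's scale
// target is only marked initialized once the SKS reports Ready, so an SKS
// that never becomes Ready pins ScaleTargetInitialized at Unknown.
func reconcileSKSGate(sks serverlessService, pa *paStatus) {
	if sks.Ready {
		pa.SKSReady = true
		pa.ScaleTargetInitialized = true
	} else {
		pa.SKSReady = false
		// ScaleTargetInitialized is left untouched here; as I understand
		// the real code, it is a one-way latch once set.
	}
}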

However, the SKS never becomes Ready=True (its Ready condition stays Unknown), regardless of the KService's Ready state:

  conditions:
  - lastTransitionTime: "2023-10-27T05:54:52Z"
    message: Revision is backed by Activator
    reason: ActivatorEndpointsPopulated
    severity: Info
    status: "True"
    type: ActivatorEndpointsPopulated
  - lastTransitionTime: "2023-10-27T05:54:52Z"
    message: K8s Service is not ready
    reason: NoHealthyBackends
    status: Unknown
    type: EndpointsPopulated
  - lastTransitionTime: "2023-10-27T05:54:52Z"
    message: K8s Service is not ready
    reason: NoHealthyBackends
    status: Unknown
    type: Ready
  observedGeneration: 1
  privateServiceName: failed-multi-container-00001-private
  serviceName: failed-multi-container-00001

Is there an expected reason for this behavior?

seongpyoHong avatar Oct 28 '23 06:10 seongpyoHong

@dprotaso Sorry to bother you. Could you take a look at the situation above?

seongpyoHong avatar Nov 29 '23 01:11 seongpyoHong

Hey @seongpyoHong - do you mind retesting your scenario with the latest Serving release that came out two weeks ago (https://github.com/knative/serving/releases/tag/knative-v1.13.1)?

That release includes https://github.com/knative/serving/pull/14795, which fixes how we propagate status conditions.

In my testing with your example, I see the Knative Service goes into a 'failed' state and remains there.

dprotaso avatar Feb 17 '24 04:02 dprotaso

I confirmed this is still a problem.

dprotaso avatar Mar 08 '24 20:03 dprotaso