
How to detect a permanent service failure?

Open lsergio opened this issue 1 year ago • 0 comments

Ask your question here:

Hi there.

I'm facing a situation where I need to detect that a Knative Service will never be Ready because its Deployment progress deadline expired. This would happen, for example, when my cluster has no more resources to create new pods.

When I create the Knative Service and check its status, I see the conditions:

    conditions:
    - lastTransitionTime: "2024-08-21T12:33:36Z"
      message: 'Revision "rest-1-00001" failed with message: 0/2 nodes are available:
        2 Too many pods. preemption: 0/2 nodes are available: 2 No preemption victims
        found for incoming pod..'
      reason: RevisionFailed
      status: "False"
      type: ConfigurationsReady
    - lastTransitionTime: "2024-08-21T12:33:36Z"
      message: Configuration "rest-1" does not have any ready Revision.
      reason: RevisionMissing
      status: "False"
      type: Ready
    - lastTransitionTime: "2024-08-21T12:33:36Z"
      message: Configuration "rest-1" does not have any ready Revision.
      reason: RevisionMissing
      status: "False"
      type: RoutesReady

The Ready condition is False with RevisionMissing reason.

After the progress deadline expires, I see the conditions:

    conditions:
    - lastTransitionTime: "2024-08-21T12:29:42Z"
      message: 'Revision "rest-1-00001" failed with message: Initial scale was never
        achieved.'
      reason: RevisionFailed
      status: "False"
      type: ConfigurationsReady
    - lastTransitionTime: "2024-08-21T12:27:11Z"
      message: Configuration "rest-1" does not have any ready Revision.
      reason: RevisionMissing
      status: "False"
      type: Ready
    - lastTransitionTime: "2024-08-21T12:27:11Z"
      message: Configuration "rest-1" does not have any ready Revision.
      reason: RevisionMissing
      status: "False"
      type: RoutesReady

The messages have changed, but the reasons are still the same.
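To illustrate the problem, here is a minimal Python sketch that compares the two snapshots above. The conditions are copied by hand from the status output (timestamps omitted), and the helper names `signature` and `changed_fields` are made up for this example; they are not part of any Knative client library.

```python
# Conditions from the first snapshot (before the progress deadline expired).
before = [
    {"type": "ConfigurationsReady", "status": "False", "reason": "RevisionFailed",
     "message": 'Revision "rest-1-00001" failed with message: 0/2 nodes are available: '
                "2 Too many pods. preemption: 0/2 nodes are available: "
                "2 No preemption victims found for incoming pod.."},
    {"type": "Ready", "status": "False", "reason": "RevisionMissing",
     "message": 'Configuration "rest-1" does not have any ready Revision.'},
    {"type": "RoutesReady", "status": "False", "reason": "RevisionMissing",
     "message": 'Configuration "rest-1" does not have any ready Revision.'},
]

# Conditions from the second snapshot (after the progress deadline expired).
after = [
    {"type": "ConfigurationsReady", "status": "False", "reason": "RevisionFailed",
     "message": 'Revision "rest-1-00001" failed with message: '
                "Initial scale was never achieved."},
    {"type": "Ready", "status": "False", "reason": "RevisionMissing",
     "message": 'Configuration "rest-1" does not have any ready Revision.'},
    {"type": "RoutesReady", "status": "False", "reason": "RevisionMissing",
     "message": 'Configuration "rest-1" does not have any ready Revision.'},
]


def signature(conditions):
    """Map each condition type to its machine-readable (status, reason) pair."""
    return {c["type"]: (c["status"], c["reason"]) for c in conditions}


def changed_fields(old, new):
    """Return, per condition type, which of status/reason/message differ."""
    old_by_type = {c["type"]: c for c in old}
    diff = {}
    for cond in new:
        prev = old_by_type.get(cond["type"], {})
        fields = [f for f in ("status", "reason", "message")
                  if prev.get(f) != cond.get(f)]
        if fields:
            diff[cond["type"]] = fields
    return diff


# The machine-readable parts are identical in both snapshots ...
print(signature(before) == signature(after))
# ... and only the free-text message of ConfigurationsReady differs.
print(changed_fields(before, after))
```

So anything that looks only at `type`, `status` and `reason` on the Service cannot tell the two states apart.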

What would be the recommended way of detecting that the Revision has failed permanently, without relying on parsing error messages?

Thanks for any help!

lsergio, Aug 21 '24 12:08