
Integration reporting error before the progress deadline timeout expires

Open lsergio opened this issue 1 year ago • 8 comments

What happened?

I have Camel-K running on an EKS cluster with autoscaling groups scaling up to 20 nodes. At the moment this was reported, 8 nodes were running, and I created a new Integration object.

There was no room for a new pod on the running nodes, so the autoscaler started spawning a new one. However, the Integration immediately reported an Error, even before the Deployment progress deadline expired.

This is the Integration report:

  - lastTransitionTime: "2024-05-27T17:42:16Z"
    lastUpdateTime: "2024-05-27T17:42:16Z"
    message: '0/8 nodes are available: 1 node(s) were unschedulable, 7 Insufficient
      cpu. preemption: 0/8 nodes are available: 1 Preemption is not helpful for scheduling,
      7 No preemption victims found for incoming pod..'
    reason: Error
    status: "False"
    type: Ready

And this is the Deployment status:

status:
  conditions:
  - lastTransitionTime: "2024-05-27T17:42:16Z"
    lastUpdateTime: "2024-05-27T17:42:16Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2024-05-27T17:42:16Z"
    lastUpdateTime: "2024-05-27T17:42:16Z"
    message: ReplicaSet "deploy-4f2a232d-fec3-42ad-b437-b5c47fcf1804-copy-5dbc986949"
      is progressing.
    reason: ReplicaSetUpdated
    status: "True"
    type: Progressing

As we can see, the deployment is still progressing.

I expected the status to be Error only after the progress deadline expired.

Steps to reproduce

No response

Relevant log output

No response

Camel K version

2.2.0

lsergio avatar May 27 '24 18:05 lsergio

After the new node is created, the Integration status changes to Ready = true. The impact on my side is that the Error triggers the wrong workflow on my monitoring application.

lsergio avatar May 27 '24 18:05 lsergio

I think the correct way to monitor a healthy Integration is to watch both .status.phase and .conditions[READY]==true, and ideally you should also enable the readiness probe via the health trait to make sure the Camel context is ready. This is because you probably don't want to blindly trust the Kubernetes Deployment (which, as you can see, is not reporting an error status) but rather the Camel context, which is the application layer that knows whether something is healthy via its internal mechanisms.
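As a minimal sketch of the check suggested above (the Condition struct and the "Running" phase string are simplified stand-ins for the real Camel K API types, not the actual client code):

```go
package main

import "fmt"

// Condition mirrors the shape of an Integration status condition
// (fields simplified; the real type lives in the Camel K API package).
type Condition struct {
	Type   string
	Status string
	Reason string
}

// isHealthy applies the suggested check: the Integration must be in the
// Running phase AND its Ready condition must be True.
func isHealthy(phase string, conds []Condition) bool {
	if phase != "Running" {
		return false
	}
	for _, c := range conds {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false
}

func main() {
	conds := []Condition{{Type: "Ready", Status: "False", Reason: "Error"}}
	fmt.Println(isHealthy("Running", conds)) // false: Ready is not True
}
```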

squakez avatar May 28 '24 09:05 squakez

In this specific case, I think the Deployment is correct in not reporting an error before the deadline expires. Per the docs, the Deployment status should change to ProgressDeadlineExceeded only after the default 10-minute timeout (or the progressDeadlineSeconds value) expires. And it does:

  conditions:
  - lastTransitionTime: "2024-05-28T11:23:11Z"
    lastUpdateTime: "2024-05-28T11:23:11Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2024-05-28T11:24:12Z"
    lastUpdateTime: "2024-05-28T11:24:12Z"
    message: ReplicaSet "deploy-2f9e3f35-141a-46e6-a264-f5b82ad00adb-55659fcd77" has
      timed out progressing.
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing

For monitoring purposes, I have enabled the Health trait, and I consider the Integration to be healthy when the Ready condition is true and the KnativeServiceReady condition is also true when applicable. This allows me to detect when an Integration is successfully deployed.

My other use case, though, is to detect when an Integration is failing due to a bad component configuration that causes the CamelContext to not start. In this scenario, the Ready condition will be false, but I still need to check the reason or phase to distinguish between a Camel Context that is still starting up and one that has failed. When it fails, the reason changes to Error and I can trigger an alert.

Having that Error status also when the deployment is still in progress leads me to a false alert.
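The alerting rule described above can be sketched as follows (the Condition struct is a simplified stand-in for the Camel K API type, and the non-Error reason strings are illustrative):

```go
package main

import "fmt"

// Condition mirrors the shape of an Integration status condition.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// shouldAlert returns true for the failure case described above: Ready is
// False and the reason has moved past "still starting" to Error. Which is
// exactly why a transient scheduling Error produces a false alert.
func shouldAlert(conds []Condition) bool {
	for _, c := range conds {
		if c.Type == "Ready" && c.Status == "False" && c.Reason == "Error" {
			return true
		}
	}
	return false
}

func main() {
	conds := []Condition{{Type: "Ready", Status: "False", Reason: "Error"}}
	fmt.Println(shouldAlert(conds)) // true
}
```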

lsergio avatar May 28 '24 11:05 lsergio

I checked the source code, and it looks like the Deployment monitor does wait for the ProgressDeadlineExceeded status before reporting an Integration error.

It seems there's something else causing the Integration to report an Error.

lsergio avatar May 28 '24 11:05 lsergio

After reading this method, I figured out what is happening:

While there are no available nodes, the integration pod status is Pending and it reports:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-05-28T12:00:15Z"
    message: '0/6 nodes are available: 6 Insufficient cpu. preemption: 0/6 nodes are
      available: 6 No preemption victims found for incoming pod..'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable

The monitor detects the Unschedulable reason and sets the Error reason in the Integration. When the new node is ready, that condition changes to:

  - lastProbeTime: null
    lastTransitionTime: "2024-05-28T12:01:07Z"
    status: "True"
    type: PodScheduled

There probably is a good reason for checking the pending Pod statuses, but shouldn't it be enough to check the Deployment status? Any issue with the pods will (or should) be reflected in the Deployment status.

For my specific monitoring case, I will try checking the Deployment status: if it is still Progressing, I will ignore the Error reason.
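A rough sketch of that client-side workaround, assuming the monitoring tool already has the Integration reason and the Deployment conditions in hand (the struct is a simplified stand-in for appsv1.DeploymentCondition):

```go
package main

import "fmt"

// DeploymentCondition mirrors appsv1.DeploymentCondition in shape.
type DeploymentCondition struct {
	Type   string
	Status string
	Reason string
}

// errorIsActionable treats the Integration's Error reason as a real failure
// only once the Deployment has stopped progressing (Progressing == False,
// e.g. ProgressDeadlineExceeded). While the rollout is in progress, the
// transient Unschedulable-driven Error is suppressed.
func errorIsActionable(integrationReason string, deployConds []DeploymentCondition) bool {
	if integrationReason != "Error" {
		return false
	}
	for _, c := range deployConds {
		if c.Type == "Progressing" && c.Status == "True" {
			return false // still rolling out: ignore the transient Error
		}
	}
	return true
}

func main() {
	progressing := []DeploymentCondition{
		{Type: "Progressing", Status: "True", Reason: "ReplicaSetUpdated"},
	}
	fmt.Println(errorIsActionable("Error", progressing)) // false: suppress the alert
}
```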

lsergio avatar May 28 '24 12:05 lsergio

The problem is that we need to know whether the application is really starting or not, which is why we check the Pod as well. The Deployment would not report an application failure, only a "Deployment" failure (i.e., it cannot schedule for some reason).

squakez avatar May 28 '24 12:05 squakez

I see. Well, one suggestion I have is to check the Deployment's Progressing condition. While it is True, keep checking the Pods, but do not check for the Unschedulable reason. This would still catch more severe conditions, like an ImagePullBackOff, and the Integration Ready condition would be false.

When the Deployment times out, its Progressing condition will change to False; at that point we could check the Pods, get the error message from the Unschedulable ones, and set the Integration Ready condition to False with the Error reason.
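The proposed operator-side behavior could look roughly like this (all names here are illustrative, not the Camel K API; in particular "DeploymentProgressing" is a hypothetical reason string for the not-yet-errored state):

```go
package main

import "fmt"

// classifyPod sketches the proposal: while the Deployment is still within its
// progress deadline, an Unschedulable pod keeps the Integration not-ready but
// not errored; harder failures (e.g. ImagePullBackOff) and any failure after
// the deadline map to the Error reason.
func classifyPod(podReason string, deadlineExceeded bool) string {
	switch {
	case podReason == "Unschedulable" && !deadlineExceeded:
		return "DeploymentProgressing" // hypothetical reason: still waiting for a node
	case podReason == "Unschedulable" && deadlineExceeded:
		return "Error"
	case podReason == "ImagePullBackOff":
		return "Error" // severe pod failure: report immediately
	default:
		return "DeploymentProgressing"
	}
}

func main() {
	fmt.Println(classifyPod("Unschedulable", false)) // DeploymentProgressing
	fmt.Println(classifyPod("Unschedulable", true))  // Error
}
```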

lsergio avatar May 28 '24 12:05 lsergio

The reason for using the Pod rather than the Deployment is that, depending on a number of factors, Camel K can generate a Knative Service, a Job, or a Deployment (and maybe a resource that I don't recall), which makes the Pod the only common denominator among the generated resources.

However, I don't think we handle this case very well, and I agree: it should not mark the Integration as errored but rather as progressing or something along those lines. That said, I don't know how complex that would be.

lburgazzoli avatar May 28 '24 12:05 lburgazzoli