
Operator is stuck in a "deploying" phase loop when internal deployment fails indefinitely

Open mdebarros opened this issue 1 year ago • 5 comments

What happened?

When an Integration deployed with the Camel K Operator fails to deploy, because of either a missing dependency or an underlying API incompatibility, the Integration gets stuck in the Deploying phase.

The issue is that the Integration phase never moves to an Error state, even when the cause of the deployment failure cannot self-heal (e.g. an API incompatibility).

I have two examples of when/where this occurs:

  1. Deploying an Integration to Camel K Operator with startupProbe enabled on the Health Trait (Ref) when using the Knative Serving v1 API, which DOES NOT support startupProbes (Ref)
  Type     Reason                       Age               From                            Message
  ----     ------                       ----              ----                            -------
  Warning  IntegrationError             1s (x11 over 6s)  camel-k-integration-controller  Cannot reconcile Integration template-connector-1: error executing post actions: error during apply resource: gc-generic/template-connector-1: failed to create typed patch object (gc-generic/template-connector-1; serving.knative.dev/v1, Kind=Service): .spec.template.spec.containers[0].startupProbe: field not declared in schema
  2. Deploying an Integration to Camel K Operator with a missing dependency (e.g. ConfigMap)
  Type     Reason            Age                     From                            Message
  ----     ------            ----                    ----                            -------
  Warning  IntegrationError  2m46s (x288 over 3d3h)  camel-k-integration-controller  Cannot reconcile Integration template-connector-1: error during trait customization: master trait configuration failed: ConfigMap "template-connector-1-openapi-000" not found

In both cases the Camel K Operator reports the above issues as Warning events when describing the deployment. I'm not sure whether these should be classified as errors instead, or at least moved to an error state after X retries/backoff attempts.

I believe it's crucial that an Integration should "eventually" end up in an error state if the issue cannot be self-healed, as in the example use cases above.
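The "error after X retries" idea can be pictured as a small retry budget around the reconcile step. The sketch below is purely illustrative and is not how the Camel K operator is implemented: `ReconcileState`, `next_state`, `MAX_ATTEMPTS`, and `NON_RETRYABLE_MARKERS` are hypothetical names invented for this example, and the threshold of 5 is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical settings for illustration only; the operator defines no such values.
MAX_ATTEMPTS = 5
NON_RETRYABLE_MARKERS = ("field not declared in schema",)


@dataclass
class ReconcileState:
    phase: str = "Deploying"
    failed_attempts: int = 0


def next_state(state: ReconcileState, error: Optional[str]) -> ReconcileState:
    """Return the state after one reconcile attempt.

    error is None on success, otherwise the reconcile error message.
    """
    if error is None:
        return ReconcileState(phase="Running", failed_attempts=0)
    attempts = state.failed_attempts + 1
    # Errors that can never self-heal go straight to Error.
    if any(marker in error for marker in NON_RETRYABLE_MARKERS):
        return ReconcileState(phase="Error", failed_attempts=attempts)
    # Possibly transient errors (e.g. a missing ConfigMap) get a bounded budget.
    phase = "Error" if attempts >= MAX_ATTEMPTS else "Deploying"
    return ReconcileState(phase=phase, failed_attempts=attempts)
```

With this shape, the schema error from the first example would fail on the first attempt, while the missing ConfigMap would keep retrying until the budget is spent.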

Steps to reproduce

  1. Set up a K8s cluster with the following:
    1. Camel K Operator v2.1.0
    2. Knative service v1.12.0
  2. Deploy a Camel K Integration with the Health Trait enabled (health.enabled=true), and the startupProbe enabled (health.startup-probe-enabled=true)

Relevant log output

  Last Init Timestamp:  2024-01-18T12:48:11Z
  Observed Generation:  1
  Phase:                Deploying
  Platform:             camel-k
  Profile:              Knative
  Runtime Provider:     quarkus
  Runtime Version:      3.2.0
  Version:              2.1.0
Events:
  Type     Reason                       Age               From                            Message
  ----     ------                       ----              ----                            -------
  Normal   IntegrationConditionChanged  9s                camel-k-integration-controller  Condition "IntegrationPlatformAvailable" is "True" for Integration template-connector-1: camel-k/camel-k
  Normal   IntegrationPhaseUpdated      9s                camel-k-integration-controller  Integration "template-connector-1" in phase "Initialization"
  Normal   IntegrationPhaseUpdated      6s (x2 over 9s)   camel-k-integration-controller  Integration "template-connector-1" in phase "Building Kit"
  Normal   IntegrationConditionChanged  6s                camel-k-integration-controller  Condition "IntegrationKitAvailable" is "True" for Integration template-connector-1: kit-cmkhrk8ve68c73e60i30
  Normal   IntegrationPhaseUpdated      6s                camel-k-integration-controller  Integration "template-connector-1" in phase "Deploying"
  Warning  IntegrationError             1s (x11 over 6s)  camel-k-integration-controller  Cannot reconcile Integration template-connector-1: error executing post actions: error during apply resource: gc-generic/template-connector-1: failed to create typed patch object (gc-generic/template-connector-1; serving.knative.dev/v1, Kind=Service): .spec.template.spec.containers[0].startupProbe: field not declared in schema

  conditions:
  - firstTruthyTime: "2024-01-18T12:48:11Z"
    lastTransitionTime: "2024-01-18T12:48:11Z"
    lastUpdateTime: "2024-01-18T12:48:11Z"
    message: camel-k/camel-k
    reason: IntegrationPlatformAvailable
    status: "True"
    type: IntegrationPlatformAvailable
  - firstTruthyTime: "2024-01-18T12:48:14Z"
    lastTransitionTime: "2024-01-18T12:48:14Z"
    lastUpdateTime: "2024-01-18T12:48:14Z"
    message: kit-cmkhrk8ve68c73e60i30
    reason: IntegrationKitAvailable
    status: "True"
    type: IntegrationKitAvailable

Camel K version

2.1.0

mdebarros avatar Jan 19 '24 09:01 mdebarros

Thanks for reporting, I can confirm this behavior. We will have a look at this.

claudio4j avatar Jan 19 '24 16:01 claudio4j

As a workaround, can you use the integration with health disabled?

claudio4j avatar Jan 19 '24 16:01 claudio4j

> As a workaround, can you use the integration with health disabled?

Hey @claudio4j,

Thanks so much for validating the issue.

We can just disable the startupProbe (traits.health.startupProbeEnabled: false)...

However, that is not really the main issue in my opinion.

It's more so that the Integration is stuck in a "deploying" loop indefinitely, and never ends up in an error state.

Just to confirm, is the above described "deploying" behaviour on such deployment failure scenarios intended?

Note that if you were deploying Integrations through an automated CI process and one got stuck in this kind of loop, it would be quite hard to detect or report the deployment failure, since technically the Integration never fails. Currently, it would require someone to be aware of the deployment and eyeball the issue.
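Until the phase reflects the failure, a CI job can only protect itself with a deadline. Here is a minimal sketch of such a watchdog; `wait_for_integration` and its parameters are hypothetical names for this example, and how `get_phase` obtains the phase is left to the caller (for instance by reading `.status.phase` of the Integration resource).

```python
import time
from typing import Callable


def wait_for_integration(
    get_phase: Callable[[], str],
    timeout_s: float = 300.0,
    poll_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll the Integration phase until it settles or the deadline passes.

    Returns "Running" or "Error" if the Integration reaches that phase,
    or "Timeout" if it is still in any other phase (e.g. "Deploying")
    once timeout_s seconds have elapsed.
    """
    deadline = clock() + timeout_s
    while True:
        phase = get_phase()
        if phase in ("Running", "Error"):
            return phase
        if clock() >= deadline:
            return "Timeout"
        sleep(poll_s)
```

A pipeline would treat both "Error" and "Timeout" as a failed deployment. This only works around the problem: the timeout is a guess, and it cannot tell a slow deployment from one that will never succeed.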

mdebarros avatar Jan 22 '24 09:01 mdebarros

> is the above described "deploying" behaviour on such deployment failure scenarios intended?

Not on purpose. As you noted above, the camel-k-operator tries to set a field startupProbe on a knative service, which is not supported and the error is not handled, leaving the Integration object with the wrong status. In the knative-service trait we have to handle this scenario (health enabled) and not set that field. We have to document this case and have testing in place.
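To illustrate the "do not set that field" handling, the sketch below removes `startupProbe` from every container of a Knative Service manifest represented as a plain dictionary. It is an illustration in Python only, not the operator's actual code; `strip_startup_probe` is a hypothetical helper, and the field path is taken from the error message reported above.

```python
import copy


def strip_startup_probe(service: dict) -> dict:
    """Return a copy of a Knative Service manifest without startupProbe fields.

    The input is left unmodified. Only .spec.template.spec.containers[*]
    is inspected, matching the path in the reported schema error.
    """
    cleaned = copy.deepcopy(service)
    containers = (
        cleaned.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("containers", [])
    )
    for container in containers:
        # Other probes (liveness, readiness) are kept as they are.
        container.pop("startupProbe", None)
    return cleaned
```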

claudio4j avatar Jan 22 '24 16:01 claudio4j

I can take a look at this.

realMartinez avatar Feb 15 '24 10:02 realMartinez

@realMartinez are you actively working on this?

squakez avatar Mar 15 '24 14:03 squakez

> @realMartinez are you actively working on this?

I have been working on another issue, so I have put this on hold for now.

realMartinez avatar Mar 18 '24 10:03 realMartinez

Okay. I removed the assignment, as I may have a look at this.

squakez avatar Mar 18 '24 10:03 squakez