operator-lifecycle-manager icon indicating copy to clipboard operation
operator-lifecycle-manager copied to clipboard

Sanitize usage of `status.Conditions` in olm owned APIs according to accepted kube community standards

Open anik120 opened this issue 3 years ago • 1 comments

Feature Request

Is your feature request related to a problem? Please describe.

The Subscription API uses status.Conditions to communicate the current state of the Subscription. For example, when a Subscription is created for a non-existent channel, the condition ResolutionFailed communicates that there was an error in resolving the subscription:

- lastTransitionTime: "2022-04-07T21:31:25Z"
      message: all available catalogsources are healthy
      reason: AllCatalogSourcesHealthy
      status: "False"
      type: CatalogSourcesUnhealthy
    - message: 'constraints not satisfiable: no operators found in channel stabl-1.5
        of package camel-k in the catalog referenced by subscription camelk, subscription
        camelk exists'
      reason: ConstraintsNotSatisfiable
      status: "True"
      type: ResolutionFailed

However, when a correct channel is mentioned in the Subscription object (i.e there's no error in resolving), the conditions do not communicate anything about the status of the resolution:

When the InstallPlan is created:

conditions:
    - lastTransitionTime: "2022-04-07T21:35:47Z"
      message: all available catalogsources are healthy
      reason: AllCatalogSourcesHealthy
      status: "False"
      type: CatalogSourcesUnhealthy
    - lastTransitionTime: "2022-04-07T21:35:50Z"
      message: 'unpack job not completed: Unpack pod(olm/3f908d6cf414c2c4b69cf457647886998c2e45e04d696fd2922b2afd50l9scd)
        container(pull) is pending. Reason: PodInitializing, Message: .'
      reason: Installing
      status: "True"
      type: InstallPlanPending

When the InstallPlan has succeeded:

conditions:
    - lastTransitionTime: "2022-04-07T21:35:47Z"
      message: all available catalogsources are healthy
      reason: AllCatalogSourcesHealthy
      status: "False"
      type: CatalogSourcesUnhealthy

According to the standard followed by the Kube community, there are two problems with the way these conditions are laid out today:

a) The more problematic issue:

According to the doc:

Controllers should apply their conditions to a resource the first time they visit the resource, even if the status is Unknown. This allows other components in the system to know that the condition exists and the controller is making progress on reconciling that resource.

So when other components in the system want to know if the Subscription has failed due to CatalogSourceUnhealthy, the condition is there for them to get that answer from. That makes sense since

Controllers should apply their conditions to a resource the first time they visit the resource

When the Subscription controller checks the CatalogSource, it applies the condition with either true or false.

However, when the other components want to know if the Subscription has failed due to a resolution error, it cannot reliable find the answer. The only time the components can get that answer is when the resolution has actually failed.

According to the kube doc:

Not all controllers will observe the previous advice about reporting "Unknown" or "False" values. For known conditions, the absence of a condition status should be interpreted the same as Unknown, and typically indicates that reconciliation has not yet finished (or that the resource state may not yet be observable).

the component querying the Subscription conditions would interpret the ResolutionFailed as "resolution has not yet finished(or that the resource state may not yet be observable)", which is an incorrect representation of the Subscription when it is in the Succeeded phase.

b) Another issue:

According to the doc:

Condition type names should describe the current observed state of the resource, rather than describing the current state transitions. This typically means that the name should be an adjective ("Ready", "OutOfDisk") or a past-tense verb ("Succeeded", "Failed") rather than a present-tense verb ("Deploying"). Intermediate states may be indicated by setting the status of the condition to Unknown.

So the condition InstallPlanPending is breaking that rule all the way through. Going by that ☝🏽 rule, that condition should probably be InstallPlanSucceeded condition, with False and Reason: InstallPlanPending Message: unpack job not completed: Unpack pod(olm/3f908d6cf414c2c4b69cf457647886998c2e45e04d696fd2922b2afd50l9scd) container(pull) is pending. Reason: PodInitializing, Message: .

Describe the solution you'd like

Sanitize the usage of these conditions to reflect the current state of the object. So a failed subscription due to resolution error should look like:

- lastTransitionTime: "2022-04-07T21:31:25Z"
      message: all available catalogsources are healthy
      reason: AllCatalogSourcesHealthy
      status: "False"
      type: CatalogSourcesUnhealthy
    - message: 'constraints not satisfiable: no operators found in channel stabl-1.5
        of package camel-k in the catalog referenced by subscription camelk, subscription
        camelk exists'
      reason: ConstraintsNotSatisfiable
      status: "True"
      type: ResolutionFailed

a subscription that has a pending Installplan should look like:

conditions:
    - lastTransitionTime: "2022-04-07T21:35:47Z"
      message: all available catalogsources are healthy
      reason: AllCatalogSourcesHealthy
      status: "False"
      type: CatalogSourcesUnhealthy
    - lastTransitionTime: "2022-04-07T21:35:50Z"
      message: 'unpack job not completed: Unpack pod(olm/3f908d6cf414c2c4b69cf457647886998c2e45e04d696fd2922b2afd50l9scd)
        container(pull) is pending. Reason: PodInitializing, Message: .'
      reason: Installing
      status: "False"
      type: InstallPlanSucceeded

and a successful subscription should look like:

conditions:
    - lastTransitionTime: "2022-04-07T21:35:47Z"
      message: all available catalogsources are healthy
      reason: AllCatalogSourcesHealthy
      status: "False"
      type: CatalogSourcesUnhealthy
    - lastTransitionTime: "2022-04-07T21:35:47Z"
      message: 'resources created for operator installation'
      reason: CreatedCSV/s
      status: "True"
      type: InstallPlanSucceeded
    - lastTransitionTime: "2022-04-07T21:35:47Z"
      message: 'all constraints satisfied'
      reason: ConstrainstSatisfied
      status: "False"
      type: ResolutionFailed

The goal of this being:

Conditions should be added to explicitly convey properties that users and components care about rather than requiring those properties to be inferred from other observations

Right now, a successful resolution has to be inferred by querying for the CSV, which a) is a bad UX b) these other components might not have the capability for, which translates to "go build this capability into your component just for OLM APIs even though every other owned API don't need you to do that"

Open question:

It doesn't look like it'll take too much effort to solve problem a) stated above. But for problem b), i.e the InstallPlanPending being a condition that is describing a state transition, in the present tense, and warranting to be changed to InstallPlanSucceeded, the API guarantee clause needs to be discussed

Once defined, the meaning of a Condition can not be changed arbitrarily - it becomes part of the API, and has the same backwards- and forwards-compatibility concerns of any other part of the API.

Stretch goal: Audit and sanitize usage of condition for all OLM owned APIs (InstallPlan, CSV, etc)

anik120 avatar Apr 08 '22 14:04 anik120

This seems like a part of the OLM tech debt effort that we should address.

In the newer APIs, we are doing a much better job with status management and updating conditions, so I think this issue will not be as prevalent.

exdx avatar May 12 '22 19:05 exdx