operator-controller icon indicating copy to clipboard operation
operator-controller copied to clipboard

:bug: don't set installed status unless it is successful

Open everettraven opened this issue 1 year ago β€’ 5 comments

Description

It was pointed out in an offline discussion that we still have some areas in our code where we may end up overwriting the Installed status condition fields from True to False. Namely:

  • https://github.com/operator-framework/operator-controller/blob/78b586aa721598e5da4b732572cbc1e93507daa9/internal/controllers/clusterextension_controller.go#L208-L210
  • https://github.com/operator-framework/operator-controller/blob/78b586aa721598e5da4b732572cbc1e93507daa9/internal/controllers/clusterextension_controller.go#L220

In order to fix this, this PR removes setting of the Installed status condition and associated status fields from anywhere in the existing code base except when installation has been successful.

This means that the Installed status condition will:

  • Be missing until the initial install is successful (this reduces duplication of information from the Installed status condition and the Progressing status condition)
  • Be set to True and populated with the most up-to-date information on successful installation and upgrades.

Reviewer Checklist

  • [ ] API Go Documentation
  • [ ] Tests: Unit Tests (and E2E Tests, if appropriate)
  • [ ] Comprehensive Commit Messages
  • [ ] Links to related GitHub Issue(s)

everettraven avatar Oct 11 '24 15:10 everettraven

Deploy Preview for olmv1 ready!

Name Link
Latest commit 21b2777f15c626cd88eb2b53cefbf809c2c53c57
Latest deploy log https://app.netlify.com/sites/olmv1/deploys/67094fd4df38de00080f3236
Deploy Preview https://deploy-preview-1368--olmv1.netlify.app
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

netlify[bot] avatar Oct 11 '24 15:10 netlify[bot]

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 74.90%. Comparing base (78b586a) to head (21b2777). Report is 16 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1368      +/-   ##
==========================================
+ Coverage   74.83%   74.90%   +0.06%     
==========================================
  Files          42       42              
  Lines        2515     2502      -13     
==========================================
- Hits         1882     1874       -8     
+ Misses        449      445       -4     
+ Partials      184      183       -1     
Flag Coverage Ξ”
e2e 56.59% <ΓΈ> (-0.15%) :arrow_down:
unit 52.63% <ΓΈ> (-0.05%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Oct 11 '24 15:10 codecov[bot]

@LalatenduMohanty I could be misremembering, but I was under the impression that merges in this repo default to using the PR title+description for the commit message.

If this is the incorrect understanding, I am happy to squash the commits here and write a better commit message.

everettraven avatar Oct 11 '24 19:10 everettraven

@everettraven Here is an an example https://github.com/operator-framework/operator-controller/commit/fba8473cfd98d99ea74972d7e8951de46ec16eea , the PR had two commits , while merging it squashed it in to one commit but it also kept the commit messages from all commits. So if you add a commit message for a commit it wont be lost after the commits are squashed.

LalatenduMohanty avatar Oct 11 '24 19:10 LalatenduMohanty

Overall the PR looks good to me. Once the we add a improved commit message (as mentioned in my previous comments) I can approve the PR.

LalatenduMohanty avatar Oct 11 '24 19:10 LalatenduMohanty

I'm not sure I agree with all of these changes. It seems like these move us away from level-based and into edge-based logic.

IMO, we should assess the current state of the world and update our status accordingly on every reconcile. We had an issue in catalogd recently that we had to fix to move from edge-based back to level-based logic.

I think there are some improvements we could make to prevent unnecessary flapping, but it seems counter to the eventually consistent assumptions of Kubernetes for us to make assumptions about what happened in the past and disregard what we're actually seeing in the present.

@joelanford while I understand what you are saying, and don't entirely disagree, where I am coming from with this PR is that we want to achieve the following with the Installed status condition:

  • Before an initial installation is successful, we want a user to interpret Installed as False OR Unknown.
  • After an initial installation is successful, we want a user to interpret Installed as True regardless of subsequent failures to upgrade - even if it results in a degredation of the initial installation

When I was looking into this, I was originally looking at this in the same lens of updating our status based on our knowledge of the current state of the world, but this seemed to introduce quite a bit of complexity that I wasn't confident would prevent false-negative flapping (i.e saying something isn't installed when it is). More specifically, these are the challenges I recall running into:

  • If we fail to fetch an installed bundle, receiving an error from the InstalledBundleGetter:
    • How can we be sure that we don't misinform end users when updating the Installed status?
    • Are we okay with duplicating information on the Installed and Progressing status conditions?
  • If we receive a nil BundleMetadata from the InstalledBundleGetter, which based on https://github.com/operator-framework/operator-controller/blob/e5ecec7bba1a5db51dba3a63cc0637ade4f7bdd3/internal/controllers/clusterextension_controller.go#L483-L491, is possible even if we found a release but it has some unknown status conditions:
    • Are we sure that we haven't actually attempted an install before? This problem in particular could probably be remedied by making it such that the InstalledBundleGetter expectation is that if it finds a release, regardless of state, the bundle metadata is returned.
  • If we "fix" the above point and receive a populated BundleMetadata from the InstalledBundleGetter alongside an error due to "invalid" status:
    • How do we ensure that the BundleMetadata we received is for an initial installation attempt vs a failed upgrade attempt?
    • If we add release revision to the BundleMetadata that gives a bit more insight, but my understanding is that revision is increased even on "upgrade" operations to a failed initial install. How can we be confident that a release revision range corresponds to "initial installation attempt"?
  • To potentially resolve the complexities I've called out above, we could add some kind of caching layer as an additional mechanism for cross-checking the information we've received from our InstalledBundleGetter (or the InstalledBundleGetter itself uses it), but then:
    • You add yet another caching layer
    • Should the cache be memory based? What happens if the container crashes and we lose that information? Would potentially overwriting the Installed status condition in this case be acceptable?
    • Should the cache be written to disk to preserve it between potential container crashes/restarts (probably)?

Hopefully my response here shows the complexity involved with managing the Installed condition in the way we've outlined in the RFC when taking a level-based approach. I'm by no means saying one way is right or wrong, but simply calling out the complexities involved.

Going the route of this PR, you resolve the two requirements I shared above by:

  • Not setting the Installed status condition until the installation has actually been successful (observed by reaching a specific part of our code). This means that failures up until that point will be reflected in the Progressing condition. Pairing a non-existent Installed condition and a present Progressing condition should give and end user enough insight to say "This ClusterExtension has not yet successfully installed, but is progressing towards the installed state". If the Installed condition is present and the Progressing condition is signaling errors an end user should be able to conclude that "This ClusterExtension has been successfully installed, but is currently failing to progress through an upgrade".
  • Only setting the Installed status condition when we know for sure that an installation or an upgrade has been successful ensures that we don't run the risk of providing a false-negative on the Installed condition based on our observed state being incomplete.

I'll re-iterate again, I'm not saying one way is right or wrong. I'm simply calling out that there is significant complexity involved in going the route of setting the Installed condition based on what our controller observes (which could be incomplete) and the approach of this PR has significantly less complexity. There is obviously a trade-off either way, but given the "time crunch" we've been under for shipping v1.0.0 I went with this approach because it:

  • Achieves the agreed upon requirements for the Installed condition
  • Is faster to implement and thus allows us to reach our v1.0.0 milestone faster
  • Doesn't prohibit us from spending more time on this to get back to a more level-based approach post-v1.0.0

everettraven avatar Oct 14 '24 14:10 everettraven

/hold adding a hold as we want to discuss the changes first.

LalatenduMohanty avatar Oct 14 '24 18:10 LalatenduMohanty

Superseded by #1379

everettraven avatar Oct 24 '24 13:10 everettraven