:bug: don't set installed status unless it is successful
Description
It was pointed out in an offline discussion that we still have some areas in our code where we may end up overwriting the Installed status condition fields from True to False. Namely:
- https://github.com/operator-framework/operator-controller/blob/78b586aa721598e5da4b732572cbc1e93507daa9/internal/controllers/clusterextension_controller.go#L208-L210
- https://github.com/operator-framework/operator-controller/blob/78b586aa721598e5da4b732572cbc1e93507daa9/internal/controllers/clusterextension_controller.go#L220
In order to fix this, this PR removes setting of the Installed status condition and associated status fields from anywhere in the existing code base except when installation has been successful.
This means that the Installed status condition will:
- Be missing until the initial install is successful (this reduces duplication of information from the
Installedstatus condition and theProgressingstatus condition) - Be set to
Trueand populated with the most up-to-date information on successful installation and upgrades.
Reviewer Checklist
- [ ] API Go Documentation
- [ ] Tests: Unit Tests (and E2E Tests, if appropriate)
- [ ] Comprehensive Commit Messages
- [ ] Links to related GitHub Issue(s)
Deploy Preview for olmv1 ready!
| Name | Link |
|---|---|
| Latest commit | 21b2777f15c626cd88eb2b53cefbf809c2c53c57 |
| Latest deploy log | https://app.netlify.com/sites/olmv1/deploys/67094fd4df38de00080f3236 |
| Deploy Preview | https://deploy-preview-1368--olmv1.netlify.app |
| Preview on mobile | Toggle QR Code...Use your smartphone camera to open QR code link. |
To edit notification comments on pull requests, go to your Netlify site configuration.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 74.90%. Comparing base (
78b586a) to head (21b2777). Report is 16 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #1368 +/- ##
==========================================
+ Coverage 74.83% 74.90% +0.06%
==========================================
Files 42 42
Lines 2515 2502 -13
==========================================
- Hits 1882 1874 -8
+ Misses 449 445 -4
+ Partials 184 183 -1
| Flag | Coverage Ξ | |
|---|---|---|
| e2e | 56.59% <ΓΈ> (-0.15%) |
:arrow_down: |
| unit | 52.63% <ΓΈ> (-0.05%) |
:arrow_down: |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
@LalatenduMohanty I could be misremembering, but I was under the impression that merges in this repo default to using the PR title+description for the commit message.
If this is the incorrect understanding, I am happy to squash the commits here and write a better commit message.
@everettraven Here is an an example https://github.com/operator-framework/operator-controller/commit/fba8473cfd98d99ea74972d7e8951de46ec16eea , the PR had two commits , while merging it squashed it in to one commit but it also kept the commit messages from all commits. So if you add a commit message for a commit it wont be lost after the commits are squashed.
Overall the PR looks good to me. Once the we add a improved commit message (as mentioned in my previous comments) I can approve the PR.
I'm not sure I agree with all of these changes. It seems like these move us away from level-based and into edge-based logic.
IMO, we should assess the current state of the world and update our status accordingly on every reconcile. We had an issue in catalogd recently that we had to fix to move from edge-based back to level-based logic.
I think there are some improvements we could make to prevent unnecessary flapping, but it seems counter to the eventually consistent assumptions of Kubernetes for us to make assumptions about what happened in the past and disregard what we're actually seeing in the present.
@joelanford while I understand what you are saying, and don't entirely disagree, where I am coming from with this PR is that we want to achieve the following with the Installed status condition:
- Before an initial installation is successful, we want a user to interpret
InstalledasFalseORUnknown. - After an initial installation is successful, we want a user to interpret
InstalledasTrueregardless of subsequent failures to upgrade - even if it results in a degredation of the initial installation
When I was looking into this, I was originally looking at this in the same lens of updating our status based on our knowledge of the current state of the world, but this seemed to introduce quite a bit of complexity that I wasn't confident would prevent false-negative flapping (i.e saying something isn't installed when it is). More specifically, these are the challenges I recall running into:
- If we fail to fetch an installed bundle, receiving an error from the
InstalledBundleGetter:- How can we be sure that we don't misinform end users when updating the
Installedstatus? - Are we okay with duplicating information on the
InstalledandProgressingstatus conditions?
- How can we be sure that we don't misinform end users when updating the
- If we receive a
nilBundleMetadatafrom theInstalledBundleGetter, which based on https://github.com/operator-framework/operator-controller/blob/e5ecec7bba1a5db51dba3a63cc0637ade4f7bdd3/internal/controllers/clusterextension_controller.go#L483-L491, is possible even if we found a release but it has some unknown status conditions:- Are we sure that we haven't actually attempted an install before? This problem in particular could probably be remedied by making it such that the
InstalledBundleGetterexpectation is that if it finds a release, regardless of state, the bundle metadata is returned.
- Are we sure that we haven't actually attempted an install before? This problem in particular could probably be remedied by making it such that the
- If we "fix" the above point and receive a populated
BundleMetadatafrom theInstalledBundleGetteralongside an error due to "invalid" status:- How do we ensure that the
BundleMetadatawe received is for an initial installation attempt vs a failed upgrade attempt? - If we add release revision to the
BundleMetadatathat gives a bit more insight, but my understanding is that revision is increased even on "upgrade" operations to a failed initial install. How can we be confident that a release revision range corresponds to "initial installation attempt"?
- How do we ensure that the
- To potentially resolve the complexities I've called out above, we could add some kind of caching layer as an additional mechanism for cross-checking the information we've received from our
InstalledBundleGetter(or theInstalledBundleGetteritself uses it), but then:- You add yet another caching layer
- Should the cache be memory based? What happens if the container crashes and we lose that information? Would potentially overwriting the
Installedstatus condition in this case be acceptable? - Should the cache be written to disk to preserve it between potential container crashes/restarts (probably)?
Hopefully my response here shows the complexity involved with managing the Installed condition in the way we've outlined in the RFC when taking a level-based approach. I'm by no means saying one way is right or wrong, but simply calling out the complexities involved.
Going the route of this PR, you resolve the two requirements I shared above by:
- Not setting the
Installedstatus condition until the installation has actually been successful (observed by reaching a specific part of our code). This means that failures up until that point will be reflected in theProgressingcondition. Pairing a non-existentInstalledcondition and a presentProgressingcondition should give and end user enough insight to say "This ClusterExtension has not yet successfully installed, but is progressing towards the installed state". If theInstalledcondition is present and theProgressingcondition is signaling errors an end user should be able to conclude that "This ClusterExtension has been successfully installed, but is currently failing to progress through an upgrade". - Only setting the
Installedstatus condition when we know for sure that an installation or an upgrade has been successful ensures that we don't run the risk of providing a false-negative on theInstalledcondition based on our observed state being incomplete.
I'll re-iterate again, I'm not saying one way is right or wrong. I'm simply calling out that there is significant complexity involved in going the route of setting the Installed condition based on what our controller observes (which could be incomplete) and the approach of this PR has significantly less complexity. There is obviously a trade-off either way, but given the "time crunch" we've been under for shipping v1.0.0 I went with this approach because it:
- Achieves the agreed upon requirements for the
Installedcondition - Is faster to implement and thus allows us to reach our v1.0.0 milestone faster
- Doesn't prohibit us from spending more time on this to get back to a more level-based approach post-v1.0.0
/hold adding a hold as we want to discuss the changes first.
Superseded by #1379