operator-controller
operator-controller copied to clipboard
:sparkles: Use error type rather than strings
Description
Reviewer Checklist
- [ ] API Go Documentation
- [ ] Tests: Unit Tests (and E2E Tests, if appropriate)
- [ ] Comprehensive Commit Messages
- [ ] Links to related GitHub Issue(s)
Codecov Report
Attention: Patch coverage is 56.66667% with 13 lines in your changes missing coverage. Please review.
Project coverage is 77.93%. Comparing base (
9f0c6a9) to head (928e1ba).
| Files | Patch % | Lines |
|---|---|---|
| internal/controllers/common_controller.go | 0.00% | 8 Missing :warning: |
| ...nternal/controllers/clusterextension_controller.go | 77.27% | 5 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #878 +/- ##
==========================================
+ Coverage 76.21% 77.93% +1.72%
==========================================
Files 17 17
Lines 1177 1142 -35
==========================================
- Hits 897 890 -7
+ Misses 202 175 -27
+ Partials 78 77 -1
| Flag | Coverage Δ | |
|---|---|---|
| e2e | 59.80% <46.66%> (+1.60%) |
:arrow_up: |
| unit | 60.14% <53.33%> (+2.12%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
@m1kola (github doesn't support comment threads), I do believe we need to distinguish between Resolution errors, which likely involve errors in the ClusterExtension resource, vs. other errors that can occur on the cluster. In the first case, the user may need to fix it, and can likely do it themselves. In the other case, it may be a temporal error on the cluster, or something else that the user cannot handle.
PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Deploy Preview for olmv1 ready!
| Name | Link |
|---|---|
| Latest commit | 928e1baeb73a5948332d1da0a7ee1c5bbaa62099 |
| Latest deploy log | https://app.netlify.com/sites/olmv1/deploys/665f2fbc41879700082961cb |
| Deploy Preview | https://deploy-preview-878--olmv1.netlify.app |
| Preview on mobile | Toggle QR Code...Use your smartphone camera to open QR code link. |
To edit notification comments on pull requests, go to your Netlify site configuration.
@m1kola (github doesn't support comment threads), I do believe we need to distinguish between Resolution errors, which likely involve errors in the
ClusterExtensionresource, vs. other errors that can occur on the cluster. In the first case, the user may need to fix it, and can likely do it themselves. In the other case, it may be a temporal error on the cluster, or something else that the user cannot handle.
@tmshort handleResolutionErrors only handles resolution errors (only errors returned from r.resolve(ctx, *ext)).
How I see it:
- If something is wrong with
ClusterExtensionspec (e.g. we can't find a package) - we want to make sure that the error is very human readable and appears on the condition. Human can take an action based on the error message. - If it is something temporal (e.g. we timeout fetching bundles from catalogs) - then we should also put errors into the condition. Ideally adding some context for troubleshooting (e.g. not just "request timed out", but "error fetching bundles: request timed out", etc). Eventually temporal errors will be resolved during one of the attempts to reconcile.
- If it is something permanent - same as temporal. I believe at the moment we do not have any permanent/terminal conditions. So every error which appears to be "permanent" is logically just a temporal issue which has not been resolved yet.
In all the these cases we can fail resolution condition and write error into the condition message. I don't think that setting reason to InstallationStatusUnknown aids UX.
There is a chance that I'm missing something (e.g. maybe you expect programmatic clients to take advantage of a distinct combination of reasons and take some action?). Happy to discuss it further here or jump on a quick call if you like.
@ankitathomas's PR is super relevant to the discussion of temporal/transient vs permanent. https://github.com/operator-framework/operator-controller/pull/842
I think the Progressing condition that we've discussed should make it much easier for us to tell users "we're going to try again" vs "we're not trying again, there is something you need to fix"
@m1kola
handleResolutionErrorsonly handles resolution errors (only errors returned fromr.resolve(ctx, *ext)).
Non-resolution (e.g. timeout) errors can occur in r.resolve(), and thus any kind of errors are handled by handleResolutionErrors. Would it make more sense to remove the function (since it's a lot smaller now), and put the functionality into the main function?
This is just trying to clean up error processing. It's not meant to clean up the status conditions (but it has to do something with status conditions). It looks as though #842 changed title, so it's doing something different now, issue #880 is meant to clean up the status conditions. The hope is that we can have smaller changes (which is why I'm trying to keep this small).
This PR is focused on removing the string checks for error types, and all the aggregated errors that were being used - issues noted in the big Helm PR #846. It was not meant to fix all the problems with status conditions.