operator-controller icon indicating copy to clipboard operation
operator-controller copied to clipboard

:sparkles: Use error type rather than strings

Open tmshort opened this issue 1 year ago • 1 comments

Description

Reviewer Checklist

  • [ ] API Go Documentation
  • [ ] Tests: Unit Tests (and E2E Tests, if appropriate)
  • [ ] Comprehensive Commit Messages
  • [ ] Links to related GitHub Issue(s)

tmshort avatar May 21 '24 16:05 tmshort

Codecov Report

Attention: Patch coverage is 56.66667% with 13 lines in your changes missing coverage. Please review.

Project coverage is 77.93%. Comparing base (9f0c6a9) to head (928e1ba).

Files Patch % Lines
internal/controllers/common_controller.go 0.00% 8 Missing :warning:
...nternal/controllers/clusterextension_controller.go 77.27% 5 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #878      +/-   ##
==========================================
+ Coverage   76.21%   77.93%   +1.72%     
==========================================
  Files          17       17              
  Lines        1177     1142      -35     
==========================================
- Hits          897      890       -7     
+ Misses        202      175      -27     
+ Partials       78       77       -1     
Flag Coverage Δ
e2e 59.80% <46.66%> (+1.60%) :arrow_up:
unit 60.14% <53.33%> (+2.12%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar May 21 '24 16:05 codecov[bot]

@m1kola (github doesn't support comment threads), I do believe we need to distinguish between Resolution errors, which likely involve errors in the ClusterExtension resource, vs. other errors that can occur on the cluster. In the first case, the user may need to fix it, and can likely do it themselves. In the other case, it may be a temporal error on the cluster, or something else that the user cannot handle.

tmshort avatar May 29 '24 15:05 tmshort

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-merge-robot avatar May 30 '24 19:05 openshift-merge-robot

Deploy Preview for olmv1 ready!

Name Link
Latest commit 928e1baeb73a5948332d1da0a7ee1c5bbaa62099
Latest deploy log https://app.netlify.com/sites/olmv1/deploys/665f2fbc41879700082961cb
Deploy Preview https://deploy-preview-878--olmv1.netlify.app
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

netlify[bot] avatar May 30 '24 20:05 netlify[bot]

@m1kola (github doesn't support comment threads), I do believe we need to distinguish between Resolution errors, which likely involve errors in the ClusterExtension resource, vs. other errors that can occur on the cluster. In the first case, the user may need to fix it, and can likely do it themselves. In the other case, it may be a temporal error on the cluster, or something else that the user cannot handle.

@tmshort handleResolutionErrors only handles resolution errors (only errors returned from r.resolve(ctx, *ext)).

How I see it:

  • If something is wrong with ClusterExtension spec (e.g. we can't find a package) - we want to make sure that the error is very human readable and appears on the condition. Human can take an action based on the error message.
  • If it is something temporal (e.g. we timeout fetching bundles from catalogs) - then we should also put errors into the condition. Ideally adding some context for troubleshooting (e.g. not just "request timed out", but "error fetching bundles: request timed out", etc). Eventually temporal errors will be resolved during one of the attempts to reconcile.
  • If it is something permanent - same as temporal. I believe at the moment we do not have any permanent/terminal conditions. So every error which appears to be "permanent" is logically just a temporal issue which has not been resolved yet.

In all the these cases we can fail resolution condition and write error into the condition message. I don't think that setting reason to InstallationStatusUnknown aids UX.

There is a chance that I'm missing something (e.g. maybe you expect programmatic clients to take advantage of a distinct combination of reasons and take some action?). Happy to discuss it further here or jump on a quick call if you like.

m1kola avatar Jun 03 '24 09:06 m1kola

@ankitathomas's PR is super relevant to the discussion of temporal/transient vs permanent. https://github.com/operator-framework/operator-controller/pull/842

I think the Progressing condition that we've discussed should make it much easier for us to tell users "we're going to try again" vs "we're not trying again, there is something you need to fix"

joelanford avatar Jun 03 '24 14:06 joelanford

@m1kola

handleResolutionErrors only handles resolution errors (only errors returned from r.resolve(ctx, *ext)).

Non-resolution (e.g. timeout) errors can occur in r.resolve(), and thus any kind of errors are handled by handleResolutionErrors. Would it make more sense to remove the function (since it's a lot smaller now), and put the functionality into the main function?

This is just trying to clean up error processing. It's not meant to clean up the status conditions (but it has to do something with status conditions). It looks as though #842 changed title, so it's doing something different now, issue #880 is meant to clean up the status conditions. The hope is that we can have smaller changes (which is why I'm trying to keep this small).

tmshort avatar Jun 03 '24 15:06 tmshort

This PR is focused on removing the string checks for error types, and all the aggregated errors that were being used - issues noted in the big Helm PR #846. It was not meant to fix all the problems with status conditions.

tmshort avatar Jun 03 '24 15:06 tmshort