cluster-api-provider-openstack icon indicating copy to clipboard operation
cluster-api-provider-openstack copied to clipboard

🐛 openstackmachine: do not set transient error message and reason

Open seanschneeweiss opened this issue 3 years ago • 3 comments

What this PR does / why we need it:

With this PR we set failureReason & failureMessage of the OpenStackMachine for the following three cases:

  • Instance not found
  • Instance in ERROR state
    • as pointed out in https://github.com/kubernetes-sigs/cluster-api-provider-openstack/issues/1116#issuecomment-1029792653, unsure if we should really stop reconciliation. However, keeping this to remain with the same behavior.
  • json marshalling or unmarshalling error in the network status

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #1116

Special notes for your reviewer:

  • Introduced separate functions for setting the failureReason& failureMessage to have a choice when setting the failureReason compared to just setting UpdateMachineError for everything.
  • openstackcluster will pe part of a separate PR
  1. Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • [x] squashed commits
  • if necessary:
    • [ ] includes documentation
    • [ ] adds unit tests

/hold Sean Schneeweiss [email protected], Mercedes-Benz Tech Innovation GmbH, Provider Information

seanschneeweiss avatar Jul 18 '22 23:07 seanschneeweiss

Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!

Name Link
Latest commit b9689c61933b5bd5947f6518a1ef345d75645ce9
Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-cluster-api-openstack/deploys/636a0f0ad771bd000923fea3
Deploy Preview https://deploy-preview-1301--kubernetes-sigs-cluster-api-openstack.netlify.app
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

netlify[bot] avatar Jul 18 '22 23:07 netlify[bot]

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: seanschneeweiss

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • ~~OWNERS~~ [seanschneeweiss]

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Jul 18 '22 23:07 k8s-ci-robot

WDYT about catching authentication errors for all cases and reconcile them again? @bavarianbidi

Authentication failed relates to error code 401. https://github.com/gophercloud/gophercloud/blob/0226ea5963656bcd435ab17af475e2345e523fcb/errors.go#L181-L183

It seems that only error codes >= 500 are considered retryable errors (as of now). https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/505b5d27c834e5929e5f076cdfe47ff6b97de381/pkg/utils/errors/errors.go#L26-L33

I'd say that IsRetryable should be extended to also include some of the errors from https://github.com/gophercloud/gophercloud/blob/master/errors.go.

E.g.

  • 408: The server timed out waiting for the request
  • 429: Too many requests have been sent in a given amount of time. Pause requests, wait up to one minute, and try again.
  • and potentially 401: Authentication failed (as per your recommendation).

I think it is better to make the OpenStack calls via gophercloud retryable then filtering when setting the failureMessage/failureReason.

I'm no expert about OpenStack error codes and how gophercloud handles them - hopefully someone else can support me in this.

seanschneeweiss avatar Oct 21 '22 09:10 seanschneeweiss

I'd say that IsRetryable should be extended to also include some of the errors from gophercloud/gophercloud@master/errors.go.

E.g.

  • 408: The server timed out waiting for the request
  • 429: Too many requests have been sent in a given amount of time. Pause requests, wait up to one minute, and try again.
  • and potentially 401: Authentication failed (as per your recommendation).

I think it is better to make the OpenStack calls via gophercloud retryable then filtering when setting the failureMessage/failureReason.

I'm no expert about OpenStack error codes and how gophercloud handles them - hopefully someone else can support me in this.

sorry @mdbooth / @jichenjc for direct mentioning.

I'm fine with the proposed solution from @seanschneeweiss regarding extending IsRetryable. But i'm not sure if we oversee something. WDYT?

(FYI: this issue is currently the only one which is marked for milestone 0.6.4)

bavarianbidi avatar Oct 25 '22 10:10 bavarianbidi

@bavarianbidi Let's discuss the modification of IsRetryable in a separate issue/PR. Would you agree?

seanschneeweiss avatar Nov 04 '22 14:11 seanschneeweiss

/lgtm /hold

The /hold is in case you want to address the review comments. Feel free to remove it.

mdbooth avatar Nov 04 '22 16:11 mdbooth

I just want to say thank you for working on this @seanschneeweiss and reviewers! We are really looking forward to this feature.

oscr avatar Nov 05 '22 14:11 oscr

@mdbooth thank you for the great review. All suggestions were added. Would be great if you could check again.

seanschneeweiss avatar Nov 06 '22 09:11 seanschneeweiss

@bavarianbidi Let's discuss the modification of IsRetryable in a separate issue/PR. Would you agree?

@seanschneeweiss sure, i'm totally fine with that.

bavarianbidi avatar Nov 07 '22 05:11 bavarianbidi

/hold cancel

seanschneeweiss avatar Nov 08 '22 08:11 seanschneeweiss

/retest

Edit: I think there is an issue with Prow currently

image

tobiasgiese avatar Nov 08 '22 08:11 tobiasgiese

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lentzi90, mdbooth, seanschneeweiss

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • ~~OWNERS~~ [mdbooth,seanschneeweiss]

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Nov 08 '22 08:11 k8s-ci-robot

/retest

seanschneeweiss avatar Nov 08 '22 08:11 seanschneeweiss

/retest

seanschneeweiss avatar Nov 08 '22 17:11 seanschneeweiss

/lgtm

tobiasgiese avatar Nov 09 '22 09:11 tobiasgiese

I think we should cherry pick this since we have it in the v0.6.4 milestone. /cherry-pick release-0.6

lentzi90 avatar Nov 09 '22 10:11 lentzi90

@lentzi90: #1301 failed to apply on top of branch "release-0.6":

Applying: openstackmachine: do not set transient error message and reason
Using index info to reconstruct a base tree...
A	api/v1alpha6/openstackmachine_types.go
M	controllers/openstackmachine_controller.go
M	controllers/openstackmachine_controller_test.go
Falling back to patching base and 3-way merge...
Auto-merging controllers/openstackmachine_controller_test.go
Auto-merging controllers/openstackmachine_controller.go
CONFLICT (content): Merge conflict in controllers/openstackmachine_controller.go
Auto-merging api/v1alpha5/openstackmachine_types.go
CONFLICT (content): Merge conflict in api/v1alpha5/openstackmachine_types.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 openstackmachine: do not set transient error message and reason
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

I think we should cherry pick this since we have it in the v0.6.4 milestone. /cherry-pick release-0.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Oh that was not so straight forward

lentzi90 avatar Nov 09 '22 10:11 lentzi90