cluster-api-provider-openstack

Handle transient OpenStack API errors for OpenStackMachines

Open seanschneeweiss opened this issue 3 years ago • 27 comments

/kind feature

OpenStackMachine reconciliation stops (forever) on transient and non-transient OpenStack API errors

  • reconciler does not differentiate between
    • transient API errors, e.g., InternalServerError, connectivity problems
    • non-transient errors, e.g., OpenStack server in ERROR state
  • this leads to a stopped reconciliation, even though a subsequent reconciliation would most likely succeed

Describe the solution you'd like
I would like CAPO to handle transient OpenStack API errors for OpenStackMachine deployments.

Anything else you would like to add:

Sean Schneeweiss [email protected], Daimler TSS GmbH, Provider Information

seanschneeweiss avatar Jan 18 '22 14:01 seanschneeweiss

+1 to make this happen, we should handle transient API errors, e.g., InternalServerError or connectivity problems, instead of trying once and then giving up

the question is how often we reconcile (I think k8s has some kind of mechanism but I don't know the details, maybe 1s -> 2s -> 4s -> 8s -> 16s etc.?), so that we can avoid a flood of API calls to the cloud?
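For reference, my understanding is that controller-runtime's default workqueue rate limiter already does roughly this: per-item exponential backoff combined with an overall rate limit, so repeated failures back off while the total API call rate stays bounded. A minimal sketch of what I believe the default is built from (the exact values may differ between client-go versions):

package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Roughly what workqueue.DefaultControllerRateLimiter() combines:
	// per-item exponential backoff (5ms doubling up to 1000s) and an
	// overall token bucket (10 qps, burst 100).
	limiter := workqueue.NewMaxOfRateLimiter(
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)

	// Repeated failures for the same item back off exponentially.
	item := "openstackmachine/default/my-machine"
	for i := 0; i < 5; i++ {
		fmt.Println(limiter.When(item)) // 5ms, 10ms, 20ms, 40ms, ...
	}
	limiter.Forget(item) // reset once the item reconciles successfully
}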

jichenjc avatar Jan 19 '22 02:01 jichenjc

What's the problem with just re-running the reconciliation? The reason I ask is that I'd very much like to split machine creation into multiple parts (e.g. root volume, ports, instance) and reconcile them somewhat independently. The primary goal here would be to make failure handling simpler and more robust by aborting and restarting in more cases. So in the case of a transient failure of a port creation, for example, you would abort the whole reconcile and retry after a back-off. On retry you wouldn't have to recreate the root volume because that's already complete. You also wouldn't have to recreate any ports that were previously successfully created, because you wouldn't have had to delete ports when aborting the previous reconcile: progress is reflected in the status object.

So I'd like to systematically avoid the requirement to retry unreliable operations everywhere we do them by breaking creation up into multiple components, and ensuring each component is idempotent.

However, I very much agree that we don't want to put transient errors in FailureMessage. That's problematic.

mdbooth avatar Jan 21 '22 12:01 mdbooth

What's the problem with just re-running the reconciliation?

You mean by the controller and not manually, right?

kopiczko avatar Jan 21 '22 14:01 kopiczko

Yes, just return the transient error from the reconciler, log it (but don't write it to FailureMessage), then re-run the reconciler after a back-off.
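As a rough, self-contained sketch of that contract (not the actual CAPO controller code): returning a non-nil error from Reconcile makes controller-runtime put the request back on the workqueue with rate-limited backoff, whereas a terminal failure would instead be written to the status and return nil so no retry happens.

package main

import (
	"context"
	"errors"
	"fmt"

	ctrl "sigs.k8s.io/controller-runtime"
)

type demoReconciler struct{}

// Transient failure path: log it, do NOT write it to FailureMessage, and
// return the error; the request is retried later with exponential backoff.
func (r demoReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	transient := errors.New("OpenStack API returned 503")
	return ctrl.Result{}, fmt.Errorf("reconciling %s: %w", req.NamespacedName, transient)
}

func main() {
	_ = demoReconciler{} // wiring it up to a manager is omitted in this sketch
	fmt.Println("returning an error from Reconcile triggers a rate-limited requeue")
}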

mdbooth avatar Jan 21 '22 14:01 mdbooth

I've started throwing together how this might look for root volume creation in https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1118. Context:

The return type of these 2 reconcile functions is (success, *OpenStackError). success==true indicates to the controller that the reconciliation completed, false means the resource was not yet created and the function needs to be called again. However, note that if we created ports the same way we could create ports while waiting for the root volume to become available.

OpenStackError is a wrapper type which indicates to the controller whether an error is fatal or not. The controller will mark the machine failed iff the error is fatal, otherwise it will log the error and retry after a back-off.

The reconcile methods update status but don't persist changes. The controller is left to patch the updated status object in an efficient manner.
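To make that concrete, here is a very rough, self-contained sketch of the shape described above; the names and details are illustrative and may differ from the PR:

package main

import (
	"errors"
	"fmt"
)

// OpenStackError carries the underlying error plus a flag telling the
// controller whether it is fatal (illustrative name, see the PR for details).
type OpenStackError struct {
	err   error
	fatal bool
}

func (e *OpenStackError) Error() string { return e.err.Error() }
func (e *OpenStackError) Unwrap() error { return e.err }
func (e *OpenStackError) IsFatal() bool { return e.fatal }

func transientError(err error) *OpenStackError { return &OpenStackError{err: err} }
func fatalError(err error) *OpenStackError     { return &OpenStackError{err: err, fatal: true} }

// reconcileRootVolume stands in for one of the per-component reconcile
// functions returning (success, *OpenStackError): success==false with a nil
// error simply means "not ready yet, call me again".
func reconcileRootVolume(volumeState string) (bool, *OpenStackError) {
	switch volumeState {
	case "available":
		return true, nil
	case "creating":
		return false, nil
	case "error":
		return false, fatalError(errors.New("root volume entered error state"))
	default:
		return false, transientError(fmt.Errorf("unexpected volume state %q", volumeState))
	}
}

func main() {
	for _, state := range []string{"creating", "available", "error"} {
		done, osErr := reconcileRootVolume(state)
		switch {
		case osErr != nil && osErr.IsFatal():
			fmt.Println(state, "-> mark the machine failed") // only here would we set FailureReason/Message
		case osErr != nil:
			fmt.Println(state, "-> log and retry after a back-off")
		case !done:
			fmt.Println(state, "-> in progress, requeue")
		default:
			fmt.Println(state, "-> done")
		}
	}
}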

Unfortunately I don't know when I'm going to get time to work on this properly, but this is an indication of what I want to do in this space. Would a solution like this address the problems you are seeing?

mdbooth avatar Jan 21 '22 15:01 mdbooth

Yes, just return the transient error from the reconciler, log it (but don't write it to FailureMessage), then re-run the reconciler after a back-off.

That is exactly the way I was going for.

Thank you very much for the draft, really appreciate the effort. I think that it has the right outcome, but I had hoped to make it simpler than always having to return a fatalError or transientError.

Looking at CAPA, they check the InstanceState and if a non-transient error occurs they set the FailureReason and FailureMessage.

And the reconciliation checks at the beginning (as we do now too), if the machine has failed.

Looks like we are already doing something like that https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/8ae251502dbbcdd5685521272c7b4a5659152786/controllers/openstackmachine_controller.go#L338-L341 The problem just seems to be that we call the function handleUpdateMachineError way too often. So I guess the easy solution would be to remove the handleUpdateMachineError call in all places except the ones where it is valid to set the ErrorReason.

PS: Why is it called ErrorReason and not FailureReason like everywhere else? :smile: I guess we could change that.

seanschneeweiss avatar Jan 26 '22 10:01 seanschneeweiss

Coincidentally, we were discussing exactly this issue for OpenShift just the other day, and this was our suggestion also. Our reasoning was that the error state for a machine is terminal, but the error state for an OpenStack server can be transient, not least because an operator can fix it in many cases.

I agree that InstanceStateError specifically should not put the machine in the error state.

However, we should check what the backoff strategy is for reconciliation. If it was every 10 seconds, for example, this could easily result in an API storm if a partial OpenStack outage resulted in a large number of servers in an error state.

Incidentally, something we also discussed for OpenShift (and may implement) is some kind of 'user-intervention-required' annotation on the Machine. So if the server enters an error state we don't mark the Machine as failed, but we do set an annotation on it which prevents any further attempts at reconciliation. The user can manually remove this annotation and we will attempt to reconcile it again, replacing the annotation if it is still in an error state. I'd be very interested to know how you folks would feel about this kind of mechanism.

mdbooth avatar Jan 27 '22 17:01 mdbooth

If it was every 10 seconds, for example, this could easily result in an API storm if a partial OpenStack outage resulted in a large number of servers in an error state.

That's my concern too; we should avoid too many API calls.

So if the server enters an error state we don't mark the Machine as failed, but we do set an annotation on it which prevents any further attempts at reconciliation. The user can manually remove this annotation and we will attempt to reconcile it again, replacing the annotation if it is still in an error state.

I think it's a reasonable solution. I am only curious whether removing the annotation would retrigger the reconcile, or whether we have to wait for the next timed cycle?

jichenjc avatar Jan 28 '22 03:01 jichenjc

So if the server enters an error state we don't mark the Machine as failed, but we do set an annotation on it which prevents any further attempts at reconciliation. The user can manually remove this annotation and we will attempt to reconcile it again, replacing the annotation if it is still in an error state.

I think it's a reasonable solution. I am only curious whether removing the annotation would retrigger the reconcile, or whether we have to wait for the next timed cycle?

Removing the annotation would trigger immediate reconciliation as it's a change to the object.

I wonder if we should discuss this idea in cluster-api if it's something we're likely to consider seriously. Perhaps they already have thoughts in this area which make this moot. If not it would probably be best to be consistent across providers with an approach like this.

mdbooth avatar Jan 28 '22 09:01 mdbooth

Setting the paused annotation would be a quick fix to get this kind of behavior. The controller already skips reconciliation when cluster.x-k8s.io/paused exists.

However, I'm currently leaning more towards setting the errorReason only for those three cases:

  • InstanceStateError https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/f83520c503425512a2efdf0ef55ad4cc85a30902/controllers/openstackmachine_controller.go#L338-L341
  • Unknown instance state https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/f83520c503425512a2efdf0ef55ad4cc85a30902/controllers/openstackmachine_controller.go#L346-L350
  • Instance not found https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/f83520c503425512a2efdf0ef55ad4cc85a30902/controllers/openstackmachine_controller.go#L312-L315

All other occurrences that previously set the errorReason would be adapted to just return an error, e.g.

// remove call of `handleUpdateMachineError`
return ctrl.Result{}, errors.Wrap(err, "reason for failure")

These three cases are very unlikely to be fixed automatically. They can be resolved by:

  • manually fixing configuration, fixing OpenStack problem, ...
  • then deleting Machine / OpenStackMachine
  • CAPI will ensure that the machine is replaced by a new one

In the long run, CAPI will handle the deletion of broken machines with its remediation logic.

If a Machine fails for any reason (if the FailureReason is set), the Machine will be remediated immediately source

seanschneeweiss avatar Feb 01 '22 22:02 seanschneeweiss

However, I'm currently leaning more towards setting the errorReason only for those three cases:

  • InstanceStateError https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/f83520c503425512a2efdf0ef55ad4cc85a30902/controllers/openstackmachine_controller.go#L338-L341
  • Unknown instance state https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/f83520c503425512a2efdf0ef55ad4cc85a30902/controllers/openstackmachine_controller.go#L346-L350
  • Instance not found https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/f83520c503425512a2efdf0ef55ad4cc85a30902/controllers/openstackmachine_controller.go#L312-L315

Interestingly, 2 of these 3 cases directly relate to states which are sometimes recoverable and which we've had real-world customer reports about.

  • InstanceStateError

A VM in an error state after initial creation is almost always recoverable if you're sufficiently determined. It sometimes requires operator action, but depending on the cause hard rebooting it is frequently enough to bring it back (this is one of the things hard reboot is specifically intended to do).

If the instance entered the error state during creation or shortly after, it's probably not worth the effort, but there are cases where recovering an errored machine is worth it, and users certainly sometimes want to have this choice.

  • Unknown instance state

Actually we MUST NOT assume this is fatal. This could cause us to mark all machines as failed if Nova introduces a new instance state. We also specifically don't want to mark machines failed which are migrating, shutoff, paused, whatever the post-resize confirm state is, etc, etc. Better not to try to enumerate everything and instead only handle the states we're specifically interested in.

  • Instance not found

This is a really interesting one that I've seen at least 2 reports of (I think I also saw it on vSphere?). An operator rotating credentials accidentally used credentials from the wrong environment (in one specific case it was their staging environment). This caused all their machines to be marked failed because (the wrong) OpenStack positively reported that the machines did not exist.

I've actually been thinking of ways to try to mitigate this. I think we want to store a project ID in the OpenStackMachineStatus, and refuse to reconcile if the current project ID does not correspond.

This one specifically is currently my best argument for a 'recoverable' failed state. Putting the machine in failed is certainly correct in this instance. Unless the user has done something dumb, this machine is not coming back. However, if the user did do something dumb, it would be nice to give them a way out of it. (And also we should try to prevent them from doing it in future.)

Although this one specifically can (and should) be mitigated, I doubt this is the only way a temporary human or technical failure may mark machines failed which are not truly failed.
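A minimal sketch of that project ID guard, with entirely hypothetical field and helper names:

package main

import (
	"errors"
	"fmt"
)

// Hypothetical status field: remember which OpenStack project the instance
// was created in.
type OpenStackMachineStatus struct {
	ProjectID string
}

// checkProjectID refuses to reconcile if the project behind the current
// credentials is not the one the machine was created in, so a credentials
// mix-up can no longer positively report the instance as missing and mark
// the machine failed.
func checkProjectID(status *OpenStackMachineStatus, currentProjectID string) error {
	if status.ProjectID == "" {
		status.ProjectID = currentProjectID // first reconcile with the new field: record it
		return nil
	}
	if status.ProjectID != currentProjectID {
		return errors.New("credentials point at a different project than this machine was created in; refusing to reconcile")
	}
	return nil
}

func main() {
	status := &OpenStackMachineStatus{}
	fmt.Println(checkProjectID(status, "prod-project-id"))    // <nil>, records the project
	fmt.Println(checkProjectID(status, "staging-project-id")) // refuses to reconcile
}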

mdbooth avatar Feb 02 '22 09:02 mdbooth

That is very valuable input, thank you. Sounds very reasonable too. So to summarize, the only cases where we would set the errorReason and errorMessage on the OpenStackMachine would be:

  • InstanceStateError
  • Instance not found

If the VM can be fixed manually, e.g. by hard rebooting, the user can continue the reconciliation of the OpenStackMachine by patching the status.

A nice tool to do that:

kubectl krew update
kubectl krew install edit-status
kubectl edit-status -n my-cluster osm myBrokenOpenStackMachine

Or patch by using curl after running kubectl proxy in a separate shell:

curl -XPATCH "http://localhost:8001/apis/infrastructure.cluster.x-k8s.io/v1alpha4/namespaces/my-cluster/openstackmachines/myBrokenOpenStackMachine/status" -H "Content-Type: application/merge-patch+json" -d '{
   "status": {
      "errorMessage": null,
      "errorReason": null
   }
}'

Just an idea: A different approach would be to set the paused annotation cluster.x-k8s.io/paused as it can be easily removed after the OpenStackMachine was fixed. Also we could set a value that indicates which controller added the annotation. cluster.x-k8s.io/paused: "information why and who added the annotation"
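A small sketch of what that could look like; the annotation key is the existing cluster.x-k8s.io/paused one (clusterv1.PausedAnnotation in real code), while the reason format and helper here are just an idea:

package main

import "fmt"

// pausedAnnotation mirrors cluster.x-k8s.io/paused, which CAPI/CAPO
// controllers already honour by skipping reconciliation.
const pausedAnnotation = "cluster.x-k8s.io/paused"

// pauseWithReason records who paused the object and why in the annotation
// value, so a human can inspect it, fix the server, and simply remove the
// annotation to resume reconciliation.
func pauseWithReason(annotations map[string]string, controller, reason string) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[pausedAnnotation] = fmt.Sprintf("paused by %s: %s", controller, reason)
	return annotations
}

func main() {
	ann := pauseWithReason(nil, "capo-controller-manager", "instance is in ERROR state, operator intervention required")
	fmt.Println(ann[pausedAnnotation])
	// Resuming: kubectl annotate openstackmachine <name> cluster.x-k8s.io/paused-
}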

seanschneeweiss avatar Feb 04 '22 07:02 seanschneeweiss

Just an idea: A different approach would be to set the paused annotation cluster.x-k8s.io/paused as it can be easily removed after the OpenStackMachine was fixed. Also we could set a value that indicates which controller added the annotation. cluster.x-k8s.io/paused: "information why and who added the annotation"

This or something like it does sound like a better solution than asking the user to patch the status. However, I think we should discuss this in CAPI so we don't end up being inconsistent. For now I think you're right that we should use the failed status we have.

For "Instance not found" we should certainly set failed. For InstanceStateError... I remain undecided whether we should continue to retry, but for now it's certainly not a change in behaviour.

There might be one other useful case for a failed machine: a non-retryable error during resource creation. This typically indicates invalid parameters, so retrying will not achieve anything. A counter argument may be that the invalid parameter was a reference to a resource which has not yet been created and may exist when we retry. However, certainly for resources managed by the controller we don't allow this to happen.

mdbooth avatar Feb 04 '22 09:02 mdbooth

a non-retryable error during resource creation. This typically indicates invalid parameters, so retrying will not achieve anything.

I think invalid params mostly lead to an API-layer validation failure and never enter the resource creation stage? Usually an error state of the resource comes from a compute/volume layer error, e.g. some agent failed or some action can't be fulfilled... so retrying later on seems like a valid approach.

For "Instance not found" we should certainly set failed.

Could someone accidentally deleting the VM cause the VM to be in the not-found state? Should we re-create it in our reconcile logic? I understand this will cause some data inconsistency, e.g. the UUID, but that should not be a blocker?

jichenjc avatar Feb 07 '22 03:02 jichenjc

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 08 '22 03:05 k8s-triage-robot

/remove-lifecycle stale

bavarianbidi avatar May 12 '22 18:05 bavarianbidi

CAPZ reconciles machines if they are in a transient error state:

https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/0c85feca9ee6f8ba4e566b959bd50981d80a79eb/controllers/azuremachine_controller.go#L304-L324

bavarianbidi avatar Jun 20 '22 06:06 bavarianbidi

CAPZ reconciles machines if they are in a transient error state:

I like how CAPZ is handling transient errors.


Just for reference: CAPA does not reconcile transient errors

https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/823459d83416f6d3b063ac9496be4879141ed66d/controllers/awsmachine_controller.go#L434-L445

https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/823459d83416f6d3b063ac9496be4879141ed66d/controllers/awsmachine_controller.go#L541-L547

tobiasgiese avatar Jun 20 '22 06:06 tobiasgiese

What are the next steps on this, @mdbooth / @jichenjc / @seanschneeweiss / @tobiasgiese?

Should we implement the transient error handling for OpenStackMachines similar to CAPZ, or are you planning to discuss this topic with the CAPI folks first?

bavarianbidi avatar Jun 27 '22 09:06 bavarianbidi

I like the CAPZ idea and I think we can follow CAPZ... the question becomes how to define transient vs. persistent errors.
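As a starting point for drawing that line, here is a rough sketch based on the HTTP status code gophercloud reports; which codes count as transient is exactly the open question, and the gophercloud error types used here are an assumption about the version CAPO vendors:

package main

import (
	"errors"
	"fmt"
	"net"

	"github.com/gophercloud/gophercloud"
)

// isTransient treats 5xx responses and network timeouts as retryable and
// everything else (400/401/403/404/409, instance ERROR state, ...) as
// persistent. Other codes such as 429 could be added to the transient side.
func isTransient(err error) bool {
	var e500 gophercloud.ErrDefault500
	var e503 gophercloud.ErrDefault503
	if errors.As(err, &e500) || errors.As(err, &e503) {
		return true
	}
	// Connectivity problems are also worth retrying.
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true
	}
	return false
}

func main() {
	fmt.Println(isTransient(gophercloud.ErrDefault500{})) // true  -> requeue with backoff
	fmt.Println(isTransient(gophercloud.ErrDefault404{})) // false -> candidate for FailureReason
}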

jichenjc avatar Jun 28 '22 03:06 jichenjc

I'm out of office this week, so if anyone finds time this week already, I'm happy to join them next week. Otherwise I can start on this after I'm back.

seanschneeweiss avatar Jun 28 '22 07:06 seanschneeweiss

Hi. I am reading this thread trying to figure out how I can reset the OSM status today. We had an OpenStack API outage, some compute nodes have been rebooted and several OSMs have been detected as SHUTOFF.

Status:
  Addresses:
    Address:       1.1.1.1
    Type:          InternalIP
  Error Message:   OpenStack instance state "SHUTOFF" is unexpected
  Error Reason:    UpdateError
  Instance State:  SHUTOFF
  Ready:           true

The instances have been recovered and are now up but the OpenStackMachines are stuck in this error state.

I am curious if there is any way to reconcile the status of those without deleting the Machines and having them recreated?

cunningr avatar Jul 06 '22 12:07 cunningr

@cunningr We started to use the kubectl patch --subresource='status' command to remove failureReason and failureMessage from CRs.

The steps from our runbook:

  • Ensure that your kubectl version is >= 1.24 since the script below uses the --subresource flag to patch the status part of CRs.
  • Copy/paste the script below to remove-errors.sh
  • Run it
bash remove-errors.sh <namespace-of-cluster> <name-of-cluster>

#!/usr/bin/env bash

set -o errexit
set -o nounset
set -o pipefail

function annotate_machines_to_disable_remediation(){
    local namespace="$1"
    kubectl -n "$namespace" annotate machine --all "cluster.x-k8s.io/skip-remediation"=""
}

function deannotate_machines_to_enable_remediation(){
    local namespace="$1"
    kubectl -n "$namespace" annotate machine --all "cluster.x-k8s.io/skip-remediation"-
}

function clean_all_machines(){
    local namespace="$1" 

    machines=$(kubectl get machine -n "$namespace" -o=custom-columns=NAME:.metadata.name --no-headers)
    for machine in $machines
    do
        clean_machine "$namespace" "$machine"
    done
}

function clean_all_openstackmachines(){
    local namespace="$1"

    openstackmachines=$(kubectl get openstackmachine -n "$namespace" -o=custom-columns=NAME:.metadata.name --no-headers)
    for openstackmachine in $openstackmachines
    do
        clean_openstack_machine "$namespace" "$openstackmachine"
    done
}

function clean_openstack_machine(){
    local namespace="$1"
    local name="$2"

    kubectl patch openstackmachine  "$name" -n "$namespace" \
        -p='{"status":{"failureReason":null, "failureMessage":null}}' \
        --subresource='status' \
        --type="merge"
}

function clean_machine(){
    local namespace="$1"
    local name="$2"

    kubectl patch machine  "$name" -n "$namespace" \
        -p='{"status":{"failureReason":null, "failureMessage":null, "phase":null}}' \
        --subresource='status' \
        --type="merge"
}

function clean_openstack_cluster(){
    local namespace="$1"
    local name="$2"

    kubectl patch openstackcluster  "$name" -n "$namespace" \
        -p='{"status":{"failureReason":null, "failureMessage":null}}' \
        --subresource='status' \
        --type="merge"
}

function clean_cluster(){
    local namespace="$1"
    local name="$2"

    kubectl patch cluster "$name" -n "$namespace" \
        -p='{"status":{"failureReason":null, "failureMessage":null, "phase":null}}' \
        --subresource='status' \
        --type="merge"
}


namespace="$1"
cluster="$2"

annotate_machines_to_disable_remediation "$namespace"
clean_all_openstackmachines "$namespace"
clean_all_machines "$namespace"
clean_openstack_cluster "$namespace" "$cluster"
clean_cluster "$namespace" "$cluster"
deannotate_machines_to_enable_remediation "$namespace"

erkanerol avatar Jul 06 '22 12:07 erkanerol

Thank you @erkanerol! Apologies but I am a little new to Cluster API.

Looking at the script above, we seem to be modifying all machines and not just the target cluster? Our CAPI cluster is managing multiple workload clusters and I would rather we could target just a single cluster for remediation if possible.

Is it that we need to disable remediation for everything else before we clean our target cluster?

cunningr avatar Jul 06 '22 20:07 cunningr

The script I shared removes errors from CRs of a single WC in a specific namespace. The remediation part is to disable the MachineHealthCheck controller temporarily while fixing machines. Otherwise, the controller can start deleting and recreating machines in the middle of the process, according to the maxUnhealthy value.

I also want to emphasize that this is just a workaround until we get a real fix from CAPO. It should be fixed in CAPO. See https://github.com/kubernetes-sigs/cluster-api/issues/6790

erkanerol avatar Jul 07 '22 09:07 erkanerol

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 05 '22 09:10 k8s-triage-robot

/remove-lifecycle stale

kopiczko avatar Oct 05 '22 19:10 kopiczko