cluster-api-provider-openstack
Handle transient OpenStack API errors for OpenStackMachines
/kind feature
OpenStackMachine reconciliation stops (forever) on transient and non-transient OpenStack API errors
- the reconciler does not differentiate between
  - transient API errors, e.g., InternalServerError or connectivity problems
  - non-transient errors, e.g., an OpenStack server in ERROR state
- this leads to reconciliation stopping for good, even though a subsequent attempt would most likely succeed
Describe the solution you'd like
I would like CAPO to handle transient OpenStack API errors for OpenStackMachine deployments.
- Do not skip reconciliation for transient errors from the OpenStack API
- Do not set FailureMessage/FailureReason on openstackmachine.status for temporary errors.
- rethink when to return ctrl.Result{}, error, etc.
Anything else you would like to add:
- lines where the reconciliation is skipped
- pay attention to FailureMessage hints about transient errors
- have a look at https://github.com/kubernetes-sigs/cluster-api-provider-openstack/issues/832, as conditions could help avoid always using the "UpdateMachineError" https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/8ae251502dbbcdd5685521272c7b4a5659152786/controllers/openstackmachine_controller.go#L403
Sean Schneeweiss [email protected], Daimler TSS GmbH, Provider Information
+1 to make this happen, we should handle transient API errors, e.g., InternalServerError, connectivity problems, instead of trying once and then giving up.
The question is how often we reconcile (I think k8s has some kind of back-off mechanism but I don't know the details, maybe 1s->2s->4s->8s->16s etc.?) so that we can avoid a flood of API calls to the cloud?
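For reference: controller-runtime's default per-item back-off is exactly that kind of exponential sequence (client-go defaults to a 5ms base delay doubling on every failure up to a 1000s cap, combined with an overall 10 qps token bucket), so retries slow down quickly and should not flood the cloud API. A minimal sketch of making the limiter explicit when wiring the controller, assuming a recent controller-runtime; the import path and the `OpenStackMachineReconciler` receiver stand in for CAPO's actual ones:

```go
package controllers

import (
	"time"

	"k8s.io/client-go/util/workqueue"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"

	infrav1 "sigs.k8s.io/cluster-api-provider-openstack/api/v1alpha4" // assumed import path
)

// SetupWithManager wires the reconciler with an explicit per-item rate
// limiter. The values below are the client-go defaults, spelled out so the
// retry cadence for transient errors is visible and tunable: retry after
// 5ms, doubling on every failure, capped at 1000s.
func (r *OpenStackMachineReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&infrav1.OpenStackMachine{}).
		WithOptions(controller.Options{
			RateLimiter: workqueue.NewItemExponentialFailureRateLimiter(
				5*time.Millisecond, 1000*time.Second),
		}).
		Complete(r)
}
```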
What's the problem with just re-running the reconciliation? The reason I ask is that I'd very much like to split machine creation into multiple parts (e.g. root volume, ports, instance) and reconcile them somewhat independently. The primary goal here would be to make failure handling simpler and more robust by aborting and restarting in more cases. So in the case of a transient failure of a port creation, for example, you would abort the whole reconcile and retry after a back-off. On retry you wouldn't have to recreate the root volume because that's already complete. You also wouldn't have to recreate any ports that were previously successfully created, because you wouldn't have had to delete ports when aborting the previous reconcile: progress is reflected in the status object.
So I'd like to systematically avoid the requirement to retry unreliable operations everywhere we do them by breaking creation up into multiple components, and ensuring each component is idempotent.
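A self-contained sketch of that shape, with entirely hypothetical status fields and helpers (createRootVolume, createPort), just to illustrate "progress is reflected in the status object":

```go
package sketch

import (
	"context"
	"fmt"
)

// Hypothetical, simplified stand-ins for the real machine status fields.
type machineStatus struct {
	RootVolumeID string   // non-empty once the root volume exists
	PortIDs      []string // grows as ports are created
}

// reconcileCreate resumes wherever the previous attempt stopped: each step
// consults recorded progress before acting, so aborting on a transient error
// and retrying after a back-off never recreates completed resources.
func reconcileCreate(ctx context.Context, status *machineStatus, wantPorts int) error {
	if status.RootVolumeID == "" {
		id, err := createRootVolume(ctx) // hypothetical idempotent helper
		if err != nil {
			return fmt.Errorf("creating root volume: %w", err)
		}
		status.RootVolumeID = id
	}
	for i := len(status.PortIDs); i < wantPorts; i++ {
		id, err := createPort(ctx, i) // hypothetical idempotent helper
		if err != nil {
			// Abort the whole reconcile; ports already in status survive.
			return fmt.Errorf("creating port %d: %w", i, err)
		}
		status.PortIDs = append(status.PortIDs, id)
	}
	return nil // all components exist; safe to create the instance next
}

func createRootVolume(ctx context.Context) (string, error) { return "vol-1", nil }
func createPort(ctx context.Context, i int) (string, error) {
	return fmt.Sprintf("port-%d", i), nil
}
```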
However, I very much agree that we don't want to put transient errors in FailureMessage. That's problematic.
What's the problem with just re-running the reconciliation?
You mean by the controller and not manually, right?
Yes, just return the transient error from the reconciler, log it (but don't write it to FailureMessage), then re-run the reconciler after a back-off.
I've started throwing together how this might look for root volume creation in https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1118. Context:
The return type of these 2 reconcile functions is (success, *OpenStackError). success==true indicates to the controller that the reconciliation completed; false means the resource was not created and the function needs to be called again. Note that if we created ports the same way, we could create ports while waiting for the root volume to become available.
OpenStackError is a wrapper type which indicates to the controller whether an error is fatal or not. The controller will mark the machine failed iff the error is fatal, otherwise it will log the error and retry after a back-off.
The reconcile methods update status but don't persist changes. The controller is left to patch the updated status object in an efficient manner.
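Condensed, the shape described above could look like this (illustrative names only, not the actual PR #1118 code):

```go
package sketch

import ctrl "sigs.k8s.io/controller-runtime"

// OpenStackError wraps an error with a fatality flag.
type OpenStackError struct {
	err   error
	fatal bool
}

func (e *OpenStackError) Error() string { return e.err.Error() }
func (e *OpenStackError) IsFatal() bool { return e.fatal }

// transientError and fatalError are the two ways a reconcile function can
// fail; only fatal errors may mark the machine failed.
func transientError(err error) *OpenStackError { return &OpenStackError{err: err} }
func fatalError(err error) *OpenStackError     { return &OpenStackError{err: err, fatal: true} }

// handleResult maps the (success, *OpenStackError) return of a reconcile
// function onto controller-runtime semantics.
func handleResult(success bool, osErr *OpenStackError) (ctrl.Result, error) {
	switch {
	case osErr == nil && success:
		return ctrl.Result{}, nil // component is complete
	case osErr == nil:
		return ctrl.Result{Requeue: true}, nil // not finished; call again
	case osErr.IsFatal():
		// set FailureReason/FailureMessage on the machine status here
		return ctrl.Result{}, nil // a terminally failed machine is not requeued
	default:
		return ctrl.Result{}, osErr // transient: log and retry after back-off
	}
}
```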
Unfortunately I don't know when I'm going to get time to work on this properly, but this is an indication of what I want to do in this space. Would a solution like this address the problems you are seeing?
Yes, just return the transient error from the reconciler, log it (but don't write it to FailureMessage), then re-run the reconciler after a back-off.
That is exactly the way I was going for.
Thank you very much for the draft, really appreciate the effort.
I think it has the right outcome, but I hoped to make it simpler than always having to return a fatalError or transientError.
Looking at CAPA, they check the InstanceState, and if a non-transient error occurs they set the FailureReason and FailureMessage.
And the reconciliation checks at the beginning (as we do now too), if the machine has failed.
Looks like we are already doing something like that
https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/8ae251502dbbcdd5685521272c7b4a5659152786/controllers/openstackmachine_controller.go#L338-L341
The problem just seems to be that we call the function handleUpdateMachineError way too often.
So I guess the easy solution would be to just remove handleUpdateMachineError at all places except the ones where it is valid to set the ErrorReason.
PS: Why is it called ErrorReason and not FailureReason as everywhere else :smile: I guess we could change that.
Coincidentally, we were discussing exactly this issue for OpenShift just the other day, and this was our suggestion also. Our reasoning was that the error state for a machine is terminal, but the error state for an OpenStack server can be transient, not least because an operator can fix it in many cases.
I agree that InstanceStateError specifically should not put the machine in the error state.
However, we should check what the backoff strategy is for reconciliation. If it was every 10 seconds, for example, this could easily result in an API storm if a partial OpenStack outage resulted in a large number of servers in an error state.
Incidentally, something we also discussed for OpenShift (and may implement) is some kind of 'user-intervention-required' annotation on the Machine. So if the server enters an error state we don't mark the Machine as failed, but we do set an annotation on it which prevents any further attempts at reconciliation. The user can manually remove this annotation and we will attempt to reconcile it again, replacing the annotation if it is still in an error state. I'd be very interested to know how you folks would feel about this kind of mechanism.
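A rough sketch of that mechanism; the annotation name here is entirely hypothetical:

```go
package sketch

// While this annotation is present, the controller skips reconciliation of
// the Machine; a human removes it to allow another attempt.
const userInterventionAnnotation = "openstackmachine.infrastructure.cluster.x-k8s.io/user-intervention-required"

// needsIntervention gates the top of Reconcile.
func needsIntervention(annotations map[string]string) bool {
	_, found := annotations[userInterventionAnnotation]
	return found
}

// markNeedsIntervention is called instead of setting FailureReason when the
// server enters an error state; the value documents why it was set.
func markNeedsIntervention(annotations map[string]string, reason string) {
	annotations[userInterventionAnnotation] = reason
}
```

The reason string doubles as documentation for the user of why reconciliation was halted.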
If it was every 10 seconds, for example, this could easily result in an API storm if a partial OpenStack outage resulted in a large number of servers in an error state.
That's my concern too; we should avoid too many API calls.
So if the server enters an error state we don't mark the Machine as failed, but we do set an annotation on it which prevents any further attempts at reconciliation. The user can manually remove this annotation and we will attempt to reconcile it again, replacing the annotation if it is still in an error state.
I think it's a reasonable solution; I am only curious whether removing the annotation would retrigger the reconcile, or whether we would have to wait for the next timed cycle?
Removing the annotation would trigger immediate reconciliation as it's a change to the object.
I wonder if we should discuss this idea in cluster-api if it's something we're likely to consider seriously. Perhaps they already have thoughts in this area which make this moot. If not it would probably be best to be consistent across providers with an approach like this.
Setting the paused annotation would be a quick fix to get this kind of behavior. The controller already skips reconciliation when cluster.x-k8s.io/paused exists.
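For reference, the existing check is roughly this, using CAPI's util/annotations helper (the API version path was v1alpha4 at the time of this thread):

```go
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha4"
	"sigs.k8s.io/cluster-api/util/annotations"
)

// skipReconcile reports whether reconciliation should be skipped because the
// owning Cluster or the OpenStackMachine itself carries cluster.x-k8s.io/paused.
func skipReconcile(cluster *clusterv1.Cluster, machine metav1.Object) bool {
	return annotations.IsPaused(cluster, machine)
}
```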
However, I'm currently leaning more towards setting the errorReason only for those three cases:
- InstanceStateError https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/f83520c503425512a2efdf0ef55ad4cc85a30902/controllers/openstackmachine_controller.go#L338-L341
- Unknown instance state https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/f83520c503425512a2efdf0ef55ad4cc85a30902/controllers/openstackmachine_controller.go#L346-L350
- Instance not found https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/f83520c503425512a2efdf0ef55ad4cc85a30902/controllers/openstackmachine_controller.go#L312-L315
All other occurrences that previously set the errorReason are adapted to just return an error, e.g.
// remove call of handleUpdateMachineError
return ctrl.Result{}, errors.Wrap(err, "reason for failure")
These three cases are very unlikely to be fixed automatically. They can be resolved by:
- manually fixing configuration, fixing OpenStack problem, ...
- then deleting Machine / OpenStackMachine
- CAPI will ensure that the machine is replaced by a new one
In the long run, CAPI will handle the deletion of broken machines with its remediation logic:
"If a Machine fails for any reason (if the FailureReason is set), the Machine will be remediated immediately" (source)
However, I'm currently leaning more towards setting the errorReason only for those three cases:
- InstanceStateError https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/f83520c503425512a2efdf0ef55ad4cc85a30902/controllers/openstackmachine_controller.go#L338-L341
- Unknown instance state https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/f83520c503425512a2efdf0ef55ad4cc85a30902/controllers/openstackmachine_controller.go#L346-L350
- Instance not found https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/f83520c503425512a2efdf0ef55ad4cc85a30902/controllers/openstackmachine_controller.go#L312-L315
Interestingly, 2 of these 3 cases directly relate to states which are sometimes recoverable and which we've had real-world customer reports about.
- InstanceStateError
A VM in an error state after initial creation is almost always recoverable if you're sufficiently determined. It sometimes requires operator action, but depending on the cause hard rebooting it is frequently enough to bring it back (this is one of the things hard reboot is specifically intended to do).
If the instance entered error during creation or shortly after it's probably not worth the effort, but there are cases where recovering an error machine is worth it, and users certainly sometimes want to have this choice.
- Unknown instance state
Actually we MUST NOT assume this is fatal. This could cause us to mark all machines as failed if Nova introduces a new instance state. We also specifically don't want to mark machines failed which are migrating, shutoff, paused, in whatever the post-resize confirm state is, etc. Better not to try to enumerate all the states and instead only handle the ones we're specifically interested in.
- Instance not found
This is a really interesting one that I've seen at least 2 reports of (I think I also saw it on vSphere?). An operator rotating credentials accidentally used credentials from the wrong environment (in one specific case it was their staging environment). This caused all their machines to be marked failed because (the wrong) OpenStack positively reported that the machines did not exist.
I've actually been thinking of ways to try to mitigate this. I think we want to store a project ID in the OpenStackMachineStatus, and refuse to reconcile if the current project ID does not correspond.
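A hypothetical sketch of that guard, assuming a new ProjectID field on OpenStackMachineStatus recorded on first reconcile:

```go
package sketch

import "fmt"

// checkProjectID refuses to act when the current credentials resolve to a
// different project than the one the machine was created in.
func checkProjectID(recordedProjectID *string, currentProjectID string) error {
	if *recordedProjectID == "" {
		*recordedProjectID = currentProjectID // first reconcile: remember it
		return nil
	}
	if *recordedProjectID != currentProjectID {
		return fmt.Errorf("credentials resolve to project %q but the machine was created in project %q; refusing to reconcile",
			currentProjectID, *recordedProjectID)
	}
	return nil
}
```

With a guard like this, the wrong-credentials scenario would become a loud, recoverable reconcile error instead of a terminal "instance not found" failure.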
This one specifically is currently my best argument for a 'recoverable' failed state. Putting the machine in failed is certainly correct in this instance. Unless the user has done something dumb, this machine is not coming back. However, if the user did do something dumb, it would be nice to give them a way out of it. (And also we should try to prevent them from doing it in future.)
Although this one specifically can (and should) be mitigated, I doubt this is the only way a temporary human or technical failure may mark machines failed which are not truly failed.
That is very valuable input, thank you. Sounds very reasonable too.
So to summarize, the only cases where we would set the errorReason and errorMessage on the OpenStackMachine are:
- InstanceStateError
- Instance not found
If the VM can be fixed manually, e.g. by hard rebooting, the user can resume reconciliation of the OpenStackMachine by patching the status.
A nice tool to do that:
kubectl krew update
kubectl krew install edit-status
kubectl edit-status -n my-cluster osm myBrokenOpenStackMachine
Or patch by using curl after running kubectl proxy in a separate shell:
curl -XPATCH "http://localhost:8001/apis/infrastructure.cluster.x-k8s.io/v1alpha4/namespaces/my-cluster/openstackmachines/myBrokenOpenStackMachine/status" \
  -H "Content-Type: application/merge-patch+json" \
  -d '{
    "status": {
      "errorMessage": null,
      "errorReason": null
    }
  }'
Just an idea:
A different approach would be to set the paused annotation cluster.x-k8s.io/paused as it can be easily removed after the OpenStackMachine was fixed. Also we could set a value that indicates which controller added the annotation.
cluster.x-k8s.io/paused: "information why and who added the annotation"
This or something like it does sound like a better solution than asking the user to patch the status. However, I think we should discuss this in CAPI so we don't end up being inconsistent. For now I think you're right that we should use the failed status we have.
For "Instance not found" we should certainly set failed. For InstanceStateError... I remain undecided whether we should continue to retry, but for now it's certainly not a change in behaviour.
There might be one other useful case for a failed machine: a non-retryable error during resource creation. This typically indicates invalid parameters, so retrying will not achieve anything. A counter argument may be that the invalid parameter was a reference to a resource which has not yet been created and may exist when we retry. However, certainly for resources managed by the controller we don't allow this to happen.
a non-retryable error during resource creation. This typically indicates invalid parameters, so retrying will not achieve anything.
I think invalid params mostly lead to API-layer validation failures and never reach the resource creation stage. An error state on the resource usually comes from a compute/volume-layer error, e.g. some agent failed or some action couldn't be fulfilled, so retrying later seems a valid approach.
For "Instance not found" we should certainly set failed.
Could someone accidentally deleting the VM cause it to end up in the not-found state? Should we re-create it in our reconcile logic? I understand this would cause some data inconsistency, e.g. the UUID, but that should not be a blocker?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
CAPZ reconciles machines if they are in a transient error state:
https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/0c85feca9ee6f8ba4e566b959bd50981d80a79eb/controllers/azuremachine_controller.go#L304-L324
I like how CAPZ is handling transient errors.
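Paraphrasing the shape of the linked CAPZ code, with hypothetical names (CAPZ's actual types differ):

```go
package sketch

import (
	"errors"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// TransientAwareError is a hypothetical stand-in for the typed reconcile
// error CAPZ uses: it knows whether the failure is transient and how long
// to wait before the next attempt.
type TransientAwareError interface {
	error
	IsTransient() bool
	RequeueAfter() time.Duration
}

// handleReconcileError mirrors the shape of the linked handling: transient
// failures are requeued quietly, without touching FailureReason or
// FailureMessage; everything else is surfaced as a real error.
func handleReconcileError(err error) (ctrl.Result, error) {
	var tErr TransientAwareError
	if errors.As(err, &tErr) && tErr.IsTransient() {
		return ctrl.Result{RequeueAfter: tErr.RequeueAfter()}, nil
	}
	return ctrl.Result{}, err
}
```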
Just for reference: CAPA does not reconcile transient errors
https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/823459d83416f6d3b063ac9496be4879141ed66d/controllers/awsmachine_controller.go#L434-L445
https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/823459d83416f6d3b063ac9496be4879141ed66d/controllers/awsmachine_controller.go#L541-L547
What are the next steps on this, @mdbooth / @jichenjc / @seanschneeweiss / @tobiasgiese?
Should we implement transient error handling for OpenStackMachines similar to CAPZ, or are you planning to discuss this topic with the CAPI folks first?
I like the CAPZ idea and I think we can follow CAPZ. The question becomes how to define transient vs. persistent errors.
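One possible starting point for that split, assuming gophercloud's typed ErrDefaultNNN errors (which is what CAPO's API calls return); where exactly to draw the line, e.g. for 409 conflicts, is the judgment call to make here:

```go
package sketch

import (
	"errors"
	"net"

	"github.com/gophercloud/gophercloud"
)

// isTransient sketches one possible split: HTTP 429 and 5xx responses plus
// network timeouts are worth retrying; other errors (typically 4xx client
// errors) are treated as persistent.
func isTransient(err error) bool {
	switch err.(type) {
	case gophercloud.ErrDefault429, gophercloud.ErrDefault500, gophercloud.ErrDefault503:
		return true
	}
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true // connectivity problem rather than a bad request
	}
	return false
}
```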
I'm out of office this week. So if anyone finds time this week already, I'm happy to join them next week. Otherwise I can start with this after being back.
Hi. I am reading this thread trying to figure out how I can reset the osm status today. We had an OpenStack API outage; some compute nodes were rebooted and several osms were detected as SHUTOFF.
Status:
  Addresses:
    Address: 1.1.1.1
    Type: InternalIP
  Error Message: OpenStack instance state "SHUTOFF" is unexpected
  Error Reason: UpdateError
  Instance State: SHUTOFF
  Ready: true
The instances have been recovered and are now up, but the OpenStackMachines are stuck in this error state.
I am curious whether there is any way to reset the status of those without deleting the Machines and having them recreated?
@cunningr We started to use kubectl patch --subresource='status' command to remove failureReason and failureMessage from CRs.
The steps from our runbook:
- Ensure that your kubectl version is >= 1.24, since the script below uses the --subresource flag to patch the status of CRs.
- Copy/paste the script below into remove-errors.sh
- Run it:
bash remove-errors.sh <namespace-of-cluster> <name-of-cluster>
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

function annotate_machines_to_disable_remediation(){
  local namespace="$1"
  kubectl -n "$namespace" annotate machine --all "cluster.x-k8s.io/skip-remediation"=""
}

function deannotate_machines_to_enable_remediation(){
  local namespace="$1"
  kubectl -n "$namespace" annotate machine --all "cluster.x-k8s.io/skip-remediation"-
}

function clean_all_machines(){
  local namespace="$1"
  machines=$(kubectl get machine -n "$namespace" -o=custom-columns=NAME:.metadata.name --no-headers)
  for machine in $machines
  do
    clean_machine "$namespace" "$machine"
  done
}

function clean_all_openstackmachines(){
  local namespace="$1"
  openstackmachines=$(kubectl get openstackmachine -n "$namespace" -o=custom-columns=NAME:.metadata.name --no-headers)
  for openstackmachine in $openstackmachines
  do
    clean_openstack_machine "$namespace" "$openstackmachine"
  done
}

function clean_openstack_machine(){
  local namespace="$1"
  local name="$2"
  kubectl patch openstackmachine "$name" -n "$namespace" \
    -p='{"status":{"failureReason":null, "failureMessage":null}}' \
    --subresource='status' \
    --type="merge"
}

function clean_machine(){
  local namespace="$1"
  local name="$2"
  kubectl patch machine "$name" -n "$namespace" \
    -p='{"status":{"failureReason":null, "failureMessage":null, "phase":null}}' \
    --subresource='status' \
    --type="merge"
}

function clean_openstack_cluster(){
  local namespace="$1"
  local name="$2"
  kubectl patch openstackcluster "$name" -n "$namespace" \
    -p='{"status":{"failureReason":null, "failureMessage":null}}' \
    --subresource='status' \
    --type="merge"
}

function clean_cluster(){
  local namespace="$1"
  local name="$2"
  kubectl patch cluster "$name" -n "$namespace" \
    -p='{"status":{"failureReason":null, "failureMessage":null, "phase":null}}' \
    --subresource='status' \
    --type="merge"
}

namespace="$1"
cluster="$2"

annotate_machines_to_disable_remediation "$namespace"
clean_all_openstackmachines "$namespace"
clean_all_machines "$namespace"
clean_openstack_cluster "$namespace" "$cluster"
clean_cluster "$namespace" "$cluster"
deannotate_machines_to_enable_remediation "$namespace"
Thank you @erkanerol! Apologies, but I am a little new to Cluster API.
Looking at the script above, we seem to be modifying all machines and not just the target cluster? Our CAPI cluster is managing multiple workload clusters, and I would rather we could target a single cluster for remediation if possible.
Is it that we need to disable remediation for everything else before we clean our target cluster?
The script I shared removes errors from the CRs of a single workload cluster in a specific namespace. The remediation part temporarily disables the MachineHealthCheck controller while fixing machines. Otherwise, the controller can start deleting and recreating machines in the middle of the process, according to the maxUnhealthy value.
I want to also emphasize that it is just a workaround until we get a real fix from CAPO. It should be fixed in CAPO. See https://github.com/kubernetes-sigs/cluster-api/issues/6790
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale