Kubernetes-acs-engine-autoscaler
Scale-out deployments fail with OS provisioning failure
We've been using the autoscaler for a few months now with great satisfaction, until a few days ago our deployments started failing. The error is consistent and has a peculiar pattern - deployments fail with the following error:
Message: At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details.
Exception Details:
Error Code: Conflict
Message: {
    "status": "Failed",
    "error": {
        "code": "ResourceDeploymentFailure",
        "message": "The resource operation completed with terminal provisioning state 'Failed'.",
        "details": [
            {
                "code": "OSProvisioningTimedOut",
                "message": "OS Provisioning for VM 'k8s-devops-69325501-4' did not finish in the allotted time. The VM may still finish provisioning successfully. Please check provisioning state later."
            }
        ]
    }
}
Target: None
Error Code: Conflict
Message: {
    "status": "Failed",
    "error": {
        "code": "ResourceDeploymentFailure",
        "message": "The resource operation completed with terminal provisioning state 'Failed'.",
        "details": [
            {
                "code": "OSProvisioningTimedOut",
                "message": "OS provisioning failure has reached terminal state and is non-recoverable for VM 'k8s-devops-69325501-3'. Consider deleting and recreating this virtual machine. Additional Details: OS Provisioning for VM 'k8s-devops-69325501-3' did not finish in the allotted time. The VM may still finish provisioning successfully. Please check provisioning state later."
            }
        ]
    }
}
Target: None
If I delete the failed resources manually and wait for the next scaling event, the nodes scale just fine. The next time the cluster scales in and back out, those same nodes fail again.
Another interesting pattern is that the same nodes fail every time (in my case nodes 1, 3 and 4), while every other node is created successfully.
What can be the cause of this strange issue? What is the difference between removing the VM, NIC and OS disk manually in the Azure UI and the way the autoscaler performs these steps? From looking into the code and checking the Azure Python SDK, I expected the operations to be identical.
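For reference, my understanding of the delete sequence boils down to roughly the following. This is only a sketch, assuming the older azure-mgmt-* SDK track (where long-running operations return a poller with .wait(); newer SDKs use begin_delete) and managed disks; the credential, subscription and resource group values are placeholders, not our actual setup.

```python
# Sketch of the VM / NIC / OS disk delete sequence via the Azure Python SDK.
# Assumes the older azure-mgmt-* packages (pollers, not begin_* methods) and managed disks.
from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.network import NetworkManagementClient

credentials = ServicePrincipalCredentials(
    client_id="<client-id>", secret="<client-secret>", tenant="<tenant-id>"
)
subscription_id = "<subscription-id>"
compute = ComputeManagementClient(credentials, subscription_id)
network = NetworkManagementClient(credentials, subscription_id)

resource_group = "<resource-group>"
vm_name = "k8s-devops-69325501-4"

# 1. Look up the VM first, so its NICs and OS disk are still known after it is gone.
vm = compute.virtual_machines.get(resource_group, vm_name)

# 2. Delete the VM and wait for the long-running operation to finish.
compute.virtual_machines.delete(resource_group, vm_name).wait()

# 3. Delete the NICs that were attached to the VM.
for nic_ref in vm.network_profile.network_interfaces:
    nic_name = nic_ref.id.split("/")[-1]
    network.network_interfaces.delete(resource_group, nic_name).wait()

# 4. Delete the managed OS disk (an unmanaged VHD would need a blob delete instead).
os_disk = vm.storage_profile.os_disk
if os_disk.managed_disk is not None:
    compute.disks.delete(resource_group, os_disk.name).wait()
```

As far as I can tell, deleting the same three resources in the Azure portal should end up issuing the same ARM operations.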
Any help would be appreciated, thanks.
I haven't released a new version of the autoscaler in quite a while, so I doubt this is an issue with the autoscaler itself.
What is the difference between removing the VM, NIC and OS disk manually in the Azure UI and the way the autoscaler performs these steps?
There should be none.
Any help would be appreciated, thanks.
I would suggest opening a ticket with Azure support, specifying the region where your cluster is hosted and the VM size. This can happen, for example, when a certain VM type is seeing too much demand in a single datacenter.
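If it helps, here is a quick sketch for pulling the region, VM size and current provisioning state of the failing VMs to attach to the ticket (same older azure-mgmt-compute client style as above; credentials, subscription and resource group are placeholders):

```python
# Collect region, VM size and provisioning state for the VMs named in the error.
from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.compute import ComputeManagementClient

credentials = ServicePrincipalCredentials(
    client_id="<client-id>", secret="<client-secret>", tenant="<tenant-id>"
)
compute = ComputeManagementClient(credentials, "<subscription-id>")

for name in ["k8s-devops-69325501-3", "k8s-devops-69325501-4"]:
    vm = compute.virtual_machines.get("<resource-group>", name, expand="instanceView")
    print(name, vm.location, vm.hardware_profile.vm_size, vm.provisioning_state)
```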
Well, I did open a ticket with Azure and managed to reproduce the error using the code in this gist - https://gist.github.com/yaron-idan/91a1193e40cb0da5ce42a106bf1a91e0 - which is essentially a copy of the code the autoscaler uses. A Support Escalation Engineer from Azure pointed out that using the latest dependencies fixes the issue, and testing confirms he is correct. Is there any objection to me opening a PR updating the required dependencies?
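Before and after bumping the pins, something like this can confirm which SDK versions are actually installed in the autoscaler's environment (the package names are only illustrative; the authoritative list is the project's requirements file):

```python
# Print the installed versions of the Azure-related packages of interest.
import pkg_resources

for pkg in ["azure-mgmt-compute", "azure-mgmt-network", "azure-mgmt-resource", "msrestazure"]:
    try:
        print(pkg, pkg_resources.get_distribution(pkg).version)
    except pkg_resources.DistributionNotFound:
        print(pkg, "not installed")
```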
Is there any objection to me opening a PR updating the required dependencies?
Absolutely not, go ahead.
Are you still experiencing this issue? It completely disappeared on my side, so it was probably an intermittent Azure issue.
Well, for us the error is very much not intermittent - so far it persists, but only in one specific resource group (we have two clusters, and only one of them is hitting these errors).
Sadly, I cannot upgrade the autoscaler to the latest version on that cluster to see if it solves the issue (I'm missing the etcdprivatekey param).
I'll create a new cluster soon and see if it shows any symptoms; if it doesn't, I'll close the issue.
Thanks for the update!
The latest version of the autoscaler doesn't have etcdprivatekey anymore - take a look at the helm chart on master (you also don't need kubeprivatekey anymore). So you can try it out.