Kubernetes-acs-engine-autoscaler

Scale out deployments fail with OS provisioning failure

Open yaron-idan opened this issue 7 years ago • 6 comments

We've been using the auto-scaler for a few months now with great satisfaction, but a few days ago our deployments started failing. The error is consistent and has a peculiar pattern - deployments fail with the following message:

Message: At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details.
Exception Details:
	Error Code: Conflict
	Message: {
  "status": "Failed",
  "error": {
    "code": "ResourceDeploymentFailure",
    "message": "The resource operation completed with terminal provisioning state 'Failed'.",
    "details": [
      {
        "code": "OSProvisioningTimedOut",
        "message": "OS Provisioning for VM 'k8s-devops-69325501-4' did not finish in the allotted time. The VM may still finish provisioning successfully. Please check provisioning state later."
      }
    ]
  }
}
	Target: None
	Error Code: Conflict
	Message: {
  "status": "Failed",
  "error": {
    "code": "ResourceDeploymentFailure",
    "message": "The resource operation completed with terminal provisioning state 'Failed'.",
    "details": [
      {
        "code": "OSProvisioningTimedOut",
        "message": "OS provisioning failure has reached terminal state and is non-recoverable for VM 'k8s-devops-69325501-3'. Consider deleting and recreating this virtual machine. Additional Details: OS Provisioning for VM 'k8s-devops-69325501-3' did not finish in the allotted time. The VM may still finish provisioning successfully. Please check provisioning state later."
      }
    ]
  }
}
	Target: None

If I delete the failed resources manually and wait for the next scaling event, the nodes scale just fine. But the next time the cluster scales in and back out, those same nodes fail again.

Another interesting pattern is that the same nodes fail every time (in my case nodes 1, 3 and 4), while every other node is created successfully.

What could be the cause of this strange issue? What is the difference between removing the VM, NIC and OS disk manually through the Azure UI and the way the autoscaler performs these steps? From looking into the code and checking the Azure Python SDK, I expected the operations to be identical.
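For context, here is a minimal sketch (not the autoscaler's actual code) of what those delete operations look like with the Azure Python SDK of that era, assuming managed disks and assuming the NIC and OS disk names follow the node's naming convention - all of which are assumptions for illustration:

from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.network import NetworkManagementClient

credentials = ServicePrincipalCredentials(
    client_id="<client-id>", secret="<client-secret>", tenant="<tenant-id>"
)
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"

compute = ComputeManagementClient(credentials, subscription_id)
network = NetworkManagementClient(credentials, subscription_id)

node = "k8s-devops-69325501-4"  # one of the failed nodes from the error above

# 1. Delete the VM and block until the long-running operation finishes.
compute.virtual_machines.delete(resource_group, node).wait()

# 2. Delete the NIC that was attached to the VM (name assumed for illustration).
network.network_interfaces.delete(resource_group, node + "-nic").wait()

# 3. Delete the OS disk (managed-disk case; with unmanaged disks the VHD blob
#    in the storage account would be deleted instead).
compute.disks.delete(resource_group, node + "-osdisk").wait()

Whether the resources are removed through the portal or through calls like these, the same ARM delete operations should end up being issued, which is why the difference in behaviour is so surprising.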

Any help would be appreciated, thanks.

yaron-idan avatar Jan 28 '18 18:01 yaron-idan

I haven't released a new version of the autoscaler in quite a while, so I doubt this is an issue with the autoscaler itself.

What is the difference between removing the VM, NIC and osdisk manually using the azure UI and the way the autoscaler performs these steps?

There should be none.

Any help would be appreciated, thanks.

I would suggest opening a ticket with Azure, specifying the region where your cluster is hosted and the VM size. This can happen, for example, when a certain type of VM is in too much demand in a single DC.

wbuchwalter avatar Jan 29 '18 16:01 wbuchwalter

Well, I did open a ticket with Azure and managed to reproduce the error using the code in this gist - https://gist.github.com/yaron-idan/91a1193e40cb0da5ce42a106bf1a91e0 - which is essentially a duplicate of the code the autoscaler uses. A Support Escalation Engineer from Azure pointed out that using the latest dependencies fixes the issue, and testing confirms he is correct. Is there any objection to me opening a PR that updates the required dependencies?
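(Not from the gist, but for reference) a quick way to check which Azure SDK versions are actually installed before and after bumping the dependencies - the azure-mgmt-* package names here are just an assumption about which ones are relevant:

import pkg_resources  # ships with setuptools

for pkg in ("azure-mgmt-compute", "azure-mgmt-network", "azure-mgmt-resource"):
    try:
        print(pkg, pkg_resources.get_distribution(pkg).version)
    except pkg_resources.DistributionNotFound:
        print(pkg, "not installed")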

yaron-idan avatar Jan 30 '18 11:01 yaron-idan

Is there any objection to me opening a PR updating the required dependencies?

Absolutely not, go ahead.

wbuchwalter avatar Jan 30 '18 14:01 wbuchwalter

Are you still experiencing this issue? It completely disappeared on my side, so it was probably an intermittent Azure issue.

wbuchwalter avatar Feb 22 '18 06:02 wbuchwalter

Well, for us the error is very much not intermittent, but so far it is confined to one very specific resource group (we have two clusters, and only one of them is hitting these errors). Sadly, I can't upgrade the autoscaler to the latest version on that cluster to see if it solves the issue (I'm missing the etcdprivatekey param). I'll create a new cluster soon and see if it shows any symptoms; if it doesn't, I'll close the issue. Thanks for the update!

yaron-idan avatar Feb 22 '18 07:02 yaron-idan

The latest version of the autoscaler doesn't need etcdprivatekey anymore - take a look at the Helm chart on master (you also don't need kubeprivatekey anymore). So you can try it out.

wbuchwalter avatar Feb 22 '18 14:02 wbuchwalter