AKS Clusters running on HCI Stack show offline and are not working
Is your issue related to a Jumpstart scenario, ArcBox, HCIBox, or Agora? My issue is related to AKS on HCI https://azurearcjumpstart.io/azure_jumpstart_hcibox/AKS
Describe the issue or the bug I deployed the environment in Australia East and added AKS to it. I then added a second AKS cluster via the GUI (see below). Both clusters were showing online and were accessible. I then switched off the HCI client to save on running costs. I restarted it a couple of days ago, and after a good ten minutes the clusters came back online, with all green checks on the overview page. I left it running for a couple of days, and now the AKS clusters are offline. I restarted the HCI client again, but nothing changed: the AKS clusters are still offline and not working.
To Reproduce I've noticed the AKS cluster breaking in previous installations too, with no apparent reason. The only action I took was to switch off the client after a successful deployment. Restarting the client once brings the AKS cluster back, but it then breaks (i.e. goes offline and can't be accessed) for good.
Expected behavior AKS cluster is available at all times.
Environment summary Latest tools.
Have you looked at the Troubleshooting and Logs section?
Screenshots
Azure portal shows the below:
Additional context I noticed this behaviour in previous deployments; it's not the first time.
Hi All, we are also facing the same issue with Jumpstart HCIBox. It was working before and started breaking 3 days ago.
Best Regards, Ajay
Can you try running Repair-AksHciClusterCerts following this doc on the AKS cluster?
https://learn.microsoft.com/en-us/azure/aks/hybrid/reference/ps/repair-akshciclustercerts
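For anyone following along, here is a minimal sketch of what running that cmdlet might look like. It assumes an elevated PowerShell session on a host where the AksHci PowerShell module is installed, which may not be the case in every HCIBox deployment. The cluster name `my-workload-cluster` is a hypothetical placeholder; the linked doc lists the optional switches.

```powershell
# List the AKS workload clusters known to this host, to confirm the exact cluster name.
Get-AksHciCluster

# Attempt to repair expired certificates on the named workload cluster.
# 'my-workload-cluster' is a hypothetical placeholder; use the name returned above.
Repair-AksHciClusterCerts -Name 'my-workload-cluster'
```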
Thanks Dale.
I tried, but I get the error below
@dkirby-ms , should this be run on AZHOST1? Thanks!
@dkirby-ms I ran the command on AZHOST1 and got this error back:
Thanks. I've got an environment up with a working AKS cluster. To try to repro, I will shut down the HCIBox VM, restart it, and check to see what happens with the AKS cluster.
@dkirby-ms, I'm happy to do a screen share so you can see live what happens in my environment. Thanks for looking into this.
I've reproduced the issue. I was not able to recover the AKS workload cluster, but I was able to delete the cluster and recreate a new one without the issue.
I believe the issue is related to restarting the Azure host. The VM-Router is not available before the HCI nodes recover, and AKS networking fails due to VM-Router being down.
I am investigating a "graceful" shutdown/restart process that may mitigate this issue in the future. However, the root cause is shutting down the Azure host.
The only current resolution for this is to leave the host running until we have published a "graceful shutdown" process for HCIBox. This process will look roughly like this:
Shutdown:
- Shut down the HCI cluster by putting one cluster node in maintenance/drain mode
- Wait for storage to sync
- Shut down the cluster from the remaining node
- Shut down the AZSMGMT VM
- Shut down the Azure VM
Restart:
- Start Azure VM
- Start AZSMGMT VM
- Start cluster nodes
- Check cluster health
There are going to be some other caveats too, such as the ARB's ability to recover gracefully. We will need to evaluate that separately from this issue.
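The steps above could be sketched in PowerShell roughly as follows. This is an unvalidated illustration, not the published HCIBox procedure. It assumes it is run on the Azure VM that hosts the nested Hyper-V VMs, with the FailoverClusters, Storage, and Hyper-V modules available and credentials that can reach the cluster. The function names and all parameter values are hypothetical; only AZSMGMT comes from this thread. Shutting down the node VMs after stopping the cluster, and resuming the drained node on restart, are my assumptions and are not in the list above.

```powershell
# Hypothetical helper: shut down in the order listed above.
function Stop-HciBoxSketch {
    param(
        [string]$ClusterName,              # HCI cluster name (placeholder)
        [string]$NodeToDrain,              # cluster node to put in maintenance mode
        [string[]]$NodeVmNames,            # nested Hyper-V VMs that host the HCI nodes
        [string]$MgmtVmName = 'AZSMGMT'
    )
    # 1. Put one node in maintenance mode and drain its roles.
    Suspend-ClusterNode -Name $NodeToDrain -Cluster $ClusterName -Drain -Wait

    # 2. Wait for storage to sync: poll while any storage job is still running.
    while (Get-StorageJob -CimSession $ClusterName | Where-Object { $_.JobState -eq 'Running' }) {
        Start-Sleep -Seconds 30
    }

    # 3. Stop the cluster, then shut down the node VMs (assumption) and the management VM.
    Stop-Cluster -Cluster $ClusterName -Force
    Stop-VM -Name $NodeVmNames
    Stop-VM -Name $MgmtVmName

    # 4. Deallocate the Azure VM from outside it, for example from Cloud Shell:
    #    Stop-AzVM -ResourceGroupName '<resource-group>' -Name '<azure-vm-name>'
}

# Hypothetical helper: start up in the order listed above, after the Azure VM is running.
function Start-HciBoxSketch {
    param(
        [string]$ClusterName,
        [string]$NodeToDrain,
        [string[]]$NodeVmNames,
        [string]$MgmtVmName = 'AZSMGMT'
    )
    Start-VM -Name $MgmtVmName
    Start-VM -Name $NodeVmNames

    # Bring the cluster up and take the drained node out of maintenance mode (assumption).
    Start-Cluster -Name $ClusterName
    Resume-ClusterNode -Name $NodeToDrain -Cluster $ClusterName -Failback Immediate

    # Check cluster health: node state and virtual disk status.
    Get-ClusterNode -Cluster $ClusterName
    Get-VirtualDisk -CimSession $ClusterName |
        Select-Object FriendlyName, HealthStatus, OperationalStatus
}
```

Whether this ordering actually avoids the VM-Router problem described above has not been confirmed in this thread.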
@dkirby-ms Thanks for the above analysis. As per the above, there is no way to recover the AKS cluster. Unfortunately, we had most of our services installed there, so please let us know if there is any way we can recover it. In parallel, we will also try installing our services on the new AKS cluster in the new HCIBox we created.
Since it's a Kubernetes cluster, couldn't you just redeploy the services that were running on it via the original manifests?
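As a rough sketch of that idea, assuming the original manifests are kept in a local folder and the current kubeconfig context already points at the replacement cluster (the `.\manifests` path is a hypothetical placeholder):

```powershell
# Hypothetical folder holding the original Kubernetes manifests.
$manifestDir = '.\manifests'

# Confirm kubectl is pointed at the new cluster before applying anything.
kubectl config current-context

# Re-apply every manifest in the folder, including subfolders.
kubectl apply --recursive -f $manifestDir

# Watch the workloads come back up.
kubectl get pods --all-namespaces
```

This only recreates the Kubernetes objects. Data that lived on the old cluster's persistent volumes is not restored by re-applying manifests.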
@dkirby-ms, is it OK to delete the broken cluster(s) from the portal and recreate a new one from there, or is PowerShell the recommended way? As soon as the main host is switched off, the cluster will break again, so I'm looking for the most efficient way to rebuild a cluster on the fly. Thank you again for looking into this.
I have deleted clusters from the portal and recreated without issue before so I think you can try that. It may be helpful to use a different cluster name just to avoid any potential naming conflicts.
@dkirby-ms I agree, but we had quite a few dependent services such as SQL MI, PostgreSQL, FluxCD, Kafka, Istio, and some more, which took a lot of time and effort to set up. We also wanted to investigate why our pods were not starting up when this happened. If we recreate those clusters, all of that setup and testing effort has to be repeated, and if the cluster goes down again, it all has to be done once more.
Hi,
What is the resolution for this? Recreate the cluster every time?
Thanks
Luca Ottonari
Thank you and Kind Regards
AZURE CLOUD SOLUTION ARCHITECT @.***
That's correct. We don't have a "graceful shutdown" option for the HCIBox VM host right now. Until one is available, we know of no other workaround to recover the AKS cluster.
Dale Kirby Principal Partner Solution Architect Microsoft Global Partner Solutions