azure_arc AKS Clusters running on HCI Stack show offline and are not working

Is your issue related to a Jumpstart scenario, ArcBox, HCIBox, or Agora? My issue is related to AKS on HCI https://azurearcjumpstart.io/azure_jumpstart_hcibox/AKS

Describe the issue or the bug I deployed the environment in Australia East and added AKS to it. I then added a second AKS cluster via the GUI (see below). Both clusters were showing online and were accessible. I then switched off the HCI client to save on running costs. I restarted it a couple of days ago, and after a good ten minutes the clusters came back online, with all green checks on the overview page. I left it running for a couple of days, and now the AKS clusters are offline. I restarted the HCI client, but nothing changed: the AKS clusters are still offline and not working.

To Reproduce I've noticed the AKS cluster breaking in previous installations too, for no apparent reason. The only action I took was to switch off the client after a successful deployment. Restarting the client once brings the AKS cluster back; it then breaks (i.e. goes offline and can't be accessed) for good.

Expected behavior AKS cluster is available at all times.

Environment summary Latest tools.

Have you looked at the Troubleshooting and Logs section?

Screenshots

Azure portal shows the below:

Additional context I noticed this behaviour in previous deployments; it's not the first time.

Aug 09 '24 10:08 LucaOtto

Hi all, we are also facing the same issue with Jumpstart HCIBox. It was working before and started breaking three days ago.

Best Regards, Ajay

Aug 12 '24 18:08 ajaysubramanya86

Can you try running Repair-AksHciClusterCerts following this doc on the AKS cluster?

https://learn.microsoft.com/en-us/azure/aks/hybrid/reference/ps/repair-akshciclustercerts

Aug 14 '24 12:08 dkirby-ms

Thanks Dale. I tried, but I get the error below.

Aug 14 '24 18:08 ajaysubramanya86

@dkirby-ms, should this be run on AZHOST1? Thanks!

Aug 15 '24 10:08 LucaOtto

@dkirby-ms I ran the command on AZHOST1 and got this error back:

Aug 19 '24 14:08 LucaOtto

Thanks. I've got an environment up with a working AKS cluster. To try to repro, I will shut down the HCIBox VM, restart it, and check to see what happens with the AKS cluster.

Aug 20 '24 13:08 dkirby-ms

@dkirby-ms, I'm happy to do a screen share so you can see live what happens in my environment. Thanks for looking into this.

Aug 20 '24 13:08 LucaOtto

I've reproduced the issue. I was not able to recover the AKS workload cluster, but I was able to delete the cluster and recreate a new one without the issue.

I believe the issue is related to restarting the Azure host. The VM-Router is not available before the HCI nodes recover, and AKS networking fails due to VM-Router being down.

I am investigating a "graceful" shutdown/restart process that may mitigate this issue in the future. However, the root cause is shutting down the Azure host.

The only current resolution for this is to leave the host running until we have published a "graceful shutdown" process for HCIBox. This process will look roughly like this:

Shutdown:

Shut down the HCI cluster by putting one cluster node in maintenance/drain mode
Wait for storage to sync
Shut down the cluster from the remaining node
Shut down the AZSMGMT VM
Shut down the Azure VM

Restart:

Start Azure VM
Start AZSMGMT VM
Start cluster nodes
Check cluster health

There are going to be some other caveats too, such as the ARB's ability to recover gracefully. We will need to evaluate that separately from this issue.
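
The ordering constraint in the shutdown and restart lists above can be sketched as a small runner that performs each step strictly in sequence and stops at the first failure. This is only an illustration, not the published HCIBox process: the step names paraphrase the lists above, and `run_in_order` and its `perform` callback are hypothetical names introduced here. What each step actually does (draining a node, stopping a VM) is left to the caller.

```python
from typing import Callable, List

# Step names paraphrase the lists above; they are labels, not commands.
SHUTDOWN_ORDER: List[str] = [
    "Drain one HCI cluster node (maintenance mode)",
    "Wait for storage to sync",
    "Shut down the cluster from the remaining node",
    "Shut down the AZSMGMT VM",
    "Shut down the Azure VM",
]

RESTART_ORDER: List[str] = [
    "Start the Azure VM",
    "Start the AZSMGMT VM",
    "Start the cluster nodes",
    "Check cluster health",
]


def run_in_order(step_names: List[str], perform: Callable[[str], bool]) -> List[str]:
    """Run steps one at a time and stop at the first step that fails.

    `perform` is a hypothetical callback supplied by the caller; it should
    return True only once the named step has fully completed.
    Returns the names of the steps that completed.
    """
    completed: List[str] = []
    for name in step_names:
        if not perform(name):
            break  # never continue past a failed step
        completed.append(name)
    return completed
```

For example, `run_in_order(RESTART_ORDER, my_perform)` would not attempt to start the cluster nodes unless the AZSMGMT VM step had reported success, which mirrors the dependency described above.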

Aug 20 '24 21:08 dkirby-ms

@dkirby-ms Thanks for the analysis above. As I understand it, there is no way to recover the AKS cluster. Unfortunately, we had most of our services installed there, so please let us know if there is any way we can recover it. In parallel, we will also try installing our services on the new AKS cluster in the new HCIBox we created.

Aug 22 '24 10:08 ajaysubramanya86

Since it's a Kubernetes cluster, couldn't you just redeploy the services that were running on it via the original manifests?
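
As a rough sketch of that suggestion, assuming the original manifests were kept as `.yaml`/`.yml` files in one directory tree, the helper below builds one `kubectl apply` command per manifest for a recreated cluster. The function name, the directory layout, and the context name are hypothetical; the helper only builds the commands and does not run them.

```python
from pathlib import Path
from typing import List


def build_apply_commands(manifest_dir: str, context: str) -> List[List[str]]:
    """Return one `kubectl apply` argv per manifest, in sorted path order.

    `manifest_dir` and `context` are assumptions for illustration: a folder
    of saved manifests and the kubeconfig context of the new cluster.
    """
    root = Path(manifest_dir)
    manifests = sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix in {".yaml", ".yml"}
    )
    return [
        ["kubectl", "--context", context, "apply", "-f", str(path)]
        for path in manifests
    ]
```

Each returned list can be passed to `subprocess.run` once the new cluster is reachable. Prefixing file names with numbers (for example `01-namespace.yaml`) is one simple way to control the order in which they are applied.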

Aug 22 '24 12:08 dkirby-ms

@dkirby-ms, is it OK to delete the broken cluster(s) from the portal and recreate a new one from there, or is PowerShell the recommended way? As soon as the main host is switched off, the cluster will break again, so I'm looking for the most efficient way to rebuild a cluster on the fly. Thank you again for looking into this.

Aug 22 '24 14:08 LucaOtto

I have deleted clusters from the portal and recreated them without issue before, so I think you can try that. It may be helpful to use a different cluster name just to avoid any potential naming conflicts.

Aug 22 '24 14:08 dkirby-ms

@dkirby-ms I agree, but we had quite a few dependent services such as SQLMI, PostGre, FluxCD, Kafka, Istio, and some more, which took a lot of time and effort to set up. We also wanted to investigate our pods, which were not starting up when this happened. If we recreate those clusters, all of the setup and testing effort has to be repeated, and if the cluster goes down again, it all has to be done yet again.

Aug 23 '24 06:08 ajaysubramanya86

Hi,

What is the resolution for this? Recreate the cluster every time?

Thanks

Luca Ottonari

Sep 30 '24 15:09 LucaOtto

That's correct. We don't have a "graceful shutdown" option for the HCIBox VM host right now, and until one is available we know of no other workaround to recover the AKS cluster.

Dale Kirby Principal Partner Solution Architect Microsoft Global Partner Solutions

Sep 30 '24 15:09 LucaOtto