aks-hybrid
[BUG] Error connecting to AKS-HCI service host
Describe the bug
This issue was reported by a partner, Dell Corp.
When I connected over RDP to one of the nodes in my Azure Stack HCI cluster and ran the Get-AksHciCluster command, I got an error saying that the established connection failed because the host failed to respond.
Probing further, an operator cannot access the management cluster using the kubeconfig-mgmt. Commands fail with an error like: "Unable to connect to the server: dial tcp 172.168.10.0:6443", where 172.168.10.0 is the IP of the control plane.
Certain PowerShell commands that use the kubeconfig-mgmt fail with a similar error.
Additional context
The kube-vip pod that advertises the control plane IP may be down. The pod will restart, and the k8s API server may be available intermittently until the pod crashes again.
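To tell a hard outage apart from the intermittent availability described above, it can help to probe the control plane endpoint several times in a row. The following is a minimal sketch, not part of AKS-HCI tooling: the function names `api_server_reachable` and `poll_api_server` are hypothetical, and the check only confirms that a TCP connection to port 6443 can be opened. It says nothing about whether the Kubernetes API itself is healthy.

```python
import socket
import time


def api_server_reachable(host: str, port: int = 6443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper: this only tests TCP reachability of the endpoint.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


def poll_api_server(host: str, port: int = 6443, attempts: int = 5,
                    interval: float = 2.0, timeout: float = 3.0) -> list:
    """Probe the endpoint `attempts` times and return the results, oldest first."""
    results = []
    for i in range(attempts):
        results.append(api_server_reachable(host, port, timeout))
        if i < attempts - 1:
            time.sleep(interval)  # wait between probes, but not after the last one
    return results
```

For example, `poll_api_server("172.168.10.0")` would probe the control plane IP from the error message. A result that mixes `True` and `False` would be consistent with a pod that keeps restarting, while all `False` points to an endpoint that is down or unreachable from that node.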
I had the same issue some weeks ago. I noticed out-of-memory errors on the management cluster VM.
Same root cause in your case?
The mgmt VM gets 8 GB by default, but Hyper-V shows a memory demand of 22 GB. I set the mgmt VM in Hyper-V to 32 GB and everything has worked fine for the last 3 weeks. I've reported the issue to the AKS-HCI PMs; maybe there is already an open bug.
@Elektronenvolt what version of AksHci are you on?
The management cluster should not need that much memory.
@zawachte-msft I first saw the OOM issue with the 06/2021 release, two weeks after the initial setup. Right now I'm running the latest July release.
No OOM issues so far, but I set the mgmt cluster VM to 32 GB after the initial setup.
@Elektronenvolt - Since Aug 2021, we've had multiple new releases of AKS-HCI. Can you please try with the latest version and let us know if you still hit this issue?
@abhilashaagarwala - I haven't seen the issue with any 2022 release, but I wasn't watching for OOM errors. I'll verify it with the September 2022 release and let you know.