Microk8s ha-cluster Add-on Causing Request Timeouts
Been running microk8s on two Raspberry Pis (Raspberry Pi 4) for over a year and it has generally been running great. However, I have recently started experiencing random timeouts when making HTTP requests to services within the cluster.
Thinking there might be an issue with the ingress controller, I decided to configure a NodePort service for one of the deployments to see if the timeouts also occur when calling the NodePort service directly. It turns out that making requests to the NodePort service directly also results in random timeouts...
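For illustration, the NodePort was created with something like this (the deployment name `my-app` here is just a placeholder for one of my actual deployments):

```sh
# Expose an existing deployment via a NodePort (assigned from the default 30000-32767 range)
microk8s kubectl expose deployment my-app --type=NodePort --port=8080
# Show the service and the nodePort that was assigned
microk8s kubectl get svc my-app
```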
One thing I noticed has changed recently is that the cluster is now using Calico for networking (before it wasn't; it was using flannel, I think), and when I run `microk8s status` I see that the `ha-cluster` add-on is enabled. I always thought this add-on was disabled by default, as it is only useful when running a cluster with at least three nodes?
So, thinking that the `ha-cluster` add-on or Calico might be what was causing the problem, I decided to run `microk8s disable ha-cluster`.
After doing this my two-node cluster became separated and the two nodes are no longer aware of each other. Also, the master node now shows no resources at all when running `kubectl get all --all-namespaces`. So effectively I now need to reconstruct everything and put the cluster back together again :o/
Anyway, now that my master node no longer has the `ha-cluster` add-on enabled, I redeployed a pod to the k8s cluster and recreated the NodePort service, and the HTTP requests are responding swiftly again with no more request timeouts.
I have reconstructed the whole cluster by rejoining the second node and redeploying all resources, and I can confirm that request timeouts are no longer occurring, i.e. with `ha-cluster` disabled.
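For the record, rejoining the second node was just the usual dance (the IP and token below are placeholders taken from the `add-node` output):

```sh
# On the master node: generate a join token / command
microk8s add-node
# On the second node: run the join command printed above, e.g.
microk8s join <master-ip>:25000/<token>
```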
So my questions are:
- What is the current status of ha-cluster?
- Why was the ha-cluster add-on enabled on my cluster? Can other add-ons enable the ha-cluster add-on? I ask because one add-on I added recently was metallb, so I'm wondering if it was the metallb add-on that enabled ha-cluster.
- Has anyone else experienced this kind of issue with ha-cluster add-on enabled?
- How can I avoid ha-cluster being enabled again in future?
Thanks in advance, Pm
Apologies for the experience you are having.
> What is the current status of ha-cluster?

`ha-cluster` has been the default since v1.19 (I think); prior to that MicroK8s used etcd + flannel.
> Why was the ha-cluster add-on enabled on my cluster?

I don't recall there being an explicit enabling of HA when a cluster was not in HA mode prior to an update. I know there's been a lot of care taken to maintain the current configuration of the system when it is being upgraded.
> Can other add-ons enable the ha-cluster add-on? I ask because one add-on I added recently was metallb, so I'm wondering if it was the metallb add-on that enabled ha-cluster.

Certainly no add-on will enable HA implicitly.

> Has anyone else experienced this kind of issue with ha-cluster add-on enabled?

> How can I avoid ha-cluster being enabled again in future?

Enabling HA is an explicit action.
Btw, may I know which MicroK8s channel you are on? If you are using `stable` and, as far as I can tell, you are running a long-lasting cluster, the recommendation is to use a specific channel, for example `1.23/stable`. Using `stable` will implicitly upgrade your cluster to the latest released version of Kubernetes, and these versions may have breaking changes.
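Something along these lines would pin an existing install to a specific track (the channel name is just an example):

```sh
# Check which channel the snap is currently tracking
snap info microk8s | grep tracking
# Pin to a specific Kubernetes track instead of plain "stable"
sudo snap refresh microk8s --channel=1.23/stable
```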
@neoaggelos or @ktsakalozos, feel free to add anything else I missed.
Thank you for reporting the issue, and sorry for the less-than-great experience with Calico.
What operating system are you using? I think for Calico you might also need to install the `linux-modules-extra-$(uname -r)` package.
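On Ubuntu that would be something along these lines:

```sh
# Some kernel modules Calico relies on ship in the "extra" modules package
sudo apt-get update
sudo apt-get install linux-modules-extra-$(uname -r)
```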
Also, are there any warnings in the output of `microk8s inspect`?
@balchua, when the issue started happening I was running either `1.21/stable` or `1.22/stable`. I then upgraded to `1.23/stable` hoping it would help and restarted the nodes, but the random timeouts would still occur.
@neoaggelos, when the timeouts started occurring, I ran `microk8s inspect` but no obvious warnings or errors showed up. The only thing that stood out was that `ha-cluster` was enabled when running `microk8s status`. The OS on the Raspberry Pis is: `Linux pi-k8s-00 5.4.0-1059-raspi #67-Ubuntu SMP PREEMPT Mon Apr 11 14:16:01 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux`
To offer more detail, when the issue started I began by unpicking various things one at a time:
- Disabled metallb - timeouts would still occur.
- Added NodePort services pointing directly at a couple of services (i.e. to bypass the Ingress Controller and therefore rule out an NGINX misconfiguration as a possible problem) - timeouts would still occur.
- Disabled ha-cluster - this caused all cluster resources to disappear and the worker node to leave the cluster (not sure why all this happened). At this point I was forced to rebuild the cluster and redeploy all resources.
- Once the cluster was up and running again (this time with `ha-cluster` not enabled, i.e. flannel in use instead), timeouts no longer occurred.
When the timeouts were occurring, I deployed a simple test pod consisting of only a `CMD python -m http.server 8080` service and a 1.3MB moon image, just to rule out software bugs in the full-fledged services, but this simple test pod was also periodically timing out.
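Roughly, that test pod can be reproduced with something like the following (image tag and names are illustrative; my real image also baked in the moon.jpg):

```sh
# Run a bare python http.server pod and expose it via NodePort
microk8s kubectl run http-test --image=python:3.11-slim --restart=Never --port=8080 \
  -- python -m http.server 8080
microk8s kubectl expose pod http-test --type=NodePort --port=8080
```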
The test (run on the microk8s master node):
`wget http://<node-ip>:<node-port>/moon.jpg --read-timeout=5`
This would basically fail (time out) between 0 and 4 times, and on the last attempt it would successfully fetch the image rapidly (in 0.2s or less).
Extending the read timeout to e.g. 2 minutes (`--read-timeout=120`), it would again fail between 0 and 4 times and on the last attempt fetch the image rapidly (in 0.2s or less).
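A quick loop like the following (node IP and port are placeholders) makes the pattern easy to see:

```sh
# Retry the fetch several times; typically 0-4 attempts time out, then one succeeds quickly
for i in $(seq 1 10); do
  if wget -q -O /dev/null --read-timeout=5 --tries=1 "http://<node-ip>:<node-port>/moon.jpg"; then
    echo "attempt $i: ok"
  else
    echo "attempt $i: timed out"
  fi
done
```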
So it seemed to me like a connection issue more than a bandwidth issue, i.e. once it could connect, it would service the request promptly and successfully.
At all times the two node hosts (Raspberry Pi 4s) were showing normal, healthy CPU/memory/disk/network usage.
Thanks again, PM
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.