No datastore master nodes but cluster is still working
Summary
We are running a multi-node microk8s cluster with three datastore nodes and multiple worker nodes. Yesterday I noticed that microk8s status reports no datastore nodes (neither master nor standby). We don't know when or how this started. We also don't understand why the cluster is still working, or whether the data available to each kubelite API server is consistent.
I checked the code that generates this output; it makes the following call:
# /snap/microk8s/current/bin/dqlite -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key -f json k8s .cluster
Error: no available dqlite leader server found
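To narrow this down, each member can also be asked directly for a leader with the same binary and certificates, using .leader instead of .cluster. A minimal sketch; the addresses are the three current nodes from the cluster.yaml shown below:
# ask every current member who it thinks the leader is
for addr in 10.10.5.11:19001 10.10.5.12:19001 10.10.5.13:19001; do
  echo "== $addr"
  /snap/microk8s/current/bin/dqlite -s "$addr" \
    -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt \
    -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key \
    k8s .leader
done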
/var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml contains the following data:
- ID: 3297041220608546238
  Address: 10.5.0.4:19001
  Role: 2
- ID: 10360074122386368580
  Address: 10.5.0.5:19001
  Role: 2
- ID: 442776466333799327
  Address: 10.5.0.10:19001
  Role: 2
- ID: 6363631146440140535
  Address: 10.5.0.11:19001
  Role: 2
- ID: 6048809877879767952
  Address: 10.10.5.11:19001
  Role: 0
- ID: 2169217021260995856
  Address: 10.10.5.12:19001
  Role: 0
- ID: 11804631294055090830
  Address: 10.10.5.13:19001
  Role: 0
The first four entries are no longer valid because we migrated the cluster last year. After the migration, microk8s status listed the new nodes as datastore nodes and everything looked fine.
The docs at https://microk8s.io/docs/restore-quorum don't seem to apply to this situation; the reconfigure step exits with code 1 and no additional details.
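For reference, my reading of that guide is that cluster.yaml on the node we recover from would be trimmed down to just the three current datastore nodes, roughly like this (IDs and addresses copied from the file above; the voter role of 0 is an assumption on my part):
# trimmed member list, current nodes only
- ID: 6048809877879767952
  Address: 10.10.5.11:19001
  Role: 0
- ID: 2169217021260995856
  Address: 10.10.5.12:19001
  Role: 0
- ID: 11804631294055090830
  Address: 10.10.5.13:19001
  Role: 0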
What Should Happen Instead?
This should obviously not happen. All three dqlite instances are up and running; apart from a few server reboots they have always been available.
I'm not quite sure how severe this desync is. If the dqlite instances are out of sync, the Kubernetes API should probably prevent writes to the datastore.
Reproduction Steps
I don't know what triggered this issue.
Inspection Report
inspection-report-20250402_072540.tar.gz
Can you suggest a fix?
Are you interested in contributing with a fix?
Sure, if I can. However, I have no idea how to identify the underlying issue.
Hi @jnugh,
Thanks for filing your issue.
First of all, if you have not already done so, please create a backup of your database:
tar -cvf backup.tar /var/snap/microk8s/current/var/kubernetes/backend
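If you take the backup on every datastore node, something along these lines keeps the archives distinguishable (the file name is only a suggestion):
# run on each datastore node
tar -cvf backup-$(hostname)-$(date +%F).tar /var/snap/microk8s/current/var/kubernetes/backend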
When you removed the old nodes, did you use the microk8s remove-node command?
Could you please try running the remove-node command again, as described in our documentation at https://microk8s.io/docs/command-reference#heading--microk8s-remove-node?
Once those old nodes are properly removed, Dqlite should be able to restore quorum and elect a cluster leader. Once a leader is elected, Dqlite will be able to serve requests from the Kubernetes API server again.
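If any of the old machines are gone for good and unreachable, remove-node should also accept a --force flag for offline nodes (please double-check against the command reference linked above); roughly:
# the argument is the stale node as it was known to the cluster
microk8s remove-node 10.5.0.4 --force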
Best regards, Louise
Thanks @louiseschmidtgen,
Yeah, we've been creating regular backups on every machine since we noticed :D.
To be honest, I'm not sure whether we used remove-node back then or just removed the nodes from Kubernetes. Running it now fails because the Node object no longer exists:
# microk8s remove-node 10.5.0.4
Error from server (NotFound): nodes "10.5.0.4" not found
Node 10.5.0.4 does not exist in Kubernetes.
I just created a new dummy node resource in Kubernetes to continue testing:
apiVersion: v1
kind: Node
metadata:
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: oldcp01.k8s.staging.medicalvalues.dev
    kubernetes.io/os: linux
    microk8s.io/cluster: "true"
    node.kubernetes.io/microk8s-controlplane: microk8s-controlplane
  name: oldcp01.k8s.staging.medicalvalues.dev
status:
  addresses:
  - address: 10.5.0.4
    type: InternalIP
  - address: oldcp01.k8s.staging.medicalvalues.dev
    type: Hostname
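I applied it with plain kubectl (the file name is just what I happened to save the manifest as):
# hypothetical file name for the Node manifest above
microk8s kubectl apply -f oldcp01-dummy-node.yaml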
But when I run the command I get:
# microk8s remove-node 10.5.0.4
Error: no available dqlite leader server found
Usage:
dqlite -s <servers> <database> [command] [flags]
Flags:
-c, --cert string public TLS cert
-f, --format string output format (tabular, json) (default "tabular")
-h, --help help for dqlite
-k, --key string private TLS key
-s, --servers strings comma-separated list of db servers, or file://<store>
--timeout uint timeout of each request (msec) (default 2000)
Node oldcp01.k8s.staging.medicalvalues.dev does not exist in Kubernetes.
I actually have a very similar problem: I have a 3-node cluster (all 3 are datastore nodes), and 2 of the 3 nodes report that there is no datastore, while the last one correctly reports the IPs of all 3 datastore nodes.
I don't know when it started either, and I made no recent changes to the cluster. The cluster seems to work correctly when deploying new stuff on it. I'm wondering whether applying the restore-quorum how-to, taking the dqlite DB from the node that behaves correctly, would solve the issue.
In any case, I think it's worth trying; my nodes are Proxmox virtual machines, so I can take a full snapshot of the VMs before trying to fix it (or break it further).
Edit: One more piece of info: I rebooted all 3 nodes, one at a time, waiting for everything to come back up before moving on to the next node. I still have 2 of the 3 nodes reporting no datastore and 1 reporting everything correctly, but it's not the same node before and after the full restart of the cluster...
I am seeing the same issue on 1.32.3 (1.32/stable). The issue seems to fix itself immediately when I simply roll the nodes back to 1.31.7 (1.31/stable) using the following:
sudo snap refresh microk8s --channel=1.31/stable
sudo snap refresh microk8s --hold
So this issue was possibly introduced somewhere in the 1.32 series.
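If I want to retest a newer channel later, releasing the hold should look roughly like this (channel name is just an example):
# lift the hold, then move to the channel you want to test
sudo snap refresh microk8s --unhold
sudo snap refresh microk8s --channel=1.32/stable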
I'm running into what seems to be the same issue... I became aware of it when I needed to remove a failed node from a 4-node cluster (its SSD failed, so there was no way to run microk8s leave from that node first). It seems I have no dqlite leader (perhaps the failed node WAS the leader at the time, and no re-election has happened?):
root@amy:~# /snap/microk8s/current/bin/dqlite -s 127.0.0.1:19001 -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key k8s ".leader"
Error: no available dqlite leader server found
Usage:
dqlite -s <servers> <database> [command] [flags]
Flags:
-c, --cert string public TLS cert
-f, --format string output format (tabular, json) (default "tabular")
-h, --help help for dqlite
-k, --key string private TLS key
-s, --servers strings comma-separated list of db servers, or file://<store>
--timeout uint timeout of each request (msec) (default 2000)
microk8s status will toggle between reporting the nodes and HA status correctly (the standby node is the failed node):
root@amy:~# microk8s status
microk8s is running
high-availability: yes
datastore master nodes: 10.97.45.4:19001 10.97.45.5:19001 10.97.45.6:19001
datastore standby nodes: 10.97.45.3:19001
...and incorrectly (it hangs for 40 seconds when this happens):
root@amy:~# time microk8s status
microk8s is running
high-availability: no
datastore master nodes: none
datastore standby nodes: none
...<snip>...
real 0m40.806s
user 0m1.439s
sys 0m0.220s
According to my understanding, the issue here isn't a lack of quorum, but a lack of leadership? By all appearances, the k8s aspects are operating correctly.
OK, for my issue: while I was nervous about running through the restore-quorum steps (https://microk8s.io/docs/restore-quorum), especially as others reported them ineffective in this particular issue, I took the leap and did so, and I'm happy to report that I came out the other end with a once-again-functional microk8s cluster. I was able to remove my wayward/dead node without further issues afterwards and move on with my life lol.
A quick post to say that the "restore from lost quorum" how-to (see the link in @Ziris85's post) also allowed me to fix the issue I described above, and to remove the rogue "datastore standby node" I had.
So, everything's back to normal on my side too.
I have confirmed that the issue still existed when upgrading from 1.31 to 1.32, and also when I tried upgrading from 1.31 to 1.33 (after reverting to 1.31 and running that version for a couple of months). However, I then tried the steps linked by @Ziris85, and they fixed the issue, even after upgrading and downgrading back and forth between 1.31, 1.32, and 1.33.
It turned out I also had invalid entries in my cluster.yaml file, left behind from a past migration. So for anyone else who stumbles in here from a Google search like me: Check your cluster.yaml carefully and try the how-to linked by @Ziris85 above.
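A quick way to spot such leftovers is to compare the dqlite member list with the nodes Kubernetes actually knows about (paths assume the default snap layout):
# list dqlite members, then the nodes the API server knows about
cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml
microk8s kubectl get nodes -o wide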
We just noticed that the problem fixed itself somehow. We have no idea when this happened; probably a minor update contained a fix for this issue. We were still on 1.32.x because we didn't want to risk moving to a newer release in this state. I will close this issue as it has been fixed for us. Maybe it is resolved for other setups as well now.