No datastore master nodes but cluster is still working
Summary
We are running a multi-node microk8s cluster with three datastore nodes and multiple worker nodes. Yesterday I noticed that microk8s status reports no datastore nodes (neither master nor standby). We don't know when or how this started. We also don't understand why the cluster is still working, or whether the data available to each kubelite API server is consistent.
I checked the code that generates this output; it makes the following call:
# /snap/microk8s/current/bin/dqlite -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key -f json k8s .cluster
Error: no available dqlite leader server found
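To narrow this down, each member can also be asked directly for a leader with the same binary and certificates, using .leader instead of .cluster. A minimal sketch; the addresses are the three current nodes from the cluster.yaml shown below:
# ask every current member who it thinks the leader is
for addr in 10.10.5.11:19001 10.10.5.12:19001 10.10.5.13:19001; do
  echo "== $addr"
  /snap/microk8s/current/bin/dqlite -s "$addr" \
    -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt \
    -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key \
    k8s .leader
done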
/var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml contains the following data:
- ID: 3297041220608546238
  Address: 10.5.0.4:19001
  Role: 2
- ID: 10360074122386368580
  Address: 10.5.0.5:19001
  Role: 2
- ID: 442776466333799327
  Address: 10.5.0.10:19001
  Role: 2
- ID: 6363631146440140535
  Address: 10.5.0.11:19001
  Role: 2
- ID: 6048809877879767952
  Address: 10.10.5.11:19001
  Role: 0
- ID: 2169217021260995856
  Address: 10.10.5.12:19001
  Role: 0
- ID: 11804631294055090830
  Address: 10.10.5.13:19001
  Role: 0
The first four entries are no longer valid because we migrated the cluster last year. After the migration, microk8s status listed the new nodes as datastore nodes and everything looked fine.
The docs at https://microk8s.io/docs/restore-quorum don't seem to apply to this situation; the reconfigure step exits with code 1 and no additional details.
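For reference, my reading of that guide is that cluster.yaml on the node we recover from would be trimmed down to just the three current datastore nodes, roughly like this (IDs and addresses copied from the file above; the voter role of 0 is an assumption on my part):
# trimmed member list, current nodes only
- ID: 6048809877879767952
  Address: 10.10.5.11:19001
  Role: 0
- ID: 2169217021260995856
  Address: 10.10.5.12:19001
  Role: 0
- ID: 11804631294055090830
  Address: 10.10.5.13:19001
  Role: 0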
What Should Happen Instead?
This should obviously not happen. All three dqlite instances are up and running; apart from a few server reboots they have always been available.
I'm not quite sure how severe this desync is. If the dqlite instances are out of sync, the Kubernetes API should probably prevent writes to the datastore.
Reproduction Steps
I don't know what triggered this issue.
Inspection Report
inspection-report-20250402_072540.tar.gz
Can you suggest a fix?
Are you interested in contributing with a fix?
Sure, if I can. However, I have no idea how to identify the underlying issue.
Hi @jnugh,
Thanks for filing your issue.
First of all, if you have not already done so, please create a backup of your database:
tar -cvf backup.tar /var/snap/microk8s/current/var/kubernetes/backend
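If you take the backup on every datastore node, something along these lines keeps the archives distinguishable (the file name is only a suggestion):
# run on each datastore node
tar -cvf backup-$(hostname)-$(date +%F).tar /var/snap/microk8s/current/var/kubernetes/backend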
When you removed the old nodes, did you use the microk8s remove-node command?
Could you please try running the remove-node command again, as described in our documentation at https://microk8s.io/docs/command-reference#heading--microk8s-remove-node?
Once those old nodes are properly removed, Dqlite should be able to restore quorum and elect a cluster leader. Once a leader is elected, Dqlite will be able to serve requests from the Kubernetes API server again.
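If any of the old machines are gone for good and unreachable, remove-node should also accept a --force flag for offline nodes (please double-check against the command reference linked above); roughly:
# the argument is the stale node as it was known to the cluster
microk8s remove-node 10.5.0.4 --force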
Best regards, Louise
Thanks @louiseschmidtgen,
Yeah, we've been creating regular backups on every machine since we noticed :D.
To be honest, I'm not sure whether we used remove-node back then or just removed the nodes from Kubernetes. Running it now fails because the Node object no longer exists:
# microk8s remove-node 10.5.0.4
Error from server (NotFound): nodes "10.5.0.4" not found
Node 10.5.0.4 does not exist in Kubernetes.
I just created a new dummy node resource in Kubernetes to continue testing:
apiVersion: v1
kind: Node
metadata:
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: oldcp01.k8s.staging.medicalvalues.dev
    kubernetes.io/os: linux
    microk8s.io/cluster: "true"
    node.kubernetes.io/microk8s-controlplane: microk8s-controlplane
  name: oldcp01.k8s.staging.medicalvalues.dev
status:
  addresses:
  - address: 10.5.0.4
    type: InternalIP
  - address: oldcp01.k8s.staging.medicalvalues.dev
    type: Hostname
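I applied it with plain kubectl (the file name is just what I happened to save the manifest as):
# hypothetical file name for the Node manifest above
microk8s kubectl apply -f oldcp01-dummy-node.yaml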
But when I run the command I get:
# microk8s remove-node 10.5.0.4
Error: no available dqlite leader server found
Usage:
dqlite -s <servers> <database> [command] [flags]
Flags:
-c, --cert string public TLS cert
-f, --format string output format (tabular, json) (default "tabular")
-h, --help help for dqlite
-k, --key string private TLS key
-s, --servers strings comma-separated list of db servers, or file://<store>
--timeout uint timeout of each request (msec) (default 2000)
Node oldcp01.k8s.staging.medicalvalues.dev does not exist in Kubernetes.
I actually have a very similar problem: I have a 3-node cluster (all 3 are datastore nodes), and 2 of the 3 nodes report that there is no datastore, while the last one correctly reports the IPs of all 3 datastore nodes.
I don't know when it started either, and I made no recent changes to the cluster. The cluster seems to work correctly when deploying new stuff on it. I'm wondering whether applying the restore-quorum how-to, taking the dqlite DB from the node that behaves correctly, would solve the issue.
In any case, I think it's worth trying; my nodes are Proxmox virtual machines, so I can take a full snapshot of the VMs before trying to fix it (or break it further).
Edit: One more piece of info: I rebooted all 3 nodes, one at a time, waiting for everything to come back up before moving on to the next node. I still have 2 of the 3 nodes reporting no datastore and 1 reporting everything correctly, but it's not the same node before and after the full restart of the cluster...
I am seeing the same issue on 1.32.3 (1.32/stable). The issue seems to fix itself immediately when I simply roll the nodes back to 1.31.7 (1.31/stable) using the following:
sudo snap refresh microk8s --channel=1.31/stable
sudo snap refresh microk8s --hold
So this issue was possibly introduced somewhere in the 1.32 series.
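If I want to retest a newer channel later, releasing the hold should look roughly like this (channel name is just an example):
# lift the hold, then move to the channel you want to test
sudo snap refresh microk8s --unhold
sudo snap refresh microk8s --channel=1.32/stable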
I'm running into what seems to be the same issue... I became aware of it when I needed to remove a failed node from a 4-node cluster (its SSD failed, so there was no way to run microk8s leave from that node first). It seems I have no dqlite leader (perhaps the failed node WAS the leader at the time, and no re-election has happened?):
root@amy:~# /snap/microk8s/current/bin/dqlite -s 127.0.0.1:19001 -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key k8s ".leader"
Error: no available dqlite leader server found
Usage:
dqlite -s <servers> <database> [command] [flags]
Flags:
-c, --cert string public TLS cert
-f, --format string output format (tabular, json) (default "tabular")
-h, --help help for dqlite
-k, --key string private TLS key
-s, --servers strings comma-separated list of db servers, or file://<store>
--timeout uint timeout of each request (msec) (default 2000)
microk8s status will toggle between reporting the nodes and HA status correctly (the standby node is the failed node):
root@amy:~# microk8s status
microk8s is running
high-availability: yes
datastore master nodes: 10.97.45.4:19001 10.97.45.5:19001 10.97.45.6:19001
datastore standby nodes: 10.97.45.3:19001
...and incorrectly (it hangs for 40 seconds when this happens):
root@amy:~# time microk8s status
microk8s is running
high-availability: no
datastore master nodes: none
datastore standby nodes: none
...<snip>...
real 0m40.806s
user 0m1.439s
sys 0m0.220s
According to my understanding, the issue here isn't a lack of quorum, but a lack of leadership? By all appearances, the k8s aspects are operating correctly.
OK, for my issue: while I was nervous about running through the restore-quorum steps (https://microk8s.io/docs/restore-quorum), especially as others reported them ineffective in this particular issue, I took the leap and did so, and I'm happy to report that I came out the other end with a once-again-functional microk8s cluster. I was able to remove my wayward/dead node without further issues afterwards and move on with my life lol.
A quick post to say that the "restore from lost quorum" how-to (see the link in @Ziris85's post) also allowed me to fix the issue I described above, and to remove the rogue "datastore standby node" I had.
So, everything's back to normal on my side too.
I have confirmed that the issue still existed when upgrading from 1.31 to 1.32, and also when I tried upgrading from 1.31 to 1.33 (after reverting to 1.31 and running that version for a couple of months). However, I then tried the steps linked by @Ziris85, and they fixed the issue, even after upgrading and downgrading back and forth between 1.31, 1.32, and 1.33.
It turned out I also had invalid entries in my cluster.yaml file, left behind from a past migration. So for anyone else who stumbles in here from a Google search like me: Check your cluster.yaml carefully and try the how-to linked by @Ziris85 above.
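A quick way to spot such leftovers is to compare the dqlite member list with the nodes Kubernetes actually knows about (paths assume the default snap layout):
# list dqlite members, then the nodes the API server knows about
cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml
microk8s kubectl get nodes -o wide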
We just noticed that the problem fixed itself somehow. We have no idea when this happened; probably a minor update contained a fix for this issue. We were still on 1.32.x because we didn't want to risk moving to a newer release in this state. I will close this issue as it has been fixed for us. Maybe it is resolved for other setups as well now.