no kind "Deployment" is registered for version "apps\x10v1" in scheme
I am suddenly having a very strange issue with my 4-node microk8s cluster. I am unable to list Deployments in the default namespace (and only that namespace). Other namespaces are fine, as is getting the details of a specific deployment.
This works
$ kubectl get deployment/nffc-worker
NAME READY UP-TO-DATE AVAILABLE AGE
nffc-worker 3/3 3 3 203d
As does this
$ kubectl get deployments -n userservices
NAME READY UP-TO-DATE AVAILABLE AGE
invbot-server 2/2 2 2 5d3h
However, as soon as I try to list all deployments in default, I get this error from kubectl
$ kubectl get deployments
Error from server: no kind "Deployment" is registered for version "apps\x10v1" in scheme "pkg/runtime/scheme.go:100"
The kubelite logs show this failure internally
Jun 05 11:22:47 node1 microk8s.daemon-kubelite[1059653]: W0605 11:22:47.834995 1059653 reflector.go:535] storage/cacher.go:/deployments: failed to list *apps.Deployment: no kind "Deployment" is registered for version "apps\x10v1" in scheme "pkg/runtime/scheme.go:100"
Obviously apps\x10v1 is wrong; it should be apps/v1, with \x10 being a stray byte 0x10 where the / separator belongs. But I cannot figure out where this corrupted value is coming from. How can I figure out which Deployment in my configuration has this corrupted value, and repair it?
Worse, because the system can no longer list all deployments, all resources in the default namespace now seem to be frozen. Things like kubectl rollout restart deployment don't finish their restart work, and even explicitly deleting a deployment doesn't remove its pods from the cluster.
I should note that:
- This affects ALL nodes in the cluster
- I have restarted microk8s on all nodes with sudo snap restart microk8s, but it did not fix anything
Ok, so I managed to isolate the corrupted deployment configuration. Somehow there is a corrupted protocol buffer in the dqlite database.
Isolate the corrupted deployment
On any of the nodes, run
sudo /snap/microk8s/current/bin/dqlite \
--cert /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt \
--key /var/snap/microk8s/current/var/kubernetes/backend/cluster.key \
--servers file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml \
k8s
Then in dqlite run
dqlite> select name from kine where name like '%deployments/default%';
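The names come back as kine keys, which should look something like /registry/deployments/default/<deployment-name>, so the actual deployment name is whatever sits after the last slash. For example:
/registry/deployments/default/search-worker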
I then copied the deployment names, dropped them into Sublime Text, and created a script with a bunch of lines that look like this:
echo "search" && microk8s kubectl get deployments/search-worker -o yaml | grep "apiVersion:"
This will error on the specific deployment that is causing the problem, and print apiVersion: apps/v1 for everything else.
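If you don't want to hand-build that script, a small loop does the same job. This is just a sketch that assumes you saved the bare deployment names, one per line, to a file called deployments.txt:
# deployments.txt: one deployment name per line, taken from the dqlite query above
while read -r name; do
  echo "== ${name}"
  microk8s kubectl get "deployments/${name}" -o yaml | grep "apiVersion:"
done < deployments.txt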
View the configuration
Back in dqlite, grab the BLOB data for that particular bad registry entry, and the BLOB data for a good record while you're at it. The value is a binary protocol buffer; the byte values below are shown in decimal.
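Pulling the value out looks something like this, where search-worker stands in for whichever deployment you isolated above (kine keeps a row per revision, so grab the newest one; hex() just makes the bytes printable):
dqlite> select id, hex(value) from kine where name like '%deployments/default/search-worker' order by id desc limit 1;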
The bad record's data starts with 107 56 115 0 10 21 10 7 97 112 112 115 16 118 49, the latter part of which reads as apps\x10v1.
The good record's data starts with 107 56 115 0 10 21 10 7 97 112 112 115 47 118 49, the latter part of which reads as apps/v1, which is what we want.
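As far as I can tell, those leading bytes are the standard Kubernetes protobuf storage envelope, and they break down like this:
107 56 115 0                the "k8s\0" magic prefix the apiserver puts on protobuf-encoded objects
10 21                       protobuf field 1 (the TypeMeta message), 21 bytes long
10 7                        TypeMeta field 1 (apiVersion), a 7-byte string
97 112 112 115 47 118 49    "apps/v1" in the good record; the bad one has 16 (0x10) where the 47 ('/') belongs
So it really is just a single corrupted byte inside the apiVersion string.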
There doesn't appear to be any other corruption in here, but even if there is, it's this first part of the protocol buffer that I need to fix. Then I can just delete and recreate the deployment through the API as expected.
Basically, I either need to patch that 16 with a 47 in the dqlite database, or find a way to remove that registry entry. However, I'm not sure how to do this in a way where the change will propagate to the other nodes like it's supposed to.
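In principle the in-place patch can be done from the dqlite shell, since SQLite's substr() works on BLOBs by byte offset and the 16 is the 13th byte in the dumps above. I have not tested this, so treat it as a sketch only (same stand-in deployment name as before):
dqlite> update kine set value = cast(substr(value, 1, 12) || x'2f' || substr(value, 14) as blob) where name like '%deployments/default/search-worker';
In the end I went the deletion route instead.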
Explicitly deleting that record in the dqlite database unstuck the deployment lifecycle across the entire cluster, and things are now back in working order.
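For anyone hitting the same thing, the delete itself boils down to this (stand-in deployment name again), after which the deployment can be recreated through the API:
dqlite> delete from kine where name like '%deployments/default/search-worker';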
However, someone from the microk8s team should look into this, since it feels very wrong to me that a corrupted protocol buffer should ever find its way into the dqlite database, especially when that corruption completely knocks out basic reliability/recovery functionality.
Basically, the root cause here seems to be the dqlite record being persisted with a resource type + version combination that does not exist in kubectl api-resources.
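For reference, the kind + apiVersion combinations the API server actually serves for the apps group can be listed with:
microk8s kubectl api-resources --api-group=apps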
Feels like the solution here is two-fold:
- the resource KIND + APIVERSION combination should be validated prior to persistence
- the apiserver should be updated to be more resilient to record corruption like this. Just because a single deployment record could not be read should not prevent things like list commands from succeeding.
I don't know if microk8s has its own apiserver implementation, or if this issue really belongs in the Kubernetes mainline, but a single corrupted byte in a single record in the dqlite database shouldn't have such an outsized effect on the platform.