"x509: certificate signed by unknown authority" on a fresh new kind cluster running alongside minikube --vm-driver=none

Open ensonic opened this issue 2 years ago • 10 comments

What happened:

$ curl -Lo ~/bin/kind "https://kind.sigs.k8s.io/dl/v0.11.1/kind-$(uname)-amd64"
$ chmod +x ~/bin/kind
$ kind create cluster
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.21.1) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind

Have a question, bug, or feature request? Let us know! https://kind.sigs.k8s.io/#community 🙂

$ kubectl --context=kind-kind get pods --all-namespaces 
NAMESPACE            NAME                                         READY   STATUS             RESTARTS   AGE
kube-system          coredns-558bd4d5db-b8qs8                     0/1     Running            0          17m
kube-system          coredns-558bd4d5db-tfn48                     0/1     Running            0          17m
kube-system          etcd-kind-control-plane                      1/1     Running            0          18m
kube-system          kindnet-bv86g                                1/1     Running            0          17m
kube-system          kube-apiserver-kind-control-plane            1/1     Running            0          18m
kube-system          kube-controller-manager-kind-control-plane   1/1     Running            0          18m
kube-system          kube-proxy-btkd2                             1/1     Running            0          17m
kube-system          kube-scheduler-kind-control-plane            1/1     Running            0          18m
local-path-storage   local-path-provisioner-547f784dff-57n2l      0/1     CrashLoopBackOff   8          17m

$ kubectl --context=kind-kind logs -n local-path-storage local-path-provisioner-547f784dff-57n2l 
time="2021-10-26T12:53:43Z" level=fatal msg="Error starting daemon: Cannot start Provisioner: failed to get Kubernetes server version: Get https://10.96.0.1:443/version?timeout=32s: x509: certificate signed by unknown authority" 

$ kubectl --context=kind-kind logs -n kube-system coredns-558bd4d5db-b8qs8
...
E1026 12:55:58.167556       1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get "https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
E1026 12:56:03.319519       1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
E1026 12:56:05.587305       1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"

I checked the issues already reported: https://github.com/kubernetes-sigs/kind/issues?q=is%3Aissue+%22x509%3A+certificate+signed+by+unknown+authority%22 but this seems different :/ Also FYI: I am running a bare-metal minikube on the same machine (vm-driver=none).

Environment:

$ kind version
kind v0.11.1 go1.16.4 linux/amd64
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"archive", BuildDate:"2021-06-13T07:08:18Z", GoVersion:"go1.15.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-21T23:01:33Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}

docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)

Server:
 Containers: 174
  Running: 73
  Paused: 0
  Stopped: 101
 Images: 40
 Server Version: 20.10.2
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc nvidia nvidia-experimental
 Default Runtime: nvidia
 Init Binary: docker-init
 containerd version: 269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.10.46-5rodete1-amd64
 Operating System: Debian GNU/Linux rodete
 OSType: linux
 Architecture: x86_64
 CPUs: 24
 Total Memory: 62.81GiB
 Name: ensonic.muc.corp.google.com
 ID: FE3G:TCOF:UPXI:K3OY:S7JL:ZTOK:H7Q4:YQS2:AKT4:IVWN:NUDI:54ZE
 Docker Root Dir: /usr/local/google/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Registry Mirrors:
  https://mirror.gcr.io/
 Live Restore Enabled: false

ensonic avatar Oct 26 '21 13:10 ensonic

two things I see in the logs (kind export logs):

# kind-control-plane/journal.log
Oct 26 13:16:18 kind-control-plane containerd[177]: time="2021-10-26T13:16:18.873978746Z" level=error msg="failed to load cni during init, please check CRI plugin status before setting up network for pods" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
...
Oct 26 13:16:25 kind-control-plane kubelet[269]: E1026 13:16:25.870383     269 certificate_manager.go:437] Failed while requesting a signed certificate from the master: cannot create certificate signing request: Post "https://kind-control-plane:6443/apis/certificates.k8s.io/v1/certificatesigningrequests": dial tcp [fc00:f853:ccd:e793::2]:6443: connect: connection refused
Oct 26 13:16:28 kind-control-plane kubelet[269]: E1026 13:16:28.023860     269 certificate_manager.go:437] Failed while requesting a signed certificate from the master: cannot create certificate signing request: Post "https://kind-control-plane:6443/apis/certificates.k8s.io/v1/certificatesigningrequests": dial tcp [fc00:f853:ccd:e793::2]:6443: connect: connection refused

Is the /etc/cni/net.d part of the node image? Asking since I also have the directory on the host (for the bare-metal minikube).

Also when I look at:

# kube-system/kube-scheduler-kind-control-plane:kube-scheduler
I1026 13:16:34.423368       1 serving.go:347] Generated self-signed cert in-memory                                                                                                                                                                                                      
W1026 13:16:40.724458       1 requestheader_controller.go:193] Unable to get configmap/extension-apiserver-authentication in kube-system.  Usually fixed by 'kubectl create rolebinding -n kube-system ROLEBINDING_NAME --role=extension-apiserver-authentication-reader --serviceaccou 
W1026 13:16:40.724501       1 authentication.go:337] Error looking up in-cluster authentication configuration: configmaps "extension-apiserver-authentication" is forbidden: User "system:kube-scheduler" cannot get resource "configmaps" in API group "" in the namespace "kube-syste 
W1026 13:16:40.724514       1 authentication.go:338] Continuing without authentication configuration. This may treat all requests as anonymous.                                                                                                                                         
W1026 13:16:40.724525       1 authentication.go:339] To require authentication configuration lookup to succeed, set --authentication-tolerate-lookup-failure=false                                                                                                                      
I1026 13:16:40.755914       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::client-ca-file                                                                                                                                      
I1026 13:16:40.755958       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file                                                                                                                         
I1026 13:16:40.756242       1 secure_serving.go:197] Serving securely on 127.0.0.1:10259                                                                                                                                                                                                
I1026 13:16:40.756320       1 tlsconfig.go:240] Starting DynamicServingCertificateController                                                                                                                                                                                            
E1026 13:16:40.757975       1 reflector.go:138] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps "extension-apiserver-authentication" is forbidden: User "system:kube-scheduler 
E1026 13:16:40.758489       1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.PersistentVolumeClaim: failed to list *v1.PersistentVolumeClaim: persistentvolumeclaims is forbidden: User "system:kube-scheduler" cannot list resource "persistentvolum 
E1026 13:16:40.758545       1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.ReplicationController: failed to list *v1.ReplicationController: replicationcontrollers is forbidden: User "system:kube-scheduler" cannot list resource "replicationcont 
E1026 13:16:40.758663       1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Service: failed to list *v1.Service: services is forbidden: User "system:kube-scheduler" cannot list resource "services" in API group "" at the cluster scope
...

Any idea about this?

ensonic avatar Oct 26 '21 14:10 ensonic

what is the output of docker ps on that host?

can you try to create a cluster with a different name: kind create cluster --name testkind ?

aojea avatar Oct 30 '21 16:10 aojea

Same if I name it testkind:

kubectl --context=kind-testkind get pods --all-namespaces 
NAMESPACE            NAME                                             READY   STATUS             RESTARTS   AGE
kube-system          coredns-558bd4d5db-hcrbt                         0/1     Running            0          4m34s
kube-system          coredns-558bd4d5db-w8g6m                         0/1     Running            0          4m34s
kube-system          etcd-testkind-control-plane                      1/1     Running            0          4m44s
kube-system          kindnet-dcn6g                                    1/1     Running            0          4m34s
kube-system          kube-apiserver-testkind-control-plane            1/1     Running            0          4m44s
kube-system          kube-controller-manager-testkind-control-plane   1/1     Running            0          4m52s
kube-system          kube-proxy-ptlvk                                 1/1     Running            0          4m34s
kube-system          kube-scheduler-testkind-control-plane            1/1     Running            0          4m51s
local-path-storage   local-path-provisioner-547f784dff-9tgsg          0/1     CrashLoopBackOff   5          4m34s
docker ps | grep kindest
5eb142bfad98   kindest/node:v1.21.1                               "/usr/local/bin/entr…"   About a minute ago   Up About a minute   127.0.0.1:37423->6443/tcp   kind-control-plane

since I have minikube running too, a full docker ps would list another 66 containers.

ensonic avatar Oct 30 '21 18:10 ensonic

I can't understand this honestly, unless DNS is messed up. Can you verify that the host name matches the IP of the node?
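
One way to run that check, as a rough sketch: compare the addresses Docker assigned to the node container with the address the node's own name resolves to from inside the node (the kubelet log above shows kind-control-plane resolving to fc00:f853:ccd:e793::2). The function names are made up for illustration, and this assumes getent is available in the node image.

```shell
#!/bin/sh
# Hypothetical helpers (not part of kind): check that a kind node's
# hostname resolves to an address Docker actually assigned to it.

# Print the IPv4/IPv6 addresses Docker assigned to the container $1.
node_ips() {
  docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}} {{.GlobalIPv6Address}} {{end}}' "$1"
}

# Resolve the node's own name from inside the node (first answer only).
# Assumption: getent exists in the node image.
node_resolved_ip() {
  docker exec "$1" getent hosts "$1" | awk 'NR==1 {print $1}'
}

# Return 0 if the resolved address is one of the Docker-assigned ones.
check_node_dns() {
  node="${1:-kind-control-plane}"
  ips="$(node_ips "$node")"
  resolved="$(node_resolved_ip "$node")"
  echo "docker-assigned:      $ips"
  echo "resolved inside node: $resolved"
  for ip in $ips; do
    [ "$ip" = "$resolved" ] && return 0
  done
  return 1
}

# Usage (needs a running kind node):
#   check_node_dns kind-control-plane || echo "name resolves to a foreign address"
```

A mismatch would point at name resolution reaching something other than the kind node; a match does not rule out other interference from the host.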

aojea avatar Oct 30 '21 18:10 aojea

Also FYI: I am running a bare-metal minikube on the same machine (vm-driver=none).

... is there a reason for this? I don't know what all bare-metal minikube does these days, but I would not be surprised if it's related.

Does it also fail in this environment if you clean up the bare metal minikube first?

I run kind clusters on rodete all the time (hi googler!) and at the moment rodete should be fine (in the past we've had fun with things like cgroupsv2 breaking the available docker version and preventing older k8s releases from working 🙃 )

certs are handled by kubeadm and aren't anything terribly special. I suspect the bare metal minikube networking is interfering here? Probably the lookup to the API server is conflicting with the bare metal cluster on that host.

Is the /etc/cni/net.d part of the node image? Asking since I also have the directory on the host (for the bare-metal minikube).

this is something that gets written out when the networking daemon (kindnetd) starts up, we run it as a daemonset (pretty typical for CNI implementations), it is read by containerd for creating pods that don't use hostnetwork (apiserver, the networking agent, kube-proxy all use host-network, and kubelet runs on the host).

BenTheElder avatar Nov 02 '21 23:11 BenTheElder

vm-driver=none since we use k8s on appliances (need to access hw).

ensonic avatar Nov 10 '21 16:11 ensonic

vm-driver=none since we use k8s on appliances (need to access hw).

er but why this and kind? (also you can use extraMounts config to pass hw vfs to kind nodes).
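
A minimal sketch of what that could look like, assuming a single control-plane node and a placeholder host path; /dev/example0 and the function name are hypothetical, and whether a plain bind mount is enough for a given piece of hardware is not something this thread confirms.

```shell
#!/bin/sh
# Hypothetical helper: write a kind config that bind-mounts one host
# path into the control-plane node via extraMounts.
#   $1: output file
#   $2: host path to expose (placeholder, e.g. /dev/example0)
write_kind_config() {
  cat > "$1" <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraMounts:
  - hostPath: $2
    containerPath: $2
EOF
}

# Usage (needs kind and docker):
#   write_kind_config kind-hw.yaml /dev/example0
#   kind create cluster --name testkind --config kind-hw.yaml
```

Pods would still need their own volume or device configuration to reach the path inside the node.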

I think the networking changes vm-driver=none is making are conflicting with kind, and I'm not sure this is super reasonable to debug and support?

I really don't recommend running vm-driver=none on your workstation.

BenTheElder avatar Nov 10 '21 18:11 BenTheElder

I would guess more specifically running minikube vm-driver=none is screwing up the DNS resolution and we are actually reaching the wrong api-server.

BenTheElder avatar Nov 10 '21 18:11 BenTheElder

We're using kind for tests and minikube for the developers.

ensonic avatar Apr 13 '22 09:04 ensonic

Given Kubernetes is removing dockershim support in 1.24, that approach is probably going to become its own headache for other reasons ...

Still inclined to suggest that vm-driver=none is causing the networking issues here. I doubt that kubeadm is actually generating bad certs, and docker is ordinarily responsible for the node names resolving correctly.

Can you pass the devices you need through to a containerized or VM-based cluster? vm-driver=none takes over the host environment in various ways, and running kubeadm init directly in a developer environment is generally avoided (e.g. the kubeadm team uses "kinder", an extension of this project).

BenTheElder avatar Apr 13 '22 14:04 BenTheElder