informational: k3s and nfs (or just slow?) storage
What happened?
When deploying a k3s vcluster into a host cluster that is running nfs as the default storage class (https://github.com/kubernetes-csi/csi-driver-nfs in my case), the vcluster will fail to come up.
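For reference, roughly the commands one can use to confirm the default storage class and pull the logs shown below; the namespace, pod, and container names here are just examples and may differ between vcluster versions:

kubectl get storageclass                               # check which class carries the (default) marker
kubectl get pods -n my-vcluster                        # the vcluster StatefulSet pod, e.g. my-vcluster-0
kubectl logs -n my-vcluster my-vcluster-0 -c vcluster  # k3s logs (the kine/etcd-client warnings below)
kubectl logs -n my-vcluster my-vcluster-0 -c syncer    # syncer logs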
Inspecting the vcluster logs shows many messages similar to:
{"level":"warn","ts":"2022-08-04T15:53:53.418Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc003a58380/kine.sock","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
time="2022-08-04T15:53:53Z" level=info msg="Tunnel server egress proxy waiting for runtime core to become available"
{"level":"warn","ts":"2022-08-04T15:53:56.419Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc003a58380/kine.sock","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
time="2022-08-04T15:53:58Z" level=info msg="Tunnel server egress proxy waiting for runtime core to become available"
{"level":"warn","ts":"2022-08-04T15:53:59.419Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc003a58380/kine.sock","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2022-08-04T15:54:02.419Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc003a58380/kine.sock","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
time="2022-08-04T15:54:03Z" level=info msg="Tunnel server egress proxy waiting for runtime core to become available"
{"level":"warn","ts":"2022-08-04T15:54:05.419Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc003a58380/kine.sock","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2022-08-04T15:54:08.419Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc003a58380/kine.sock","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
time="2022-08-04T15:54:08Z" level=info msg="Tunnel server egress proxy waiting for runtime core to become available"
{"level":"warn","ts":"2022-08-04T15:54:11.419Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc003a58380/kine.sock","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
time="2022-08-04T15:54:13Z" level=info msg="Tunnel server egress proxy waiting for runtime core to become available"
{"level":"warn","ts":"2022-08-04T15:54:14.422Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc003a58380/kine.sock","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2022-08-04T15:54:17.419Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc003a58380/kine.sock","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
time="2022-08-04T15:54:18Z" level=info msg="Tunnel server egress proxy waiting for runtime core to become available"
Note that the above logs may be benign (see here, specifically this comment); however, they consistently show up when encountering this issue with k3s in vcluster (plus nfs).
And:
E0804 15:54:24.513119 7 autoregister_controller.go:195] v1.apps failed with : Timeout: request did not complete within requested timeout - context deadline exceeded
E0804 15:54:24.513138 7 finisher.go:175] FinishRequest: post-timeout activity - time-elapsed: 3.599µs, panicked: false, err: context deadline exceeded, panic-reason: <nil>
E0804 15:54:24.513231 7 autoregister_controller.go:195] v1.authentication.k8s.io failed with : Timeout: request did not complete within requested timeout - context deadline exceeded
E0804 15:54:24.513260 7 autoregister_controller.go:195] v1.apiextensions.k8s.io failed with : Timeout: request did not complete within requested timeout - context deadline exceeded
E0804 15:54:24.513296 7 autoregister_controller.go:195] v1.admissionregistration.k8s.io failed with : Timeout: request did not complete within requested timeout - context deadline exceeded
E0804 15:54:24.515170 7 autoregister_controller.go:195] v1. failed with : Timeout: request did not complete within requested timeout - context deadline exceeded
Replacing k3s with k0s or k8s avoids this issue. If you want to use k3s, though, you have a few options:
- disable storage persistence. You can do this with a values file with contents like:
  storage:
    persistence: false
  This will of course cause vcluster to lose its data if the pod dies (see here for more info), but it is a good option for testing/dev. See the usage sketch after this list for how to apply the values file.
- change the k3s datastore endpoint to something other than sqlite. etcd has seemed to work reliably even when encountering the above errors with the default datastore endpoint. Again, this can be done with a values file like:
  vcluster:
    extraArgs:
      - --datastore-endpoint=etcd
- change your default storage class. There is at least anecdotal evidence that longhorn works, and presumably anything other than nfs would too (see the kubectl sketch after this list).
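For reference, a minimal sketch of how either values file above can be applied when (re)creating the vcluster; the release/namespace names are placeholders, and the exact values flag may differ slightly between vcluster CLI versions:

# write one of the values files above to values.yaml, then:
vcluster create my-vcluster -n my-vcluster -f values.yaml

# or equivalently via the Helm chart:
helm upgrade --install my-vcluster vcluster \
  --repo https://charts.loft.sh \
  --namespace my-vcluster --create-namespace \
  -f values.yaml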
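And a sketch of swapping the default storage class, assuming the nfs class is called nfs-csi and the replacement is longhorn (adjust the names to your cluster); the storageclass.kubernetes.io/is-default-class annotation is what marks a class as the default:

kubectl get storageclass   # see which class is currently (default)
kubectl patch storageclass nfs-csi -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
kubectl patch storageclass longhorn -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'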
Related to #45 and an issue opened in the k3s repo here.
What did you expect to happen?
vcluster to start and operate successfully
How can we reproduce it (as minimally and precisely as possible)?
Deploy a vcluster on a host cluster with nfs as the default storage class.
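A minimal reproduction sketch, assuming csi-driver-nfs is installed per its README and an existing NFS export is reachable; the server/share values and resource names are placeholders:

# install csi-driver-nfs (per its README)
helm repo add csi-driver-nfs https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts
helm install csi-driver-nfs csi-driver-nfs/csi-driver-nfs --namespace kube-system

# create an nfs-backed StorageClass and mark it as the cluster default
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-csi
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: nfs.csi.k8s.io
parameters:
  server: nfs-server.example.com   # placeholder NFS server
  share: /exports                  # placeholder export path
EOF

# create a default (k3s) vcluster and watch the k3s container logs
vcluster create test -n vcluster-test
kubectl logs -n vcluster-test test-0 -c vcluster -f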
Anything else we need to know?
Note that the k3s cluster (on nfs, with the default storage backend) does sometimes eventually boot successfully, presumably after all the controllers get registered. I probably wouldn't trust that vcluster in that state though 😁
This can probably be closed straight away; I'm just documenting it so it's hopefully easily searchable for others in the future!
Host cluster Kubernetes version
clientVersion:
  buildDate: "2022-05-24T12:26:19Z"
  compiler: gc
  gitCommit: 3ddd0f45aa91e2f30c70734b175631bec5b5825a
  gitTreeState: clean
  gitVersion: v1.24.1
  goVersion: go1.18.2
  major: "1"
  minor: "24"
  platform: darwin/amd64
kustomizeVersion: v4.5.4
serverVersion:
  buildDate: "2022-06-15T14:15:38Z"
  compiler: gc
  gitCommit: f66044f4361b9f1f96f0053dd46cb7dce5e990a8
  gitTreeState: clean
  gitVersion: v1.24.2
  goVersion: go1.18.3
  major: "1"
  minor: "24"
  platform: linux/amd64
Host cluster Kubernetes distribution
vanilla
vcluster version
tested on 0.10.0->0.11.0
Vcluster Kubernetes distribution (k3s (default), k8s, k0s)
k3s (see above though)
OS and Arch
OS: Ubuntu Jammy
Arch: x86