
informational: k3s and nfs (or just slow?) storage

Open · carlmontanari opened this issue on Aug 04, 2022 · 0 comments

What happened?

When deploying a k3s vcluster into a host cluster that uses NFS as the default storage class (https://github.com/kubernetes-csi/csi-driver-nfs in my case), the vcluster fails to come up.
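(For reference, a quick way to confirm what the host cluster treats as its default StorageClass, and which provisioner backs it, is sketched below; the class name nfs-csi is just a placeholder for whatever your cluster reports.)

# list StorageClasses; the default one is marked "(default)"
kubectl get storageclass

# inspect the default class to see which provisioner backs it
# (replace "nfs-csi" with the class name your cluster actually reports)
kubectl get storageclass nfs-csi -o yaml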

Inspecting the vcluster logs shows many messages similar to:

{"level":"warn","ts":"2022-08-04T15:53:53.418Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc003a58380/kine.sock","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
time="2022-08-04T15:53:53Z" level=info msg="Tunnel server egress proxy waiting for runtime core to become available"
{"level":"warn","ts":"2022-08-04T15:53:56.419Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc003a58380/kine.sock","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
time="2022-08-04T15:53:58Z" level=info msg="Tunnel server egress proxy waiting for runtime core to become available"
{"level":"warn","ts":"2022-08-04T15:53:59.419Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc003a58380/kine.sock","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2022-08-04T15:54:02.419Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc003a58380/kine.sock","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
time="2022-08-04T15:54:03Z" level=info msg="Tunnel server egress proxy waiting for runtime core to become available"
{"level":"warn","ts":"2022-08-04T15:54:05.419Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc003a58380/kine.sock","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2022-08-04T15:54:08.419Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc003a58380/kine.sock","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
time="2022-08-04T15:54:08Z" level=info msg="Tunnel server egress proxy waiting for runtime core to become available"
{"level":"warn","ts":"2022-08-04T15:54:11.419Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc003a58380/kine.sock","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
time="2022-08-04T15:54:13Z" level=info msg="Tunnel server egress proxy waiting for runtime core to become available"
{"level":"warn","ts":"2022-08-04T15:54:14.422Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc003a58380/kine.sock","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2022-08-04T15:54:17.419Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc003a58380/kine.sock","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
time="2022-08-04T15:54:18Z" level=info msg="Tunnel server egress proxy waiting for runtime core to become available"

Note that the above logs may be benign (see here, specifically this comment); however, they consistently show up when encountering this issue with k3s in vcluster (plus NFS).

And:

E0804 15:54:24.513119       7 autoregister_controller.go:195] v1.apps failed with : Timeout: request did not complete within requested timeout - context deadline exceeded
E0804 15:54:24.513138       7 finisher.go:175] FinishRequest: post-timeout activity - time-elapsed: 3.599µs, panicked: false, err: context deadline exceeded, panic-reason: <nil>
E0804 15:54:24.513231       7 autoregister_controller.go:195] v1.authentication.k8s.io failed with : Timeout: request did not complete within requested timeout - context deadline exceeded
E0804 15:54:24.513260       7 autoregister_controller.go:195] v1.apiextensions.k8s.io failed with : Timeout: request did not complete within requested timeout - context deadline exceeded
E0804 15:54:24.513296       7 autoregister_controller.go:195] v1.admissionregistration.k8s.io failed with : Timeout: request did not complete within requested timeout - context deadline exceeded
E0804 15:54:24.515170       7 autoregister_controller.go:195] v1. failed with : Timeout: request did not complete within requested timeout - context deadline exceeded

Replacing k3s with k0s or k8s avoids this issue. If you want to stick with k3s, though, you have a few options:

  1. Disable storage persistence. You can do this with a values file with contents like:

storage:
  persistence: false

This will of course cause vcluster to lose its data if the pod dies (see here for more info), but it is a good option for testing/dev.

  2. Change the k3s datastore endpoint to something other than sqlite. etcd has reliably worked even when hitting the above errors with the default datastore endpoint. Again, this can be done with a values file like:

vcluster:
  extraArgs:
    - --datastore-endpoint=etcd

  3. Change your default storage class. There is at least anecdotal evidence that longhorn works, or presumably anything other than nfs?! (A sketch of applying these values files, and of swapping the default storage class, follows below.)
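As a rough usage sketch (the release name, namespace, and StorageClass names below are placeholders, and the chart/repo shown is the standard charts.loft.sh vcluster chart; adjust for your install method), the options above could look something like this:

# options 1/2: write the values file shown above
cat > values.yaml <<'EOF'
storage:
  persistence: false
EOF

# deploy the (k3s-based) vcluster with those values
helm upgrade --install my-vcluster vcluster \
  --repo https://charts.loft.sh \
  --namespace vcluster-my-vcluster --create-namespace \
  -f values.yaml

# option 3: move the "default" annotation off the NFS class and onto another one
kubectl patch storageclass nfs-csi \
  -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
kubectl patch storageclass longhorn \
  -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'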

Related to #45 and an issue opened in k3s here.

What did you expect to happen?

vcluster to start and operate successfully

How can we reproduce it (as minimally and precisely as possible)?

deploy a vcluster on a cluster with nfs as the default storage class
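A minimal repro sketch, assuming csi-driver-nfs is already installed on the host cluster and an NFS server is reachable (the server address, export path, StorageClass name, and vcluster name below are placeholders):

# make an NFS-backed StorageClass the cluster default
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-csi
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: nfs.csi.k8s.io
parameters:
  server: nfs.example.internal   # placeholder NFS server address
  share: /exports/vcluster       # placeholder export path
reclaimPolicy: Delete
volumeBindingMode: Immediate
EOF

# create a default (k3s-based) vcluster and watch whether it comes up
vcluster create test-vcluster -n test-vcluster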

Anything else we need to know?

Note that the k3s cluster (on NFS, with the default storage backend) does sometimes eventually boot successfully, presumably after all the controllers get registered. I probably wouldn't trust a vcluster in that state, though 😁

This can probably be closed straight away; I'm just documenting it so it is hopefully easily searchable for others in the future!

Host cluster Kubernetes version

clientVersion:
  buildDate: "2022-05-24T12:26:19Z"
  compiler: gc
  gitCommit: 3ddd0f45aa91e2f30c70734b175631bec5b5825a
  gitTreeState: clean
  gitVersion: v1.24.1
  goVersion: go1.18.2
  major: "1"
  minor: "24"
  platform: darwin/amd64
kustomizeVersion: v4.5.4
serverVersion:
  buildDate: "2022-06-15T14:15:38Z"
  compiler: gc
  gitCommit: f66044f4361b9f1f96f0053dd46cb7dce5e990a8
  gitTreeState: clean
  gitVersion: v1.24.2
  goVersion: go1.18.3
  major: "1"
  minor: "24"
  platform: linux/amd64

Host cluster Kubernetes distribution

vanilla

vcluster version

tested on 0.10.0 through 0.11.0

Vcluster Kubernetes distribution (k3s (default), k8s, k0s)

k3s (see above though)

OS and Arch

OS: Ubuntu Jammy
Arch: x86

carlmontanari · Aug 04 '22 16:08