Kind cluster fails to provision PV when a USB device was removed from the machine
What happened:
I'm running Kind (with export KIND_EXPERIMENTAL_PROVIDER=podman) on my laptop. When I start the cluster while a mouse is connected to the machine, I can create a pod with a local volume. Once I remove that mouse, volume provisioning starts to fail.
The same issue happens when I close the lid to have the laptop go to sleep, and then wake it up again.
What you expected to happen:
Setup of PVCs and PVs continues to work.
How to reproduce it (as minimally and precisely as possible):
- export KIND_EXPERIMENTAL_PROVIDER=podman
- lsusb returns something like
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 003 Device 003: ID 13d3:5405 IMC Networks Integrated Camera
Bus 003 Device 044: ID 06cb:00f9 Synaptics, Inc.
Bus 003 Device 046: ID 0458:0007 KYE Systems Corp. (Mouse Systems) Trackbar Emotion
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
- kind create cluster
- Have a YAML file duplicating the standard storageclass under the name local-path, something like
cat storageclass-local-path.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: local-path
namespace: kube-system
annotations:
storageclass.kubernetes.io/is-default-class: "false"
provisioner: rancher.io/local-path
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
- kubectl apply -f storageclass-local-path.yaml
- kubectl apply -k 'https://github.com/rancher/local-path-provisioner/examples/pod-with-local-volume'
- After a small while, kubectl get pods -A shows volume-test in namespace default as Running.
- kubectl delete -k 'https://github.com/rancher/local-path-provisioner/examples/pod-with-local-volume'
- Disconnect that USB mouse.
- Check with lsusb that the device 003/046 or whatever ids it had is no longer there.
- kubectl apply -k 'https://github.com/rancher/local-path-provisioner/examples/pod-with-local-volume'
- kubectl get pods -A shows
NAMESPACE NAME READY STATUS RESTARTS AGE
default volume-test 0/1 Pending 0 9s
[...]
local-path-storage helper-pod-create-pvc-1e7e0729-1ec4-4b0e-91ef-3c41e0495783 0/1 StartError 0 9s
kubectl events -n local-path-storage deployment/local-path-provisioner shows
42s Warning Failed Pod/helper-pod-create-pvc-1e7e0729-1ec4-4b0e-91ef-3c41e0495783 Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error creating device nodes: mount /dev/bus/usb/003/046:/run/containerd/io.containerd.runtime.v2.task/k8s.io/helper-pod/rootfs/dev/bus/usb/003/046 (via /proc/self/fd/6), flags: 0x1000: no such file or directory: unknown
Anything else we need to know?:
I actually first encountered it when I suspended the laptop, then woke it up and wanted to continue using the Kind cluster.
The Bus 003 Device 044: ID 06cb:00f9 Synaptics, Inc. device gets a different device id upon wakeup.
Environment:
- kind version: (use kind version): kind v0.20.0 go1.20.4 linux/amd64
- Runtime info: (use docker info or podman info):
host:
arch: amd64
buildahVersion: 1.32.0
cgroupControllers:
- cpuset
- cpu
- io
- memory
- pids
cgroupManager: systemd
cgroupVersion: v2
conmon:
package: conmon-2.1.7-2.fc38.x86_64
path: /usr/bin/conmon
version: 'conmon version 2.1.7, commit: '
cpuUtilization:
idlePercent: 70.31
systemPercent: 6.54
userPercent: 23.15
cpus: 8
databaseBackend: boltdb
distribution:
distribution: fedora
variant: xfce
version: "38"
eventLogger: journald
freeLocks: 2038
hostname: machine.example.com
idMappings:
gidmap:
- container_id: 0
host_id: 2000
size: 1
- container_id: 1
host_id: 524288
size: 65536
uidmap:
- container_id: 0
host_id: 2000
size: 1
- container_id: 1
host_id: 524288
size: 65536
kernel: 6.5.6-200.fc38.x86_64
linkmode: dynamic
logDriver: journald
memFree: 8981233664
memTotal: 33331113984
networkBackend: netavark
networkBackendInfo:
backend: netavark
dns:
package: aardvark-dns-1.8.0-1.fc38.x86_64
path: /usr/libexec/podman/aardvark-dns
version: aardvark-dns 1.8.0
package: netavark-1.8.0-2.fc38.x86_64
path: /usr/libexec/podman/netavark
version: netavark 1.8.0
ociRuntime:
name: crun
package: crun-1.9.2-1.fc38.x86_64
path: /usr/bin/crun
version: |-
crun version 1.9.2
commit: 35274d346d2e9ffeacb22cc11590b0266a23d634
rundir: /run/user/2000/crun
spec: 1.0.0
+SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
os: linux
pasta:
executable: /usr/bin/pasta
package: passt-0^20231004.gf851084-1.fc38.x86_64
version: |
pasta 0^20231004.gf851084-1.fc38.x86_64
Copyright Red Hat
GNU General Public License, version 2 or later
<https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
remoteSocket:
exists: false
path: /run/user/2000/podman/podman.sock
security:
apparmorEnabled: false
capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
rootless: true
seccompEnabled: true
seccompProfilePath: /usr/share/containers/seccomp.json
selinuxEnabled: true
serviceIsRemote: false
slirp4netns:
executable: /usr/bin/slirp4netns
package: slirp4netns-1.2.1-1.fc38.x86_64
version: |-
slirp4netns version 1.2.1
commit: 09e31e92fa3d2a1d3ca261adaeb012c8d75a8194
libslirp: 4.7.0
SLIRP_CONFIG_VERSION_MAX: 4
libseccomp: 2.5.3
swapFree: 8589877248
swapTotal: 8589930496
uptime: 202h 32m 16.00s (Approximately 8.42 days)
plugins:
authorization: null
log:
- k8s-file
- none
- passthrough
- journald
network:
- bridge
- macvlan
- ipvlan
volume:
- local
registries:
search:
- registry.fedoraproject.org
- registry.access.redhat.com
- docker.io
- quay.io
store:
configFile: /home/kind/.config/containers/storage.conf
containerStore:
number: 1
paused: 0
running: 1
stopped: 0
graphDriverName: overlay
graphOptions: {}
graphRoot: /home/kind/.local/share/containers/storage
graphRootAllocated: 26241896448
graphRootUsed: 11933265920
graphStatus:
Backing Filesystem: extfs
Native Overlay Diff: "true"
Supports d_type: "true"
Supports shifting: "false"
Supports volatile: "true"
Using metacopy: "false"
imageCopyTmpDir: /var/tmp
imageStore:
number: 94
runRoot: /tmp/containers-user-2000/containers
transientStore: false
volumePath: /home/kind/.local/share/containers/storage/volumes
version:
APIVersion: 4.7.0
Built: 1695839078
BuiltTime: Wed Sep 27 20:24:38 2023
GitCommit: ""
GoVersion: go1.20.8
Os: linux
OsArch: linux/amd64
Version: 4.7.0
- OS (e.g. from /etc/os-release): CPE_NAME="cpe:/o:fedoraproject:fedora:38"
- Kubernetes version: (use kubectl version):
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.9", GitCommit:"d1483fdf7a0578c83523bc1e2212a606a44fd71d", GitTreeState:"archive", BuildDate:"2023-09-16T00:00:00Z", GoVersion:"go1.20.8", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-15T00:36:28Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}
- Any proxies or other special environment settings?:
KIND_EXPERIMENTAL_PROVIDER=podman
It's not quite clear to me from the description ... is this an error from the local-path-provisioner, or does any pod in kind not work?
The error comes from containerd attempting to start the helper-pod-create-pvc-1e7e0729-1ec4-4b0e-91ef-3c41e0495783 that gets initiated by the local-path-provisioner-6bc4bddd6b-rnsqd to fulfill the PVC request that comes from https://github.com/rancher/local-path-provisioner/blob/master/examples/pvc-with-local-volume/pvc.yaml.
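(For reference, the relevant objects can be inspected with standard kubectl commands; the helper pod name will differ for each PVC:)
kubectl get pods -n local-path-storage
kubectl describe pod -n local-path-storage helper-pod-create-pvc-1e7e0729-1ec4-4b0e-91ef-3c41e0495783
kubectl logs -n local-path-storage deployment/local-path-provisioner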
Is it a https://github.com/rancher/local-path-provisioner bug then?
I don't think the code in local-path-provisioner does much with setting up the root fs and the mount points for the pod.
This seems to be related to how the "nodes" are created and represented by Kind / init / containerd / something and what they assume and inherit.
It's not quite clear to me from the description ... is this an error from the local-path-provisioner, or does any pod in kind not work?
That is why I asked: does this happen with any pod, or only with this specific pod?
Ah, you mean whether there is something wrong with that specific example? Not really; when I turn it into a trivial busybox container with
apiVersion: v1
kind: Pod
metadata:
name: volume-test-2
spec:
containers:
- name: volume-test-2
image: busybox
imagePullPolicy: IfNotPresent
command:
- mount
volumeMounts:
- name: volv2
mountPath: /data2
volumes:
- name: volv2
persistentVolumeClaim:
claimName: local-volume-pvc-2
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: local-volume-pvc-2
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 10Mi
I get the very same error message once the list of USB devices changes.
What I'm trying to understand is whether it is a general problem or whether it only happens because of the PersistentVolumes.
I only saw it with that helper pod. When I apply a pod without any volumes
apiVersion: v1
kind: Pod
metadata:
name: no-volume
spec:
containers:
- name: no-volume
image: busybox
imagePullPolicy: IfNotPresent
command:
- mount
the pod and container get created and run fine. The mount output shows a very limited set of things mounted under /dev/ in that case:
$ kubectl logs pod/no-volume | grep ' on /dev/'
devpts on /dev/pts type devpts (rw,seclabel,nosuid,noexec,relatime,gid=524292,mode=620,ptmxmode=666)
mqueue on /dev/mqueue type mqueue (rw,seclabel,nosuid,nodev,noexec,relatime)
/dev/mapper/vg_machine-lv_containers on /dev/termination-log type ext4 (rw,seclabel,relatime)
shm on /dev/shm type tmpfs (rw,seclabel,nosuid,nodev,noexec,relatime,size=65536k,uid=2000,gid=2000,inode64)
devtmpfs on /dev/null type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/random type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/full type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/tty type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/zero type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/urandom type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
To debug, when I
kubectl edit -n local-path-storage cm local-path-config
and change the image to busybox and add a mount and a sleep to the setup script with
apiVersion: v1
kind: Pod
metadata:
name: helper-pod
spec:
containers:
- name: helper-pod
image: busybox
imagePullPolicy: IfNotPresent
setup: |-
#!/bin/sh
set -eu
mount
sleep 30
mkdir -m 0777 -p "$VOL_DIR"
and
kubectl rollout restart deployment local-path-provisioner -n local-path-storage
provisioning the pod with a PVC shows a huge number of bind (?) mounts:
kubectl logs -n local-path-storage helper-pod-create-pvc-59b95912-a254-454b-b26b-889c10b217c6 | grep ' on /dev/'
devpts on /dev/pts type devpts (rw,seclabel,nosuid,noexec,relatime,gid=524292,mode=620,ptmxmode=666)
mqueue on /dev/mqueue type mqueue (rw,seclabel,nosuid,nodev,noexec,relatime)
/dev/mapper/vg_machine-lv_containers on /dev/termination-log type ext4 (rw,seclabel,relatime)
shm on /dev/shm type tmpfs (rw,seclabel,nosuid,nodev,noexec,relatime,size=65536k,uid=2000,gid=2000,inode64)
devtmpfs on /dev/acpi_thermal_rel type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/autofs type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/btrfs-control type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/bus/usb/001/001 type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/bus/usb/002/001 type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/bus/usb/003/001 type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/bus/usb/003/003 type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/bus/usb/003/050 type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/bus/usb/004/001 type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/cpu/0/cpuid type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/cpu/0/msr type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/cpu/1/cpuid type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/cpu/1/msr type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/cpu/2/cpuid type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/cpu/2/msr type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/cpu/3/cpuid type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/cpu/3/msr type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/cpu/4/cpuid type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
[...]
devtmpfs on /dev/watchdog type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/watchdog0 type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/zero type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
devtmpfs on /dev/zram0 type devtmpfs (rw,seclabel,nosuid,noexec,size=4096k,nr_inodes=4062748,mode=755,inode64)
So something is different between the "normal" pods/containers and the pod/container created as the helper for the local-path provisioner.
We don't control the device mounts being propagated from the host to the "node", that's podman.
The helper pod is privileged which is why it is also seeing all the mounts, unlike your simple test pod. https://github.com/rancher/local-path-provisioner/blob/4d42c70e748fed13cd66f86656e909184a5b08d2/provisioner.go#L553
Thanks for that pointer -- I confirm that when I add
securityContext:
privileged: true
to my regular container, I get the same issues as with the local-path helper.
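In other words, a minimal privileged pod along these lines (essentially the no-volume test pod from above plus the privileged securityContext; names are just illustrative) fails the same way once the USB device list changes:
apiVersion: v1
kind: Pod
metadata:
  name: privileged-no-volume
spec:
  containers:
  - name: privileged-no-volume
    image: busybox
    imagePullPolicy: IfNotPresent
    command:
    - mount
    securityContext:
      privileged: true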
What I'd like to figure out though: you say "we don't control the device mounts being propagated from the host to the "node"". But in this case it is not propagation of the device mounts from the host because on the host the /dev/bus/usb/*/* device is no longer there. So it is being propagated from something else, possibly some parent (?) pod (?) that has a list of devices that it once saw?
IIRC docker/podman will sync all the /dev entries on creating the container, but there is no mount propagation to reflect updated entries. Then the nested containerd/runc will try to create these for the "inner" pod containers.
I don't think there are great solutions here ... maybe we can find a way to detect these "dangling" mounts and remove them from the node or hook the inner runc.
FWIW kind clusters are meant to be disposable and quick to create so maybe recreate after changing devices :/
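i.e. roughly:
kind delete cluster
kind create cluster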
The opposite is a known issue with docker: "privileged containers do not reflect newly added host devices" has been a longstanding issue as I recall. We should look at what workarounds people are using for this since it's more or less the same root issue: https://github.com/moby/moby/issues/16160
Well realistically I'd be OK to just disable any propagation of /dev/bus/usb to the containers, either the first one (podman), or the next layer (containerd?). Is the search for the devices somehow configurable in either of those cases?
Well realistically I'd be OK to just disable any propagation of /dev/bus/usb to the containers, either the first one (podman), or the next layer (containerd?). Is the search for the devices somehow configurable in either of those cases?
No, we're not even telling podman/docker to pass through these to the node, it's implicit with --privileged which we need to run Kubernetes/containerd.
Ditto with the privileged pods. Everything under /dev gets passed through IIRC*
* a TTY for the container may be set up specially.
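One quick way to see exactly what the node ended up with (assuming the podman provider and the default kind-control-plane node name):
podman exec kind-control-plane ls -lR /dev/bus/usb
podman exec kind-control-plane mount | grep ' on /dev/bus/usb'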
So with some experimentation, I got the setup working with
--- a/images/base/files/etc/containerd/config.toml
+++ b/images/base/files/etc/containerd/config.toml
@@ -19,6 +19,9 @@ version = 2
runtime_type = "io.containerd.runc.v2"
# Generated by "ctr oci spec" and modified at base container to mount poduct_uuid
base_runtime_spec = "/etc/containerd/cri-base.json"
+
+ privileged_without_host_devices = true
+
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
# use systemd cgroup by default
SystemdCgroup = true
and rebuilding the base and node images.
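(For anyone else trying this: the base image is built from images/base in the kind repo, and the node image on top of it is roughly the following; the tags here are just illustrative:)
kind build node-image --base-image kindest/base:<locally-built-tag> --image kindest/node:dev
kind create cluster --image kindest/node:dev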
I tested it with rootless podman, and both pods with PVs and running a privileged pod work, with both the USB-unplug use case and suspending the laptop and waking it up. I did not try any additional tests to see what this might break. If I file this as a pull request, will you allow the tests to run to see what it discovers in the general Kind testing / CI?
Now the question is if / how to make this available in Kind in general, what the default should be, and what mechanism to provide for people to override it.
Given that not having those devices in the privileged containers seems like a safer default, and that with https://github.com/moby/moby/issues/16160 unaddressed, hotplugging of devices does not work with docker anyway, I'd lean towards having true (no host devices) as the default.
But what should people use to override it?
Mounting the config.toml via extraMounts does not work because it gets manipulated at least in https://github.com/kubernetes-sigs/kind/blob/main/images/base/files/usr/local/bin/entrypoint.
We could add another KIND_EXPERIMENTAL_CONTAINERD_ variable and amend that sed -i logic to use it.
We could also use
imports = ["/etc/containerd/config.d/*.toml"]
and document extraMounts-ing any overrides into that directory. In fact, the configure_containerd in https://github.com/kubernetes-sigs/kind/blob/main/images/base/files/usr/local/bin/entrypoint could use that mechanism instead of that sed -i approach as well.
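To make that concrete, a drop-in file mounted into the node via extraMounts could look something like this (the config.d directory is hypothetical; kind does not read such a directory today):
# hypothetical drop-in, e.g. /etc/containerd/config.d/10-no-host-devices.toml,
# picked up via the imports line above in the main config.toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  privileged_without_host_devices = true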
I don't want to make a change like moving from that sed -i to drop-in snippets just for this device-mounting issue ... but I'd be happy to provide a PR to switch to the drop-in snippets approach if it is viewed as a useful approach in general.
Now the question is if / how to make this available in Kind in general, what the default should be, and what mechanism to provide for people to override it.
I suspect this would break a LOT of users doing interesting driver development.
Given that not having those devices in the privileged containers seems like a safer default, and that with https://github.com/moby/moby/issues/16160 unaddressed, hotplugging of devices does not work with docker anyway, I'd lean towards having true (no host devices) as the default.
I'm fairly certain this would break standard kubernetes tests.
You can configure this for your clusters today though with the poorly documented containerdConfigPatch https://kind.sigs.k8s.io/docs/user/private-registries/#use-a-certificate
Ah, great.
I confirm that with
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
[...]
containerdConfigPatches:
- |-
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
privileged_without_host_devices = true
things work just fine.
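(To double-check that the patch actually landed in the node, assuming the podman provider and the default node name:
podman exec kind-control-plane grep privileged_without_host_devices /etc/containerd/config.toml
shows the setting in the generated config.)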
I'm closing this issue as I have a way to address the problem I've been hitting. If you think that exposing this in some way (possibly in documentation?) might be helpful to others, let me know.
I'd like to reopen this if you don't mind because I know other users are going to hit this and requiring the workaround config is still unfortunate.
We should probably add a "known issues" page entry to start with a pointer to this configuration and continue to track this while we consider options to automatically mitigate.
I think it will be pretty involved to implement but ideally we'd just trim missing entries.
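As a rough, untested sketch of that idea (assumes the podman provider, the default kind-control-plane node name, and findmnt being available in the node image):
# unmount /dev/bus/usb entries in the node whose device node no longer exists on the host
for m in $(podman exec kind-control-plane findmnt -rn -o TARGET | grep '^/dev/bus/usb/'); do
  if [ ! -e "$m" ]; then    # the corresponding host device node is gone
    podman exec kind-control-plane umount "$m"
  fi
done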
Actually, in the docker issue there's a suggestion to just bind mount /dev explicitly to avoid this behavior? 👀
https://github.com/moby/moby/issues/16160#issuecomment-551388571
We can test this with extraMounts hostPath: /dev containerPath: /dev
I confirm that with
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
extraMounts:
- hostPath: /dev
containerPath: /dev
the problem is gone as well.
After the removal of the USB mouse, the device node gets removed from the host's /dev/bus/usb/003/ and it is no longer shown in
podman exec kind-control-plane mount | grep ' on /dev'
and creating a pod with a privileged container passes as well.
With this approach, I would just be concerned about implications on /dev/tty and similar non-global, per process devices.
With this approach, I would just be concerned about implications on /dev/tty and similar non-global, per process devices.
/dev/tty at least I'm pretty sure gets specially set up by runc regardless, but I share that concern. I'd want to carefully investigate before doing this by default, but it seems like this might be sufficient.
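For example, comparing the node's tty-related entries with and without the extraMounts, something like (assuming the podman provider and default node name):
podman exec kind-control-plane ls -l /dev/tty /dev/console /dev/ptmx
podman exec kind-control-plane mount | grep -E ' on /dev/(tty|console|ptmx)'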