Talos Linux pod is unable to attach to NVMe over TCP
I configured democratic-csi to use NVMe over TCP with the zfs-generic-nvmeof driver. Up to this point, I can create a test PVC; it generates the zvol and binds successfully to the PV. But when I try to mount the PVC in a pod, the container is stuck in the ContainerCreating state, and the log shows: MountVolume.MountDevice failed for volume "pvc-13b4ff6a-90ee-48e1-be6a-f011824f63c7" : rpc error: code = Unknown desc = unable to attach any nvme devices
I opened an issue in the Talos repo and there is some discussion there around what I've tried specifically: https://github.com/siderolabs/talos/issues/9255
According to the Talos devs, I have proven it's not an issue in Talos: I ran a debug container in privileged mode with /dev mounted, installed nvme-cli, and manually connected to the NVMe target over TCP from there. Tested on Talos 1.7.5 with no extra extensions besides qemu-agent.
debug-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: debugpod
  namespace: kube-system
spec:
  hostPID: true
  containers:
    - name: debugcontainer
      image: alpine:3.20
      stdin: true
      tty: true
      securityContext:
        privileged: true
      volumeMounts:
        - name: dev-mount
          mountPath: /dev
  volumes:
    - name: dev-mount
      hostPath:
        path: /dev
  nodeSelector:
    kubernetes.io/hostname: taloswk1
kubectl apply -f debug-pod.yaml
kubectl exec -it debugpod -n kube-system -- /bin/sh
/# apk add nvme-cli
/# nvme discover -t tcp -a 10.0.50.99 -s 4420
...
/# nvme connect -t tcp -n nqn.2003-01.org.linux-nvme:default-testpvc -a 10.0.50.99 -s 4420
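To confirm that a manual connect like the one above actually produced a block device, the standard nvme-cli listing commands can be run in the same shell. This is only a sketch, not output from the original test: the function name verify_nvme_target is hypothetical, and the final disconnect is an assumption about wanting to leave no manual session behind before retrying the pod.

```shell
# Hypothetical helper: inspect, then tear down, a manually connected NVMe/TCP target.
verify_nvme_target() {
  nqn="$1"
  # Show fabrics subsystems and their controller paths (transport, address, state).
  nvme list-subsys
  # Show the namespaces (block devices) currently visible to the host.
  nvme list
  # Disconnect again so no manually created session is left behind.
  nvme disconnect -n "$nqn"
}
```

Usage, with the NQN from the connect command above: verify_nvme_target nqn.2003-01.org.linux-nvme:default-testpvc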
Kubelet logs:
10.0.50.21: {"ts":1725375269991.8162,"caller":"csi/csi_attacher.go:366","msg":"kubernetes.io/csi: attacher.MountDevice failed: rpc error: code = Unknown desc = unable to attach any nvme devices"}
10.0.50.21: {"ts":1725375269992.1475,"caller":"nestedpendingoperations/nestedpendingoperations.go:348","msg":"Operation for \"{volumeName:kubernetes.io/csi/org.democratic-csi.nvmeof^pvc-13b4ff6a-90ee-48e1-be6a-f011824f63c7 podName: nodeName:}\" failed. No retries permitted until 2024-09-03 14:54:33.992103564 +0000 UTC m=+1802105.570324602 (durationBeforeRetry 4s). Error: MountVolume.MountDevice failed for volume \"pvc-13b4ff6a-90ee-48e1-be6a-f011824f63c7\" (UniqueName: \"kubernetes.io/csi/org.democratic-csi.nvmeof^pvc-13b4ff6a-90ee-48e1-be6a-f011824f63c7\") pod \"testlogger\" (UID: \"390766e7-ae25-4651-9d1f-423260057776\") : rpc error: code = Unknown desc = unable to attach any nvme devices"}
10.0.50.21: {"ts":1725375270081.469,"caller":"machine/info.go:104","msg":"Failed to get disk map: open /sys/block/nvme1c1n1/dev: no such file or directory"}
10.0.50.21: {"ts":1725375274092.8572,"caller":"operationexecutor/operation_generator.go:622","msg":"MountVolume.WaitForAttach entering for volume \"pvc-13b4ff6a-90ee-48e1-be6a-f011824f63c7\" (UniqueName: \"kubernetes.io/csi/org.democratic-csi.nvmeof^pvc-13b4ff6a-90ee-48e1-be6a-f011824f63c7\") pod \"testlogger\" (UID: \"390766e7-ae25-4651-9d1f-423260057776\") DevicePath \"\"","v":0,"pod":{"name":"testlogger","namespace":"default"}}
10.0.50.21: {"ts":1725375274103.369,"caller":"operationexecutor/operation_generator.go:632","msg":"MountVolume.WaitForAttach succeeded for volume \"pvc-13b4ff6a-90ee-48e1-be6a-f011824f63c7\" (UniqueName: \"kubernetes.io/csi/org.democratic-csi.nvmeof^pvc-13b4ff6a-90ee-48e1-be6a-f011824f63c7\") pod \"testlogger\" (UID: \"390766e7-ae25-4651-9d1f-423260057776\") DevicePath \"csi-a89fc4a91601eb37ee03000807bb7b5676a63379db6d6dd0a50017f87702142a\"","v":0,"pod":{"name":"testlogger","namespace":"default"}}
Can you send the logs from the csi-driver container? That should show the actual commands getting executed so we can see what's going on.
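One way to pull those logs, as a rough sketch: the namespace democratic-csi is an assumption (use whatever namespace the chart was installed into), the node name is taken from the debug pod's nodeSelector, and the two function names are hypothetical.

```shell
# Assumptions: democratic-csi runs in the "democratic-csi" namespace and the
# affected node is taloswk1; adjust both to match the actual install.
NAMESPACE="${NAMESPACE:-democratic-csi}"
NODE="${NODE:-taloswk1}"

# List pods in that namespace scheduled on the affected node
# (the democratic-csi node pod should be among them).
list_node_pods() {
  kubectl get pods -n "$NAMESPACE" --field-selector "spec.nodeName=$NODE" -o name
}

# Print recent logs from the csi-driver container of one pod (pass pod/<name>).
csi_driver_logs() {
  kubectl logs -n "$NAMESPACE" "$1" -c csi-driver --tail=200
}
```

Run list_node_pods first, then pass the node pod it prints to csi_driver_logs right after reproducing the failed mount, so the attach attempt is within the last 200 lines.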