Talos Linux Pod is unable to attach to nvme over TCP

Open linucksrox opened this issue 1 year ago • 1 comments

I configured democratic-csi to use NVMe over TCP with the zfs-generic-nvmeof driver. Up to this point everything works: I can create a test PVC, the zvol is generated, and the PVC binds successfully to its PV. But when I try to mount the PVC in a pod, the container is stuck in the ContainerCreating state, and the log shows: MountVolume.MountDevice failed for volume "pvc-13b4ff6a-90ee-48e1-be6a-f011824f63c7" : rpc error: code = Unknown desc = unable to attach any nvme devices

I opened an issue in the Talos repo and there is some discussion there around what I've tried specifically: https://github.com/siderolabs/talos/issues/9255

linucksrox avatar Aug 31 '24 16:08 linucksrox

According to the Talos devs, I have shown that this is not an issue in Talos: I ran a debug container in privileged mode with /dev mounted, installed nvme-cli, and manually connected to the NVMe target over TCP from there. Tested on Talos 1.7.5 with no extra extensions besides qemu-agent.

debug-pod.yaml

apiVersion: v1
kind: Pod
metadata:
  name: debugpod
  namespace: kube-system
spec:
  hostPID: true
  containers:
  - name: debugcontainer
    image: alpine:3.20
    stdin: true
    tty: true
    securityContext:
      privileged: true
    volumeMounts:
    - name: dev-mount
      mountPath: /dev
  volumes:
  - name: dev-mount
    hostPath:
      path: /dev
  nodeSelector:
    kubernetes.io/hostname: taloswk1

kubectl apply -f debug-pod.yaml
kubectl exec -it debugpod -n kube-system -- /bin/sh
/# apk add nvme-cli
/# nvme discover -t tcp -a 10.0.50.99 -s 4420
...
/# nvme connect -t tcp -n nqn.2003-01.org.linux-nvme:default-testpvc -a 10.0.50.99 -s 4420

Kubelet logs:

10.0.50.21: {"ts":1725375269991.8162,"caller":"csi/csi_attacher.go:366","msg":"kubernetes.io/csi: attacher.MountDevice failed: rpc error: code = Unknown desc = unable to attach any nvme devices"}
10.0.50.21: {"ts":1725375269992.1475,"caller":"nestedpendingoperations/nestedpendingoperations.go:348","msg":"Operation for \"{volumeName:kubernetes.io/csi/org.democratic-csi.nvmeof^pvc-13b4ff6a-90ee-48e1-be6a-f011824f63c7 podName: nodeName:}\" failed. No retries permitted until 2024-09-03 14:54:33.992103564 +0000 UTC m=+1802105.570324602 (durationBeforeRetry 4s). Error: MountVolume.MountDevice failed for volume \"pvc-13b4ff6a-90ee-48e1-be6a-f011824f63c7\" (UniqueName: \"kubernetes.io/csi/org.democratic-csi.nvmeof^pvc-13b4ff6a-90ee-48e1-be6a-f011824f63c7\") pod \"testlogger\" (UID: \"390766e7-ae25-4651-9d1f-423260057776\") : rpc error: code = Unknown desc = unable to attach any nvme devices"}
10.0.50.21: {"ts":1725375270081.469,"caller":"machine/info.go:104","msg":"Failed to get disk map: open /sys/block/nvme1c1n1/dev: no such file or directory"}
10.0.50.21: {"ts":1725375274092.8572,"caller":"operationexecutor/operation_generator.go:622","msg":"MountVolume.WaitForAttach entering for volume \"pvc-13b4ff6a-90ee-48e1-be6a-f011824f63c7\" (UniqueName: \"kubernetes.io/csi/org.democratic-csi.nvmeof^pvc-13b4ff6a-90ee-48e1-be6a-f011824f63c7\") pod \"testlogger\" (UID: \"390766e7-ae25-4651-9d1f-423260057776\") DevicePath \"\"","v":0,"pod":{"name":"testlogger","namespace":"default"}}
10.0.50.21: {"ts":1725375274103.369,"caller":"operationexecutor/operation_generator.go:632","msg":"MountVolume.WaitForAttach succeeded for volume \"pvc-13b4ff6a-90ee-48e1-be6a-f011824f63c7\" (UniqueName: \"kubernetes.io/csi/org.democratic-csi.nvmeof^pvc-13b4ff6a-90ee-48e1-be6a-f011824f63c7\") pod \"testlogger\" (UID: \"390766e7-ae25-4651-9d1f-423260057776\") DevicePath \"csi-a89fc4a91601eb37ee03000807bb7b5676a63379db6d6dd0a50017f87702142a\"","v":0,"pod":{"name":"testlogger","namespace":"default"}}

linucksrox avatar Sep 03 '24 14:09 linucksrox

Can you send the logs from the csi-driver container? They should show the actual commands being executed, so we can see what's going on.

travisghansen avatar Jan 22 '25 04:01 travisghansen
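
For anyone gathering the csi-driver logs requested above, here is a minimal sketch of one way to do it. It assumes democratic-csi is installed in a namespace called democratic-csi (an assumption, adjust to your install); the function name csi_driver_logs is hypothetical and not part of democratic-csi. It lists every pod in that namespace scheduled on the given node and prints the tail of its csi-driver container log. Pods without a container of that name will simply make kubectl report an error for that pod.

```shell
# Sketch only: print recent csi-driver container logs for pods on one node.
# Usage: csi_driver_logs <node-name> [namespace]
# The default namespace "democratic-csi" is an assumption, not taken from the issue.
csi_driver_logs() {
  node="$1"
  namespace="${2:-democratic-csi}"

  # List pods scheduled on the node; "-o name" prints entries like "pod/<name>".
  for pod in $(kubectl get pods -n "$namespace" \
      --field-selector "spec.nodeName=$node" -o name); do
    echo "== $pod =="
    # Container name "csi-driver" is the one named in the comment above.
    kubectl logs -n "$namespace" "$pod" -c csi-driver --tail=200
  done
}
```

With the node from the debug pod above, this would be called as: csi_driver_logs taloswk1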