
KubeVirt node eviction leaves VolumeAttachment stuck to deleted Node

Open embik opened this issue 3 years ago • 9 comments

What happened?

While testing #11736, I created a PVC to make sure that evicting a virt-launcher pod would allow me to reschedule workloads with storage within the KubeVirt user cluster.

However, I noticed that a Pod trying to mount a volume previously attached to a node that was evicted on the KubeVirt infra side (the node-eviction-controller drains and then deletes the VM and the Node object) gets stuck with:

Warning  FailedAttachVolume  3m40s  attachdetach-controller  Multi-Attach error for volume "pvc-04bf24ee-a755-4bee-bbcb-559aca75d862" Volume is already exclusively attached to one node and can't be attached to another
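
For reference, the event above is what kubectl describe shows for the Pod that tries to mount the volume (Pod name taken from the manifest further below):

kubectl describe pod app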

I looked for volumeattachment resources and found this one:

NAME                                                                   ATTACHER          PV                                         NODE                                        ATTACHED   AGE
csi-6b42e564b2e31809881c86d5385e7711d0c094bb60039095d14178daabc6ecc0   csi.kubevirt.io   pvc-04bf24ee-a755-4bee-bbcb-559aca75d862   zhtjh9blrt-worker-w8z64w-5f679f4c95-68tvr   true       10m
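
The listing above was presumably obtained with something like:

kubectl get volumeattachments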

This references a node that no longer exists. Looking at the VolumeAttachment in detail, it has a deletion timestamp, and its status is:

status:
    attachError:
      message: 'rpc error: code = Unknown desc = Operation cannot be fulfilled on
        virtualmachineinstance.kubevirt.io "zhtjh9blrt-worker-w8z64w-5f679f4c95-68tvr":
        Unable to add volume [pvc-04bf24ee-a755-4bee-bbcb-559aca75d862] because it
        already exists'
      time: "2023-02-09T13:17:19Z"
    attached: true
    detachError:
      message: 'rpc error: code = NotFound desc = failed to find VM with domain.firmware.uuid
        6d9a9661-0871-5893-9d13-60a352d74d6e'
      time: "2023-02-09T13:27:11Z"

Expected behavior

The volume should be attachable to another node, since both the initial Pod and the Node have been terminated.

How to reproduce the issue?

  1. Create a KubeVirt user cluster on QA.
  2. Create the PVC and Pod from the manifests provided below (see "Provide your KKP manifest here").
  3. Wait for the PVC and Pod to be created, scheduled and started.
  4. Use kubectl-evict on the KubeVirt infra cluster, targeting the virt-launcher Pod backing the Node that our app Pod was scheduled to (see the command sketch after this list).
  5. Wait for the node to be drained and a new node to join the cluster.
  6. Re-apply the Pod manifest, trying to mount the PVC, which should be mountable since no other active Pod is using it.
  7. Observe that the Pod does not start.
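
A rough sketch of step 4, in case the kubectl-evict plugin is not at hand: the same effect can be achieved by POSTing an Eviction to the pods/eviction subresource on the infra cluster. The namespace and virt-launcher Pod name below are placeholders and need to be adjusted to the actual user cluster.

# Placeholders: namespace of the user cluster on the infra side and the
# virt-launcher Pod backing the affected Node.
NS="cluster-zhtjh9blrt"
POD="virt-launcher-zhtjh9blrt-worker-w8z64w-5f679f4c95-68tvr-xxxxx"

# Build the Eviction body and POST it to the pod's eviction subresource.
cat > eviction.json <<EOF
{"apiVersion": "policy/v1", "kind": "Eviction", "metadata": {"name": "${POD}", "namespace": "${NS}"}}
EOF

kubectl create --raw "/api/v1/namespaces/${NS}/pods/${POD}/eviction" -f eviction.json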

How is your environment configured?

  • KKP version: v2.22.0-alpha.0
  • Shared or separate master/seed clusters?: shared

Provide your KKP manifest here (if applicable)

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 4Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: centos
    command: ["/bin/sh"]
    args: ["-c", "while true; do echo $(date -u) >> /data/out.txt; sleep 5; done"]
    volumeMounts:
    - name: persistent-storage
      mountPath: /data
  volumes:
  - name: persistent-storage
    persistentVolumeClaim:
      claimName: pvc
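
Assuming the two manifests above are saved to a single file, e.g. manifests.yaml (file name chosen here for illustration), steps 2 and 6 of the reproduction boil down to:

kubectl apply -f manifests.yaml   # step 2: creates the PVC and the Pod
# ... evict the backing virt-launcher Pod and wait for the replacement node ...
kubectl apply -f manifests.yaml   # step 6: recreates the Pod; the existing PVC is left untouched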

What cloud provider are you running on?

KubeVirt

What operating system are you running in your user cluster?

Ubuntu 22.04

Additional information

embik, Feb 09 '23 13:02

I will work on the issue upstream (https://github.com/kubevirt/csi-driver/issues/83); it should not block the KKP 2.22 release.

mfranczy, Feb 13 '23 16:02

Issues go stale after 90d of inactivity. After a further 30 days, they will turn rotten. Mark the issue as fresh with /remove-lifecycle stale.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kubermatic-bot, May 14 '23 19:05

/remove-lifecycle stale

embik, May 15 '23 05:05

Issues go stale after 90d of inactivity. After a further 30 days, they will turn rotten. Mark the issue as fresh with /remove-lifecycle stale.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kubermatic-bot, Aug 31 '23 11:08

/remove-lifecycle stale

embik, Sep 01 '23 06:09

Issues go stale after 90d of inactivity. After a further 30 days, they will turn rotten. Mark the issue as fresh with /remove-lifecycle stale.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kubermatic-bot, Feb 02 '24 00:02

/remove-lifecycle stale

embik, Feb 02 '24 06:02

/remove-priority high

csengerszabo, Aug 22 '24 12:08

/milestone clear

csengerszabo, Aug 22 '24 12:08