csi-driver

VolumeFailedDelete events

tcldr opened this issue 1 month ago · 3 comments

TL;DR

The CSI controller is generating frequent 'VolumeFailedDelete' warning events with the message 'persistentvolume pvc- is still attached to node '.

Expected behavior

Persistent volumes attach/detach smoothly

Observed behavior

When attaching/detaching persistent volumes, particularly in high-churn use cases, both the CSI controller and the pods that depend on PVs produce many warnings and add latency to the cluster.

Minimal working example

Try setting up ARC (Actions Runner Controller for GitHub Actions) in Kubernetes container mode and triggering a few workflows:

---
apiVersion: v1
kind: Namespace
metadata:
  name: arc-runners
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: arc-runner-scale-set
  namespace: arc-runners
spec:
  interval: 30m
  chart:
    spec:
      chart: gha-runner-scale-set
      sourceRef:
        kind: HelmRepository
        name: arc
        namespace: arc-systems
      interval: 12h
      version: 0.13.0
  values:
    maxRunners: 2
    minRunners: 1
    githubConfigUrl: https://github.com/<YOUR REPO>
    githubConfigSecret: <YOUR REPO SECRET>
    runnerScaleSetName: self-hosted-runner
    containerMode:
      type: kubernetes
      kubernetesModeWorkVolumeClaim:
        accessModes: ["ReadWriteOnce"]
        storageClassName: hcloud-volumes
        resources:
          requests:
            storage: 1Gi
    template:
      spec:
        securityContext:
          fsGroup: 1001
        containers:
        - name: runner
          image: ghcr.io/actions/actions-runner:latest
          command: ["/home/runner/run.sh"]

Log output

apiVersion: v1
kind: Event
metadata:
  name: pvc-efa81f1e-869e-49dc-99f4-9390cd00758a.187a66094e8c405a
  namespace: default
  uid: 323dca32-2ebb-4f30-9877-a2ebf07dff8b
  resourceVersion: '55312424'
  creationTimestamp: '2025-11-22T17:59:54Z'
  selfLink: >-
    /api/v1/namespaces/default/events/pvc-efa81f1e-869e-49dc-99f4-9390cd00758a.187a66094e8c405a
spec: {}
involvedObject:
  kind: PersistentVolume
  name: pvc-efa81f1e-869e-49dc-99f4-9390cd00758a
  uid: ad3c9a95-0843-4e7f-a6c6-3a80a2255d4a
  apiVersion: v1
  resourceVersion: '55312391'
reason: VolumeFailedDelete
message: >-
  persistentvolume pvc-efa81f1e-869e-49dc-99f4-9390cd00758a is still attached to
  node cloud-02
source:
  component: >-
    csi.hetzner.cloud_hcloud-csi-controller-567f89966f-4x2w5_2950c7a3-f602-4696-95f9-d014acc08356
firstTimestamp: '2025-11-22T17:59:54Z'
lastTimestamp: '2025-11-22T18:00:01Z'
count: 4
type: Warning
eventTime: null
reportingComponent: >-
  csi.hetzner.cloud_hcloud-csi-controller-567f89966f-4x2w5_2950c7a3-f602-4696-95f9-d014acc08356
reportingInstance: ''
---
apiVersion: v1
kind: Event
metadata:
  name: my-cloud-tp9sg-runner-hhjpx.187a65295d0af25f
  namespace: arc-runners
  uid: e11e7370-37cb-4860-886c-8180d3616367
  resourceVersion: '55308459'
  creationTimestamp: '2025-11-22T17:43:52Z'
  selfLink: >-
    /api/v1/namespaces/arc-runners/events/my-cloud-tp9sg-runner-hhjpx.187a65295d0af25f
spec: {}
involvedObject:
  kind: Pod
  namespace: arc-runners
  name: my-cloud-tp9sg-runner-hhjpx
  uid: 9a871964-3534-4f2f-b9c2-97668b05fdf5
  apiVersion: v1
  resourceVersion: '55308341'
reason: FailedMount
message: >-
  MountVolume.SetUp failed for volume "pvc-efa81f1e-869e-49dc-99f4-9390cd00758a"
  : rpc error: code = Internal desc = failed to publish volume: format of disk
  "/dev/disk/by-id/scsi-0HC_Volume_104028078" failed: type:("ext4")
  target:("/var/lib/kubelet/pods/9a871964-3534-4f2f-b9c2-97668b05fdf5/volumes/kubernetes.io~csi/pvc-efa81f1e-869e-49dc-99f4-9390cd00758a/mount")
  options:("defaults") errcode:(exit status 1) output:(mke2fs 1.47.1
  (20-May-2024)

  The file /dev/disk/by-id/scsi-0HC_Volume_104028078 does not exist and no size
  was specified.

  ) 
source:
  component: kubelet
  host: cloud-02
firstTimestamp: '2025-11-22T17:43:52Z'
lastTimestamp: '2025-11-22T17:43:56Z'
count: 4
type: Warning
eventTime: null
reportingComponent: kubelet
reportingInstance: cloud-02
---
apiVersion: v1
kind: Event
metadata:
  name: my-cloud-tp9sg-runner-hhjpx-work.187a652518b33d76
  namespace: arc-runners
  uid: c05af831-89e3-4380-91cf-4f7ac3718f9f
  resourceVersion: '55308317'
  creationTimestamp: '2025-11-22T17:43:34Z'
  selfLink: >-
    /api/v1/namespaces/arc-runners/events/my-cloud-tp9sg-runner-hhjpx-work.187a652518b33d76
spec: {}
involvedObject:
  kind: PersistentVolumeClaim
  namespace: arc-runners
  name: my-cloud-tp9sg-runner-hhjpx-work
  uid: efa81f1e-869e-49dc-99f4-9390cd00758a
  apiVersion: v1
  resourceVersion: '55308263'
reason: ProvisioningFailed
message: >-
  failed to provision volume with StorageClass "hcloud-volumes": rpc error: code
  = DeadlineExceeded desc = context deadline exceeded
source:
  component: >-
    csi.hetzner.cloud_hcloud-csi-controller-567f89966f-4x2w5_2950c7a3-f602-4696-95f9-d014acc08356
firstTimestamp: '2025-11-22T17:43:34Z'
lastTimestamp: '2025-11-22T17:43:34Z'
count: 1
type: Warning
eventTime: null
reportingComponent: >-
  csi.hetzner.cloud_hcloud-csi-controller-567f89966f-4x2w5_2950c7a3-f602-4696-95f9-d014acc08356
reportingInstance: ''
---
apiVersion: v1
kind: Event
metadata:
  name: my-cloud-tp9sg-runner-r7728.187a65228e26c5c3
  namespace: arc-runners
  uid: 6696bc08-f201-433e-af85-6207aa913412
  resourceVersion: '55308253'
  creationTimestamp: '2025-11-22T17:43:23Z'
  selfLink: >-
    /api/v1/namespaces/arc-runners/events/my-cloud-tp9sg-runner-r7728.187a65228e26c5c3
spec: {}
involvedObject:
  kind: Pod
  namespace: arc-runners
  name: my-cloud-tp9sg-runner-r7728
  uid: 67965bc8-169f-4229-b15d-dd065a687968
  apiVersion: v1
  resourceVersion: '55308196'
reason: FailedScheduling
message: >-
  running PreBind plugin "VolumeBinding": binding volumes: pod does not exist
  any more: pod "my-cloud-tp9sg-runner-r7728" not found
source:
  component: default-scheduler
firstTimestamp: '2025-11-22T17:43:23Z'
lastTimestamp: '2025-11-22T17:43:23Z'
count: 1
type: Warning
eventTime: null
reportingComponent: default-scheduler
reportingInstance: ''
---
apiVersion: v1
kind: Event
metadata:
  name: my-cloud-tp9sg-runner-hhjpx.187a6522647ba8dc
  namespace: arc-runners
  uid: 2b1b99f7-5e0c-4032-80e3-0115f50b1c13
  resourceVersion: '55308215'
  creationTimestamp: '2025-11-22T17:43:22Z'
  selfLink: >-
    /api/v1/namespaces/arc-runners/events/my-cloud-tp9sg-runner-hhjpx.187a6522647ba8dc
spec: {}
involvedObject:
  kind: Pod
  namespace: arc-runners
  name: my-cloud-tp9sg-runner-hhjpx
  uid: 9a871964-3534-4f2f-b9c2-97668b05fdf5
  apiVersion: v1
  resourceVersion: '55308214'
reason: FailedScheduling
message: >-
  0/3 nodes are available: waiting for ephemeral volume controller to create the
  persistentvolumeclaim "my-cloud-tp9sg-runner-hhjpx-work". preemption: 0/3
  nodes are available: 3 Preemption is not helpful for scheduling.
source:
  component: default-scheduler
firstTimestamp: '2025-11-22T17:43:22Z'
lastTimestamp: '2025-11-22T17:43:22Z'
count: 1
type: Warning
eventTime: null
reportingComponent: default-scheduler
reportingInstance: ''
---
apiVersion: v1
kind: Event
metadata:
  name: pvc-a4919e2d-32ac-4aed-9f3d-a9c189e5e60c.187a650de6d88b55
  namespace: default
  uid: 3be65217-10dd-4c8b-aaa0-faf4155c4174
  resourceVersion: '55307693'
  creationTimestamp: '2025-11-22T17:41:54Z'
  selfLink: >-
    /api/v1/namespaces/default/events/pvc-a4919e2d-32ac-4aed-9f3d-a9c189e5e60c.187a650de6d88b55
spec: {}
involvedObject:
  kind: PersistentVolume
  name: pvc-a4919e2d-32ac-4aed-9f3d-a9c189e5e60c
  uid: 5e2fd3f0-f078-4586-98d1-9318c94fda83
  apiVersion: v1
  resourceVersion: '55307658'
reason: VolumeFailedDelete
message: >-
  persistentvolume pvc-a4919e2d-32ac-4aed-9f3d-a9c189e5e60c is still attached to
  node cloud-02
source:
  component: >-
    csi.hetzner.cloud_hcloud-csi-controller-567f89966f-4x2w5_2950c7a3-f602-4696-95f9-d014acc08356
firstTimestamp: '2025-11-22T17:41:54Z'
lastTimestamp: '2025-11-22T17:42:01Z'
count: 4
type: Warning
eventTime: null
reportingComponent: >-
  csi.hetzner.cloud_hcloud-csi-controller-567f89966f-4x2w5_2950c7a3-f602-4696-95f9-d014acc08356
reportingInstance: ''
---
apiVersion: v1
kind: Event
metadata:
  name: my-cloud-jf8z8-runner-dt6kp-work.187a650834cdd775
  namespace: arc-runners
  uid: 23fd02b0-1b22-428a-a57a-ef8468c47ec0
  resourceVersion: '55307476'
  creationTimestamp: '2025-11-22T17:41:29Z'
  selfLink: >-
    /api/v1/namespaces/arc-runners/events/my-cloud-jf8z8-runner-dt6kp-work.187a650834cdd775
spec: {}
involvedObject:
  kind: PersistentVolumeClaim
  namespace: arc-runners
  name: my-cloud-jf8z8-runner-dt6kp-work
  uid: bf39e236-c4aa-452d-873b-7539accc100f
  apiVersion: v1
  resourceVersion: '55307423'
reason: ProvisioningFailed
message: >-
  failed to provision volume with StorageClass "hcloud-volumes": rpc error: code
  = DeadlineExceeded desc = context deadline exceeded
source:
  component: >-
    csi.hetzner.cloud_hcloud-csi-controller-567f89966f-4x2w5_2950c7a3-f602-4696-95f9-d014acc08356
firstTimestamp: '2025-11-22T17:41:29Z'
lastTimestamp: '2025-11-22T17:41:29Z'
count: 1
type: Warning
eventTime: null
reportingComponent: >-
  csi.hetzner.cloud_hcloud-csi-controller-567f89966f-4x2w5_2950c7a3-f602-4696-95f9-d014acc08356
reportingInstance: ''

Additional information

No response

tcldr · Nov 22 '25 18:11

Hey, I am not familiar with the Actions Runner Controller, but I can take a look at it.

It feels like ARC is expecting the csi-driver to be faster here. Did you already experiment with any config parameters to allow for longer pod startup times? My gut feeling tells me this might be a limitation of the Volumes product/API and there isn't a lot we can do in the csi-driver. Nonetheless, I will see if I can reproduce this and whether there is something on our end.

Do you have any logs from the hcloud-csi-controller and hcloud-csi-node drivers? If you can provide some, please use the --all-containers flag, as the CSI Pods consist of multiple containers.
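For reference, a sketch of how the logs could be collected. The workload names and the kube-system namespace are assumptions based on a default install; adjust to your setup:

```shell
# Controller logs, all containers (provisioner, attacher, driver sidecars).
# Deployment name and namespace are assumptions; adjust for your install.
kubectl logs -n kube-system deployment/hcloud-csi-controller --all-containers --tail=500

# Node driver logs; this picks one pod of the DaemonSet, or use -l to select
# the pod running on the affected node (cloud-02 in the events above).
kubectl logs -n kube-system daemonset/hcloud-csi-node --all-containers --tail=500
```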

Best Regards, Lukas

lukasmetzner · Nov 25 '25 07:11

Hey, I took a look at this.

The hcloud-csi-driver consists of multiple containers. One of them is the external-provisioner, which watches PVC objects and issues gRPC calls to the hcloud-csi-controller to create/delete/... Hetzner Cloud Volumes. Before deleting, the provisioner performs a pre-check on whether the Volume is still attached to another node, and emits the VolumeFailedDelete event if it is. So this is not in our control and works as intended.
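For what it's worth, the attachment state that this pre-check reflects is visible through the cluster-scoped VolumeAttachment objects, so you can watch the detach happen (or not):

```shell
# Show which PVs are attached to which node; ATTACHED should flip to
# false/disappear before the PV can be deleted.
kubectl get volumeattachments.storage.k8s.io \
  -o custom-columns=PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached
```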

That said, you can see some additional events like this:

reason: ProvisioningFailed
message: >-
  failed to provision volume with StorageClass "hcloud-volumes": rpc error: code
  = DeadlineExceeded desc = context deadline exceeded

You might be able to reduce or eliminate these events by setting the external-provisioner --timeout flag to something like 45s to 60s. Currently we don't offer a way to do this in the Helm chart, but if it works for you, we might add it. Could you try this out and see if it works for your case?
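For anyone who wants to experiment before chart support lands, a hypothetical strategic-merge patch on the controller Deployment might look like the sketch below. The deployment name, namespace, and container name are assumptions about the chart layout, and note that patching a container's args replaces the list wholesale, so the existing flags must be repeated:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hcloud-csi-controller   # assumed name; check `kubectl get deploy`
  namespace: kube-system        # assumed namespace
spec:
  template:
    spec:
      containers:
        - name: csi-provisioner # assumed sidecar container name
          args:
            # repeat the container's existing args here, then add:
            - --timeout=60s     # external-provisioner default is 15s
```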

lukasmetzner · Nov 25 '25 11:11

Hi @lukasmetzner, thanks for taking a look at this. Since writing up the issue I've moved to the OpenEBS Local PV provisioner, which has cleared up the errors completely. For ARC, it's likely a better fit anyway.

That said, I've noticed these errors appearing frequently on hcloud volumes – not just those used by ARC – particularly when an app that depends on them rolls out a new version and restarts a series of pods. I think the volume churn of the ARC controller just makes the issue harder to miss, but I'll try to capture any new logs that come up and attach them to this issue.

I've attached a copy of my Local PV Helm repo/chart values that behave with ARC as I'd expect. I don't have any detailed knowledge of how CSI drivers work under the hood, but perhaps it might help to identify differences in the attach/detach behaviour.

open-ebs.yaml

EDIT: With ARC I'm actually using the 'openebs-hostpath' storage class, which is installed by the chart, so you can ignore the one in the yaml.

containerMode:
  type: kubernetes
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: openebs-hostpath
    resources:
      requests:
        storage: 1Gi

tcldr · Nov 25 '25 13:11