
Safeguard against outdated `/dev/disk/by-id/` symlinks that can lead Pod to mount the wrong volume


/sig storage
/kind bug

What happened?

For NVMe volumes, the aws-ebs-csi-driver relies on the /dev/disk/by-id/ symlink to determine which NVMe device is attached for a given volume ID.

https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/0de2586eab537209dcddb8150db52b2409c996cd/pkg/driver/node_linux.go#L83-L88

This symlink is updated by udev rules that react to kernel attach/detach events.
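
For illustration, here is a minimal sketch in Go of the lookup described above (not the driver's exact code; it assumes the usual udev naming convention for EBS NVMe volumes, i.e. the volume ID with its dash stripped):

    // findnvme.go: resolve an EBS volume ID to the NVMe device that the
    // udev-managed symlink currently points to.
    package main

    import (
        "fmt"
        "path/filepath"
        "strings"
    )

    func findNVMeDevicePath(volumeID string) (string, error) {
        // e.g. vol-0123abc -> /dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol0123abc
        link := filepath.Join("/dev/disk/by-id",
            "nvme-Amazon_Elastic_Block_Store_"+strings.ReplaceAll(volumeID, "-", ""))
        // If udev has not yet processed the latest attach/detach events,
        // the link target can be stale and name the wrong device.
        return filepath.EvalSymlinks(link) // e.g. /dev/nvme3n1
    }

    func main() {
        dev, err := findNVMeDevicePath("vol-0123456789abcdef0")
        fmt.Println(dev, err)
    }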

We also know that during Pod restarts volumes can be detached and quickly reattached at a different location (e.g. /dev/nvme3n1 now, but /dev/nvme4n1 after a detach/attach cycle).

A known cloud-init bug causes udev rules to be processed with a huge delay. To demonstrate its impact, let's compare the udevadm monitor -s block output on a "healthy" and an "affected" Node:

  • "healthy" Node

    $ udevadm monitor -s block
    monitor will print the received events for:
    UDEV - the event which udev sends out after rule processing
    KERNEL - the kernel uevent
    
    KERNEL[735.827384] change   /devices/pci0000:00/0000:00:1d.0/nvme/nvme3/nvme3n1 (block)
    UDEV  [735.850150] change   /devices/pci0000:00/0000:00:1d.0/nvme/nvme3/nvme3n1 (block)
    KERNEL[737.628955] change   /devices/pci0000:00/0000:00:1c.0/nvme/nvme4/nvme4n1 (block)
    KERNEL[737.657828] change   /devices/pci0000:00/0000:00:1e.0/nvme/nvme2/nvme2n1 (block)
    UDEV  [737.681569] change   /devices/pci0000:00/0000:00:1c.0/nvme/nvme4/nvme4n1 (block)
    UDEV  [737.695555] change   /devices/pci0000:00/0000:00:1e.0/nvme/nvme2/nvme2n1 (block)
    KERNEL[738.219222] change   /devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1 (block)
    UDEV  [738.246151] change   /devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1 (block)
    
  • "affected" Node

    $ udevadm monitor -s block
    monitor will print the received events for:
    UDEV - the event which udev sends out after rule processing
    KERNEL - the kernel uevent
    
    KERNEL[1078.716035] change   /devices/pci0000:00/0000:00:1c.0/nvme/nvme4/nvme4n1 (block)
    KERNEL[1078.920058] change   /devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1 (block)
    KERNEL[1079.209415] change   /devices/pci0000:00/0000:00:1e.0/nvme/nvme2/nvme2n1 (block)
    KERNEL[1088.412335] change   /devices/pci0000:00/0000:00:1d.0/nvme/nvme3/nvme3n1 (block)
    UDEV  [1122.943263] change   /devices/pci0000:00/0000:00:1c.0/nvme/nvme4/nvme4n1 (block)
    UDEV  [1122.982514] change   /devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1 (block)
    UDEV  [1123.093665] change   /devices/pci0000:00/0000:00:1e.0/nvme/nvme2/nvme2n1 (block)
    UDEV  [1148.605915] change   /devices/pci0000:00/0000:00:1d.0/nvme/nvme3/nvme3n1 (block)
    

In such a case, aws-ebs-csi-driver can use an outdated /dev/disk/by-id/ symlink (one that, for example, still points to /dev/nvme3n1 for vol-1 although vol-1 is already attached as /dev/nvme4n1) and consequently mount the wrong NVMe device into the Pod.

What you expected to happen?

A safeguarding mechanism should exist in aws-ebs-csi-driver. We assume that the creation timestamp of the device node (e.g. /dev/nvme4n1) reflects the attachment time. The driver can ensure that the /dev/disk/by-id/ symlink's timestamp is greater than the device's timestamp; otherwise the device was attached AFTER the symlink was created (by udev), i.e. the symlink is outdated.
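
A minimal sketch of such a check, assuming the device node's mtime reflects the attach time and the symlink's own (lstat) timestamp reflects when udev last (re)created it:

    // checksymlink.go: reject a /dev/disk/by-id/ symlink that is older
    // than the device node it points to.
    package main

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    func checkSymlinkFreshness(link string) error {
        linkInfo, err := os.Lstat(link) // Lstat: metadata of the link itself
        if err != nil {
            return err
        }
        dev, err := filepath.EvalSymlinks(link)
        if err != nil {
            return err
        }
        devInfo, err := os.Stat(dev) // Stat: metadata of the device node
        if err != nil {
            return err
        }
        if linkInfo.ModTime().Before(devInfo.ModTime()) {
            return fmt.Errorf("%s predates %s: udev has not processed the latest attach event yet", link, dev)
        }
        return nil
    }

    func main() {
        // Pass a /dev/disk/by-id/... path as the first argument.
        if err := checkSymlinkFreshness(os.Args[1]); err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
    }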

How to reproduce it (as minimally and precisely as possible)?

  1. Create a single Node cluster with a Linux distro that is affected by the cloud-init bug.

  2. Create 4 StatefulSets and 1 pause Deployment with 30 replicas (the pause Deployment is needed to trigger the cloud-init bug).

    The manifests:
    apiVersion: v1
    kind: Service
    metadata:
      name: app1
      labels:
        app: app1
    spec:
      ports:
      - port: 80
        name: web
      clusterIP: None
      selector:
        app: app1
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: app1
    spec:
      serviceName: app1
      replicas: 1
      selector:
        matchLabels:
          app: app1
      template:
        metadata:
          labels:
            app: app1
        spec:
          containers:
            - name: app1
              image: centos
              command: ["/bin/sh"]
              args: ["-c", "while true; do echo $HOSTNAME $(date -u) >> /data/out.txt; sleep 5; done"]
              volumeMounts:
              - name: persistent-storage-app1
                mountPath: /data
      volumeClaimTemplates:
      - metadata:
          name: persistent-storage-app1
        spec:
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 1Gi
    
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: app2
      labels:
        app: app2
    spec:
      ports:
      - port: 80
        name: web
      clusterIP: None
      selector:
        app: app2
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: app2
    spec:
      serviceName: app2
      replicas: 1
      selector:
        matchLabels:
          app: app2
      template:
        metadata:
          labels:
            app: app2
        spec:
          containers:
            - name: app2
              image: centos
              command: ["/bin/sh"]
              args: ["-c", "while true; do echo $HOSTNAME $(date -u) >> /data/out.txt; sleep 5; done"]
              volumeMounts:
              - name: persistent-storage-app2
                mountPath: /data
      volumeClaimTemplates:
      - metadata:
          name: persistent-storage-app2
        spec:
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 2Gi
    
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: app3
      labels:
        app: app3
    spec:
      ports:
      - port: 80
        name: web
      clusterIP: None
      selector:
        app: app3
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: app3
    spec:
      serviceName: app3
      replicas: 1
      selector:
        matchLabels:
          app: app3
      template:
        metadata:
          labels:
            app: app3
        spec:
          containers:
            - name: app3
              image: centos
              command: ["/bin/sh"]
              args: ["-c", "while true; do echo $HOSTNAME $(date -u) >> /data/out.txt; sleep 5; done"]
              volumeMounts:
              - name: persistent-storage-app3
                mountPath: /data
      volumeClaimTemplates:
      - metadata:
          name: persistent-storage-app3
        spec:
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 3Gi
    
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: app4
      labels:
        app: app4
    spec:
      ports:
      - port: 80
        name: web
      clusterIP: None
      selector:
        app: app4
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: app4
    spec:
      serviceName: app4
      replicas: 1
      selector:
        matchLabels:
          app: app4
      template:
        metadata:
          labels:
            app: app4
        spec:
          containers:
            - name: app4
              image: centos
              command: ["/bin/sh"]
              args: ["-c", "while true; do echo $HOSTNAME $(date -u) >> /data/out.txt; sleep 5; done"]
              volumeMounts:
              - name: persistent-storage-app4
                mountPath: /data
      volumeClaimTemplates:
      - metadata:
          name: persistent-storage-app4
        spec:
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 4Gi
    
    $ k create deploy pause --image busybox --replicas 30 -- sh -c "sleep 100d"
    
Restart the 4 StatefulSets and the Deployment:

    $ k rollout restart deploy pause; k rollout restart sts app1 app2 app3 app4
    
Check whether Pods mount the wrong PV:

The following command should list the volumes in increasing order of size (i.e. app1-0 has to mount the 1Gi volume, app2-0 the 2Gi volume, and so on):

    $ for pod in app1-0 app2-0 app3-0 app4-0; do k exec $pod -- df -h /data; done
    

On an affected Node, in some cases the order is wrong and a Pod mounts the wrong PV:

    $ for pod in app1-0 app2-0 app3-0 app4-0; do k exec $pod -- df -h /data; done
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/nvme3n1    2.0G  6.1M  1.9G   1% /data
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/nvme1n1    976M  2.6M  958M   1% /data
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/nvme4n1    2.9G  9.1M  2.9G   1% /data
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/nvme2n1    3.9G   17M  3.8G   1% /data
    

In this case, Pod app1-0 has mounted app2-0's volume:

    $ k exec app1-0 -- tail /data/out.txt
    app2-0 Mon Apr 4 13:12:42 UTC 2022
    app2-0 Mon Apr 4 13:12:47 UTC 2022
    app2-0 Mon Apr 4 13:12:52 UTC 2022
    app2-0 Mon Apr 4 13:12:57 UTC 2022
    app2-0 Mon Apr 4 13:13:02 UTC 2022
    app2-0 Mon Apr 4 13:13:07 UTC 2022
    app2-0 Mon Apr 4 13:13:12 UTC 2022
    app2-0 Mon Apr 4 13:13:17 UTC 2022
    app1-0 Mon Apr 4 13:13:56 UTC 2022
    app1-0 Mon Apr 4 13:14:01 UTC 2022
    app1-0 Mon Apr 4 13:14:06 UTC 2022
    app1-0 Mon Apr 4 13:14:11 UTC 2022
    app1-0 Mon Apr 4 13:14:16 UTC 2022
    app1-0 Mon Apr 4 13:14:21 UTC 2022
    app1-0 Mon Apr 4 13:14:26 UTC 2022
    

Environment

  • Kubernetes version (use kubectl version): v1.21.10
  • Driver version: v1.5.0

Credits to @dguendisch for all of the investigations and the safeguarding suggestion!

ialidzhikov (May 02 '22)

We are seeing a similar issue on our end, but we are not sure it's related to the cloud-init bug, since our nodes run cloud-init version 19.3-45.amzn2.

In our case the node's root volume is mounted instead of the actual volume on Pod restart.

It has already caused at least two production outages, and we have had no luck replicating it in our development environments.

We are using c5d.18xlarge nodes on EKS v1.21 with EBS CSI driver v1.6.1.

Edit:

I think our issue is more closely related to https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1166.

RicardsRikmanis (May 23 '22)

> In our case the node's root volume is mounted instead of the actual volume on Pod restart.

We hit the same issue on our side a lot of times. We believe the root cause of this issue is fixed by https://github.com/kubernetes/kubernetes/pull/100183 (also backported to release-1.21 and present in K8s 1.21.9+). Hence, I would recommend upgrading to 1.21.9+. We haven't hit this issue since we upgraded to 1.21.10. Our monitoring colleagues also implemented alerting for PVCs affected by this issue (the actual volume size not matching the PVC size in the spec); it helps us get notified right away when the issue occurs.
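
For reference, a minimal sketch of the mismatch check behind such an alert, run inside the Pod against the mount path (the path and expected size here are assumptions; in practice the check is usually built on kubelet volume-stats metrics):

    // pvccheck.go: flag a gross mismatch between a PVC's requested size
    // and the filesystem actually mounted at the given path (Linux only).
    package main

    import (
        "fmt"
        "os"
        "syscall"
    )

    func main() {
        path := "/data"            // mount path from the manifests above
        expected := int64(1) << 30 // e.g. app1's 1Gi PVC request

        var st syscall.Statfs_t
        if err := syscall.Statfs(path, &st); err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        actual := st.Bsize * int64(st.Blocks)

        // Filesystem overhead makes the usable size a bit smaller than the
        // device, so only flag gross mismatches (i.e. the wrong volume).
        if actual < expected/2 || actual > expected*3/2 {
            fmt.Printf("PVC size mismatch: expected ~%d bytes, filesystem has %d\n", expected, actual)
            os.Exit(2)
        }
    }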

ialidzhikov (May 25 '22)

Thanks for the info, that shed a lot of light on our issue!

We are at the mercy of AWS with regard to Kubernetes patch upgrades. We are now waiting until AWS EKS rolls out 1.21.10/eks.7, and we will see if it also helps us.

RicardsRikmanis (May 26 '22)

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot (Aug 24 '22)

/remove-lifecycle stale

ialidzhikov (Aug 24 '22)

/lifecycle stale

k8s-triage-robot (Nov 22 '22)

/remove-lifecycle stale

ialidzhikov (Nov 22 '22)

/lifecycle stale

k8s-triage-robot (Feb 20 '23)

/lifecycle rotten

k8s-triage-robot (Mar 22 '23)

/remove-lifecycle rotten

ialidzhikov (Mar 22 '23)

/lifecycle stale

k8s-triage-robot (Jun 20 '23)

/lifecycle rotten

k8s-triage-robot (Jul 20 '23)

/remove-lifecycle rotten

ialidzhikov (Oct 04 '23)