aws-efs-csi-driver
Pods mounting EFS-CSI-driver-based volumes stuck in ContainerCreating for a long time because EFS volumes fail to mount (kubelet error "Unable to attach or mount volumes" [...] "timed out waiting for the condition")
Hi, we are using the EFS CSI driver (currently version 1.3.2) to provision EFS-based volumes to our workloads. One of our clusters is currently suffering from a situation where freshly deployed pods that mount such volumes are stuck in ContainerCreating (or "Init:0/<numInitContainers>" for pods with init containers) for a very long time. Pods that are part of the same workload but do not mount EFS volumes are not affected, so this is 99.9% certain to be related to the EFS CSI driver.
This is how the (somewhat anonymized) workload presents itself when it is in that stuck state:
$ kubectl get pods -n mynamespace -o wide
NAME READY STATUS RESTARTS AGE IP NODE
abc-0 0/1 ContainerCreating 0 14m <none> ip-10-0-138-106.eu-central-1.compute.internal
abc-1 0/1 ContainerCreating 0 14m <none> ip-10-0-142-147.eu-central-1.compute.internal
foobar-0 0/1 ContainerCreating 0 14m <none> ip-10-0-138-106.eu-central-1.compute.internal
foobar-1 0/1 ContainerCreating 0 14m <none> ip-10-0-146-19.eu-central-1.compute.internal
job-without-efs-mounts-43611-jzqth 1/1 Running 0 14m 10.0.145.110 ip-10-0-145-28.eu-central-1.compute.internal
other-job-without-efs-mounts-43611-6qfbv 1/1 Running 0 14m 10.0.147.196 ip-10-0-145-28.eu-central-1.compute.internal
lala-default-0 0/1 Init:0/1 0 14m <none> ip-10-0-142-147.eu-central-1.compute.internal
meme-default-0 0/1 Init:0/2 0 14m <none> ip-10-0-145-28.eu-central-1.compute.internal
[...]
meme-default-2 0/1 Init:0/2 0 14m <none> ip-10-0-143-249.eu-central-1.compute.internal
As an example, these are the events for the pod meme-default-2 while it is in this state (note that the one volume that does attach immediately and without problems is an EBS volume, handled by the EBS CSI driver):
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 24m default-scheduler 0/7 nodes are available: 1 node(s) had taint {foo: bar}, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 4 Insufficient cpu.
Normal TriggeredScaleUp 24m cluster-autoscaler pod triggered scale-up: [{eks-xxx-default2-integral-thrush-7cc12cf7-ecbb-2a1c-fb85-efbf1546dc08 1->2 (max: 99)}]
Warning FailedScheduling 23m (x2 over 23m) default-scheduler 0/8 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 1 node(s) had taint {foo: bar}, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 4 Insufficient cpu.
Warning FailedScheduling 22m (x2 over 23m) default-scheduler 0/8 nodes are available: 1 node(s) had taint {foo: bar}, that the pod didn't tolerate, 3 node(s) had volume node affinity conflict, 4 Insufficient cpu.
Normal Scheduled 22m default-scheduler Successfully assigned mynamespace/meme-default-3 to ip-10-0-143-249.eu-central-1.compute.internal
Normal SuccessfulAttachVolume 22m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-dcc8cc2b-8729-5378-080d-c1639e0e8a1e"
Warning FailedMount 16m kubelet Unable to attach or mount volumes: unmounted volumes=[some-efs-volume], unattached volumes=[tmp feature-toggle data-volume some-efs-volume kube-api-access-prc4n]: timed out waiting for the condition
Warning FailedMount 13m (x2 over 18m) kubelet Unable to attach or mount volumes: unmounted volumes=[some-efs-volume], unattached volumes=[feature-toggle data-volume some-efs-volume kube-api-access-prc4n tmp]: timed out waiting for the condition
Warning FailedMount 2m39s kubelet Unable to attach or mount volumes: unmounted volumes=[some-efs-volume], unattached volumes=[some-efs-volume kube-api-access-prc4n tmp feature-toggle data-volume]: timed out waiting for the condition
Warning FailedAttachVolume 2m35s (x9 over 20m) attachdetach-controller AttachVolume.Attach failed for volume "pvc-8ff25a34-6018-47a6-b11d-818cb39af55f" : Attach timeout for volume fs-0aaeaaaaaaaaaa::fsap-095aaaaaaaaaaaa
Warning FailedMount 25s (x6 over 20m) kubelet Unable to attach or mount volumes: unmounted volumes=[some-efs-volume], unattached volumes=[kube-api-access-prc4n tmp feature-toggle data-volume some-efs-volume]: timed out waiting for the condition
Note that in this example, the cluster autoscaler did perform a scale-up, but the issue also occurs on pods scheduled on already existing nodes. So I don't think that the autoscaler is involved in the problem.
The EFS CSI node pod on the node where the above pod is scheduled logs no obvious errors (at least to someone not familiar with the inner workings of the EFS CSI driver):
$ kubectl logs -n management efs-csi-node-2kqrl efs-plugin
I0915 14:24:39.398396 1 config_dir.go:87] Creating symlink from '/etc/amazon/efs' to '/var/amazon/efs'
I0915 14:24:39.401588 1 mount_linux.go:173] Cannot run systemd-run, assuming non-systemd OS
I0915 14:24:39.401601 1 driver.go:140] Did not find any input tags.
I0915 14:24:39.401706 1 driver.go:113] Registering Node Server
I0915 14:24:39.401715 1 driver.go:115] Registering Controller Server
I0915 14:24:39.401722 1 driver.go:118] Starting watchdog
I0915 14:24:39.401762 1 efs_watch_dog.go:209] Copying /etc/amazon/efs/efs-utils.conf since it doesn't exist
I0915 14:24:39.401814 1 efs_watch_dog.go:209] Copying /etc/amazon/efs/efs-utils.crt since it doesn't exist
I0915 14:24:39.402607 1 driver.go:124] Staring subreaper
I0915 14:24:39.402622 1 driver.go:127] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
$ kubectl logs -n management efs-csi-node-2kqrl csi-driver-registrar
I0915 14:24:45.939706 1 main.go:113] Version: v2.1.0-0-g80d42f24
I0915 14:24:45.940147 1 main.go:137] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0915 14:24:45.940161 1 connection.go:153] Connecting to unix:///csi/csi.sock
I0915 14:24:45.940473 1 main.go:144] Calling CSI driver to discover driver name
I0915 14:24:45.942572 1 main.go:154] CSI driver name: "efs.csi.aws.com"
I0915 14:24:45.942594 1 node_register.go:52] Starting Registration Server at: /registration/efs.csi.aws.com-reg.sock
I0915 14:24:45.942679 1 node_register.go:61] Registration Server started at: /registration/efs.csi.aws.com-reg.sock
I0915 14:24:45.942721 1 node_register.go:83] Skipping healthz server because HTTP endpoint is set to: ""
I0915 14:24:46.371532 1 main.go:80] Received GetInfo call: &InfoRequest{}
I0915 14:24:46.402238 1 main.go:90] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
Eventually, the attaching/mounting of the EFS volumes will succeed; this can take 10-15 minutes, but sometimes hours. Usually, when the mounting works, it works for all pods that are currently stuck. But the problem is not gone - when I later scale up a workload (or have a new pod launched by, e.g., a cronjob), these new pods will often be stuck again. For example, here the pods of a cronjob (running once an hour) have not started for more than two hours because of this problem, and scaling the "meme" workload up to 4 instances leaves the new pod No. 3 stuck again:
[...]
some-cronjob-27720780-ltkmt 0/1 ContainerCreating 0 133m <none> ip-10-0-147-174.eu-central-1.compute.internal <none> <none>
some-cronjob-27720840-h9zcw 0/1 ContainerCreating 0 73m <none> ip-10-0-147-75.eu-central-1.compute.internal <none> <none>
some-cronjob-27720900-fgmmm 0/1 ContainerCreating 0 13m <none> ip-10-0-147-174.eu-central-1.compute.internal <none> <none>
[...]
meme-default-0 1/1 Running 0 3h1m 10.0.146.62 ip-10-0-145-28.eu-central-1.compute.internal <none> <none>
meme-default-1 1/1 Running 0 163m 10.0.144.232 ip-10-0-147-174.eu-central-1.compute.internal <none> <none>
meme-default-2 1/1 Running 0 160m 10.0.146.237 ip-10-0-147-75.eu-central-1.compute.internal <none> <none>
meme-default-3 0/1 Init:0/2 0 49m <none> ip-10-0-143-249.eu-central-1.compute.internal <none> <none>
Restarting the EFS CSI driver pods (both the efs-csi-node DaemonSet and the efs-csi-controller Deployment) sometimes seemed to help; currently it doesn't. Restarting all nodes temporarily fixed it, but the problem later reoccurs.
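For reference, these are the restart commands we use (a sketch; our driver components live in the "management" namespace, and the DaemonSet/Deployment names are the Helm chart defaults):
$ # Restart the node DaemonSet and the controller Deployment
$ kubectl rollout restart daemonset/efs-csi-node -n management
$ kubectl rollout restart deployment/efs-csi-controller -n management
$ # Wait until the node pods are back
$ kubectl rollout status daemonset/efs-csi-node -n management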
I mentioned that we are observing this in only one cluster at this time. What separates this cluster from the others is that only on this cluster do we have a high "workload churn" - the cluster runs several deployments of our application in different namespaces, which are refreshed (i.e., deleted and recreated) several times a day. This deletion includes the EFS-based volumes (we implicitly delete their PVCs by deleting the namespace). The storage class we use for dynamic provisioning has its reclaim policy set to Delete, so the PVs are also deleted, as are the associated EFS access points. On most of our other clusters, we create deployments and then use them for a longer period of time, only performing minor changes (e.g., rolling out patches) but keeping the EFS volumes.
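To illustrate the setup: the storage class we use for dynamic provisioning looks roughly like this (a minimal sketch; the class name and file system ID are placeholders, and the parameters follow the driver's documented dynamic-provisioning example):
$ cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc                          # placeholder name
provisioner: efs.csi.aws.com
reclaimPolicy: Delete                   # PVs (and their access points) are deleted with the PVC
parameters:
  provisioningMode: efs-ap              # dynamic provisioning via EFS access points
  fileSystemId: fs-0123456789abcdef0    # placeholder file system ID
  directoryPerms: "700"
EOF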
I am having the same problem. EFS driver version 1.4.2, cluster and nodes on Kubernetes version 1.19.
@jgoeres Did you find any solution?
Thanks
Me too
Try setting resource requests for the containers. I haven't seen this error for quite a while after adding them. https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues/325#issuecomment-948639653
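If you want to try this without touching the Helm values, a quick sketch using kubectl set resources (container name as used by the chart's DaemonSet; adjust the namespace to wherever the driver runs):
$ kubectl -n kube-system set resources daemonset/efs-csi-node \
    -c efs-plugin --requests=cpu=100m,memory=128Mi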
We have experienced the same problem on one of our clusters with high workload. We had already set up resource requests, but it doesn't help. EFS driver version: 1.4.0, k8s version: 1.21
We have the same issue. EKS: 1.21.14, EFS driver version: 1.4.0
Seeing the same problem. EKS version: 1.21, EFS driver version: 1.4.0
This issue might be resolved by upgrading to the latest driver version, v1.4.9. In v1.4.8, we fixed a concurrency issue with efs-utils that could cause this to happen.
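For reference, the Helm-based upgrade looks roughly like this (release name and namespace are assumptions; adjust them to your installation):
$ helm repo add aws-efs-csi-driver https://kubernetes-sigs.github.io/aws-efs-csi-driver/
$ helm repo update
$ helm upgrade --install aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver -n kube-system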
If anyone runs into this again, can you please follow the troubleshooting guide to enable efs-utils debug logging, execute the log collector script, and then post any relevant errors from the mount.log file? This file contains the logs for efs-utils, which does the actual mounting "under the hood" of the CSI driver.
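If you just want a quick look without running the full log collector, the efs-utils logs can also be dumped directly from the node pod (pod name and namespace below are examples; /var/log/amazon/efs is where efs-utils writes its logs):
$ kubectl exec -n kube-system efs-csi-node-xxxxx -c efs-plugin -- \
    cat /var/log/amazon/efs/mount.log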
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
I'm noticing this problem on EFS CSI v1.5.6.
Pod event error:
Warning FailedAttachVolume 107s (x9 over 19m) attachdetach-controller AttachVolume.Attach failed for volume "pvc-8a00b9f5-58e0-4e2d-a294-8a9c45e57a1a" : timed out waiting for external-attacher of efs.csi.aws.com CSI driver to attach volume fs-2a825351::fsap-0d75583a12ada3174
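Since the error comes from the external-attacher timeout, it can also help to inspect the VolumeAttachment objects the attachdetach-controller is waiting on (a generic Kubernetes check, not specific to this driver; substitute the name from the listing):
$ # Check whether the attach operation for the stuck PV ever reports ATTACHED=true
$ kubectl get volumeattachments
$ kubectl describe volumeattachment <name-from-listing>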
These are the log dumps from the log_collector.py tool.
driver_info
kubectl describe pod efs-csi-node-w4sl9 -n kube-system
Name: efs-csi-node-w4sl9
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: efs-csi-node-sa
Node: ip-10-116-161-48.us-east-2.compute.internal/10.116.161.48
Start Time: Thu, 15 Jun 2023 20:05:42 -0500
Labels: app=efs-csi-node
app.kubernetes.io/instance=efs-csi-awscmh2
app.kubernetes.io/name=aws-efs-csi-driver
controller-revision-hash=7dbf8cbdd4
pod-template-generation=7
Annotations: apps.indeed.com/ship-logs: true
kubernetes.io/psp: privileged
vpaObservedContainers: efs-plugin, csi-driver-registrar, liveness-probe
vpaUpdates:
Pod resources updated by efs-csi-node: container 0: cpu request, memory request; container 1: cpu request, memory request; container 2: cp...
Status: Running
IP: 10.116.161.48
IPs:
IP: 10.116.161.48
Controlled By: DaemonSet/efs-csi-node
Containers:
efs-plugin:
Container ID: containerd://13db8a2a7ac72c870487495ec95aa197767b056c2d65baab0a5be42b17a37cd1
Image: harbor.indeed.tech/dockerhub-proxy/amazon/aws-efs-csi-driver:v1.5.6
Image ID: harbor.indeed.tech/dockerhub-proxy/amazon/aws-efs-csi-driver@sha256:cba55174d2df13e9939a83b9d71e8b74f6a27ada2e44252ac80136e33a992d6e
Port: 9809/TCP
Host Port: 9809/TCP
Args:
--endpoint=$(CSI_ENDPOINT)
--logtostderr
--v=5
--vol-metrics-opt-in=false
--vol-metrics-refresh-period=240
--vol-metrics-fs-rate-limit=5
State: Running
Started: Thu, 15 Jun 2023 20:05:48 -0500
Ready: True
Restart Count: 0
Requests:
cpu: 100m
memory: 128Mi
Liveness: http-get http://:healthz/healthz delay=10s timeout=3s period=2s #success=1 #failure=5
Environment:
CSI_ENDPOINT: unix:/csi/csi.sock
Mounts:
/csi from plugin-dir (rw)
/etc/amazon/efs-legacy from efs-utils-config-legacy (rw)
/var/amazon/efs from efs-utils-config (rw)
/var/lib/kubelet from kubelet-dir (rw)
/var/run/efs from efs-state-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xw45q (ro)
csi-driver-registrar:
Container ID: containerd://00a9ea19ed72327e5f808bd87a408f81629c5e86abc8e103773006308eba5f98
Image: public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar:v2.8.0-eks-1-27-3
Image ID: public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar@sha256:74e13dfff1d73b0e39ae5883b5843d1672258b34f7d4757995c72d92a26bed1e
Port: <none>
Host Port: <none>
Args:
--csi-address=$(ADDRESS)
--kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)
--v=5
State: Running
Started: Thu, 15 Jun 2023 20:05:49 -0500
Ready: True
Restart Count: 0
Requests:
cpu: 100m
memory: 128Mi
Environment:
ADDRESS: /csi/csi.sock
DRIVER_REG_SOCK_PATH: /var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock
KUBE_NODE_NAME: (v1:spec.nodeName)
Mounts:
/csi from plugin-dir (rw)
/registration from registration-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xw45q (ro)
liveness-probe:
Container ID: containerd://c9e7ab896df75b1249cbbf489adf8fe31d57e2caaf69d49b71a24c3a25858e39
Image: public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe:v2.10.0-eks-1-27-3
Image ID: public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe@sha256:25b4d3f9cf686ac464a742ead16e705da3adcfe574296dd75c5c05ec7473a513
Port: <none>
Host Port: <none>
Args:
--csi-address=/csi/csi.sock
--health-port=9809
--v=5
State: Running
Started: Thu, 15 Jun 2023 20:05:50 -0500
Ready: True
Restart Count: 0
Requests:
cpu: 100m
memory: 128Mi
Environment: <none>
Mounts:
/csi from plugin-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xw45q (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kubelet-dir:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet
HostPathType: Directory
plugin-dir:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/plugins/efs.csi.aws.com/
HostPathType: DirectoryOrCreate
registration-dir:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/plugins_registry/
HostPathType: Directory
efs-state-dir:
Type: HostPath (bare host directory volume)
Path: /var/run/efs
HostPathType: DirectoryOrCreate
efs-utils-config:
Type: HostPath (bare host directory volume)
Path: /var/amazon/efs
HostPathType: DirectoryOrCreate
efs-utils-config-legacy:
Type: HostPath (bare host directory volume)
Path: /etc/amazon/efs
HostPathType: DirectoryOrCreate
kube-api-access-xw45q:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/network-unavailable:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events: <none>
kubectl get pod efs-csi-node-w4sl9 -n kube-system -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
apps.indeed.com/ship-logs: "true"
kubernetes.io/psp: privileged
vpaObservedContainers: efs-plugin, csi-driver-registrar, liveness-probe
vpaUpdates: 'Pod resources updated by efs-csi-node: container 0: cpu request,
memory request; container 1: cpu request, memory request; container 2: cpu request,
memory request'
creationTimestamp: "2023-06-16T01:05:42Z"
generateName: efs-csi-node-
labels:
app: efs-csi-node
app.kubernetes.io/instance: efs-csi-awscmh2
app.kubernetes.io/name: aws-efs-csi-driver
controller-revision-hash: 7dbf8cbdd4
pod-template-generation: "7"
name: efs-csi-node-w4sl9
namespace: kube-system
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: DaemonSet
name: efs-csi-node
uid: aa1527ec-97b6-498c-a21d-9a642d26c242
resourceVersion: "2386821689"
uid: eccdbf2a-3285-4adc-8ad2-c7ba68c33f02
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchFields:
- key: metadata.name
operator: In
values:
- ip-10-116-161-48.us-east-2.compute.internal
containers:
- args:
- --endpoint=$(CSI_ENDPOINT)
- --logtostderr
- --v=5
- --vol-metrics-opt-in=false
- --vol-metrics-refresh-period=240
- --vol-metrics-fs-rate-limit=5
env:
- name: CSI_ENDPOINT
value: unix:/csi/csi.sock
image: harbor.indeed.tech/dockerhub-proxy/amazon/aws-efs-csi-driver:v1.5.6
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 5
httpGet:
path: /healthz
port: healthz
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 2
successThreshold: 1
timeoutSeconds: 3
name: efs-plugin
ports:
- containerPort: 9809
hostPort: 9809
name: healthz
protocol: TCP
resources:
requests:
cpu: 100m
memory: 128Mi
securityContext:
privileged: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/lib/kubelet
mountPropagation: Bidirectional
name: kubelet-dir
- mountPath: /csi
name: plugin-dir
- mountPath: /var/run/efs
name: efs-state-dir
- mountPath: /var/amazon/efs
name: efs-utils-config
- mountPath: /etc/amazon/efs-legacy
name: efs-utils-config-legacy
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-xw45q
readOnly: true
- args:
- --csi-address=$(ADDRESS)
- --kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)
- --v=5
env:
- name: ADDRESS
value: /csi/csi.sock
- name: DRIVER_REG_SOCK_PATH
value: /var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock
- name: KUBE_NODE_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: spec.nodeName
image: public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar:v2.8.0-eks-1-27-3
imagePullPolicy: IfNotPresent
name: csi-driver-registrar
resources:
requests:
cpu: 100m
memory: 128Mi
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /csi
name: plugin-dir
- mountPath: /registration
name: registration-dir
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-xw45q
readOnly: true
- args:
- --csi-address=/csi/csi.sock
- --health-port=9809
- --v=5
image: public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe:v2.10.0-eks-1-27-3
imagePullPolicy: IfNotPresent
name: liveness-probe
resources:
requests:
cpu: 100m
memory: 128Mi
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /csi
name: plugin-dir
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-xw45q
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
hostNetwork: true
nodeName: ip-10-116-161-48.us-east-2.compute.internal
nodeSelector:
kubernetes.io/os: linux
preemptionPolicy: PreemptLowerPriority
priority: 2000001000
priorityClassName: system-node-critical
restartPolicy: Always
schedulerName: default-scheduler
securityContext:
fsGroup: 0
runAsGroup: 0
runAsNonRoot: false
runAsUser: 0
serviceAccount: efs-csi-node-sa
serviceAccountName: efs-csi-node-sa
terminationGracePeriodSeconds: 30
tolerations:
- operator: Exists
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/disk-pressure
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/memory-pressure
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/pid-pressure
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/unschedulable
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/network-unavailable
operator: Exists
volumes:
- hostPath:
path: /var/lib/kubelet
type: Directory
name: kubelet-dir
- hostPath:
path: /var/lib/kubelet/plugins/efs.csi.aws.com/
type: DirectoryOrCreate
name: plugin-dir
- hostPath:
path: /var/lib/kubelet/plugins_registry/
type: Directory
name: registration-dir
- hostPath:
path: /var/run/efs
type: DirectoryOrCreate
name: efs-state-dir
- hostPath:
path: /var/amazon/efs
type: DirectoryOrCreate
name: efs-utils-config
- hostPath:
path: /etc/amazon/efs
type: DirectoryOrCreate
name: efs-utils-config-legacy
- name: kube-api-access-xw45q
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2023-06-16T01:05:42Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2023-06-16T01:05:51Z"
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2023-06-16T01:05:51Z"
status: "True"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2023-06-16T01:05:42Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: containerd://00a9ea19ed72327e5f808bd87a408f81629c5e86abc8e103773006308eba5f98
image: public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar:v2.8.0-eks-1-27-3
imageID: public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar@sha256:74e13dfff1d73b0e39ae5883b5843d1672258b34f7d4757995c72d92a26bed1e
lastState: {}
name: csi-driver-registrar
ready: true
restartCount: 0
started: true
state:
running:
startedAt: "2023-06-16T01:05:49Z"
- containerID: containerd://13db8a2a7ac72c870487495ec95aa197767b056c2d65baab0a5be42b17a37cd1
image: harbor.indeed.tech/dockerhub-proxy/amazon/aws-efs-csi-driver:v1.5.6
imageID: harbor.indeed.tech/dockerhub-proxy/amazon/aws-efs-csi-driver@sha256:cba55174d2df13e9939a83b9d71e8b74f6a27ada2e44252ac80136e33a992d6e
lastState: {}
name: efs-plugin
ready: true
restartCount: 0
started: true
state:
running:
startedAt: "2023-06-16T01:05:48Z"
- containerID: containerd://c9e7ab896df75b1249cbbf489adf8fe31d57e2caaf69d49b71a24c3a25858e39
image: public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe:v2.10.0-eks-1-27-3
imageID: public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe@sha256:25b4d3f9cf686ac464a742ead16e705da3adcfe574296dd75c5c05ec7473a513
lastState: {}
name: liveness-probe
ready: true
restartCount: 0
started: true
state:
running:
startedAt: "2023-06-16T01:05:50Z"
hostIP: 10.116.161.48
phase: Running
podIP: 10.116.161.48
podIPs:
- ip: 10.116.161.48
qosClass: Burstable
startTime: "2023-06-16T01:05:42Z"
driver_logs
kubectl logs efs-csi-node-w4sl9 -n kube-system efs-plugin
I0616 01:05:48.928661 1 config_dir.go:88] Creating symlink from '/etc/amazon/efs' to '/var/amazon/efs'
I0616 01:05:48.929567 1 metadata.go:63] getting MetadataService...
I0616 01:05:48.931589 1 metadata.go:68] retrieving metadata from EC2 metadata service
I0616 01:05:48.932454 1 cloud.go:137] EFS Client created using the following endpoint: https://elasticfilesystem.us-east-2.amazonaws.com
I0616 01:05:48.932478 1 driver.go:84] Node Service capability for Get Volume Stats Not enabled
I0616 01:05:48.932588 1 driver.go:140] Did not find any input tags.
I0616 01:05:48.932739 1 driver.go:113] Registering Node Server
I0616 01:05:48.932752 1 driver.go:115] Registering Controller Server
I0616 01:05:48.932758 1 driver.go:118] Starting efs-utils watchdog
I0616 01:05:48.932833 1 efs_watch_dog.go:221] Skip copying /etc/amazon/efs/efs-utils.conf since it exists already
I0616 01:05:48.932846 1 efs_watch_dog.go:221] Skip copying /etc/amazon/efs/efs-utils.crt since it exists already
I0616 01:05:48.933148 1 driver.go:124] Starting reaper
I0616 01:05:48.933167 1 driver.go:127] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0616 01:05:50.285468 1 node.go:306] NodeGetInfo: called with args
efs_utils_logs (something seems wrong here)
kubectl exec efs-csi-node-w4sl9 -n kube-system -c efs-plugin -- find /var/log/amazon/efs -type f -exec echo {} \; -exec cat {} \; -exec echo \;
find: 'echo': No such file or directory
efs_utils_state_dir
kubectl exec efs-csi-node-w4sl9 -n kube-system -c efs-plugin -- find /var/run/efs -type f -exec echo {} \; -exec cat {} \; -exec echo \;
mounts
kubectl exec efs-csi-node-w4sl9 -n kube-system -c efs-plugin -- mount |grep nfs
After further digging in our case, we noticed that the CSIDriver resource was missing in the cluster where the problem above was occurring. We have no idea why it's missing, but manually recreating it caused the controller to start working again.
This doesn't seem to be the first time an issue with the CSIDriver resource was noticed during a helm upgrade. https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues/325#issuecomment-779385896
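For anyone hitting the same symptom, checking and recreating the object is straightforward. Note that the EFS driver's CSIDriver object sets attachRequired: false; if the object is missing, attachRequired defaults to true and the attachdetach-controller waits for an attach that never happens, which would explain the timeout above. A sketch (the manifest mirrors what the Helm chart installs, but double-check it against your chart version):
$ kubectl get csidriver efs.csi.aws.com
$ cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: efs.csi.aws.com
spec:
  attachRequired: false
EOF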
@wmgroot I just experienced the same issue. Are you using ArgoCD? I'm still debugging the behaviour, but I can reproduce a "Delete CSIDriver" diff.
I believe it's related to how helm hooks are used in the chart for that resource and how ArgoCD is handling them.
We are using ArgoCD to manage our EFS CSI installation, yes. We check our Argo diffs as part of our upgrade process, and I do not remember seeing a deletion of the CSIDriver, but it's possible that we missed this during a previous upgrade or that I wasn't paying enough attention.
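One possible guard while the root cause is unclear: Argo CD supports a resource-level sync option that excludes an object from pruning, settable via an annotation (this only protects the object from being pruned; it doesn't explain the diff, and whether it covers the hook-induced deletion is an open question):
$ kubectl annotate csidriver efs.csi.aws.com \
    'argocd.argoproj.io/sync-options=Prune=false'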
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.