aws-ebs-csi-driver
Race Condition - PVs don't get reused when starting new node
/kind bug
What happened?
I'm running into an issue where existing PVs are only reused if a node is available at the time of PVC creation.
When I scale up the pods, they create new PVCs right away (while the pod moves into ContainerCreating). If nodes are available, an existing PV is bound to the PVC immediately. If no nodes are available, the PVC stays Pending, and as soon as a new node becomes Ready, a new PV is provisioned and bound even though there are 100+ existing PVs that meet the requirements. If I then schedule another pod onto that new node, an existing PV is used for the subsequent pods attached to the node. It's worth noting that I am scaling nodes with Karpenter and have locked it down to a single availability zone, so all PVs are in one zone.
I've ended up with hundreds of PVs for something that dynamically scales between 0 and 6 pods. This is an actions-runner from the actions-runner-controller, used to run GitHub Actions on EKS.
Additional Testing
I deleted all the PVs. Then, in a single AZ:
- I created 60 pods, which created 60 PVs.
- I scaled to 0, waited a while, made sure everything was Available, then scaled back to 60. This created 28 more PVs, for a total of 88; the rest of the pods were bound to existing PVs.
- I repeated the cycle, and this time it created 25 more PVs, for a total of 113. This was because some 2xl nodes allowed more pods to join.
It seems that the first pod to join a node creates a new PV, while the second (and sometimes third) pod to join uses an existing PV.
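For reference, the growth can be quantified by filtering `kubectl get pv` on the STATUS column before and after each scale cycle. A minimal sketch — the sample output below is fabricated for illustration; on a real cluster you would pipe in live `kubectl` output instead:

```shell
# Fabricated sample of 'kubectl get pv --no-headers' output (illustration only).
pv_output='pvc-aaa   22Gi   RWO   Retain   Bound       actions-runner-system/var-lib-docker-0   arc-cache-infra-tests
pvc-bbb   22Gi   RWO   Retain   Available   arc-cache-infra-tests
pvc-ccc   22Gi   RWO   Retain   Available   arc-cache-infra-tests'

# Field 5 is the STATUS column; count PVs sitting unused.
# Live-cluster equivalent: kubectl get pv --no-headers | awk '$5 == "Available"' | wc -l
echo "$pv_output" | awk '$5 == "Available"' | wc -l   # -> 2
```

If reuse worked as expected, the Available count would drop back toward zero on scale-up instead of new PVs appearing alongside it.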
Relevant Logs
The only logs the csi-controller produces are:
I1024 14:16:00.140116 1 cloud.go:713] "Waiting for volume state" volumeID="vol-07da60cb4e75fa23b" actual="attaching" desired="attached"
I1024 14:16:45.736756 1 cloud.go:713] "Waiting for volume state" volumeID="vol-017224c77fe3e01f6" actual="attaching" desired="attached"
And the ebs-csi-node pod that comes up in response to the new node shows:
Defaulted container "ebs-plugin" out of: ebs-plugin, node-driver-registrar, liveness-probe
I1024 14:16:39.952320 1 driver.go:75] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.19.0"
I1024 14:16:39.952362 1 node.go:85] "regionFromSession Node service" region=""
I1024 14:16:39.952371 1 metadata.go:85] "retrieving instance data from ec2 metadata"
I1024 14:16:39.953346 1 metadata.go:92] "ec2 metadata is available"
I1024 14:16:39.953741 1 metadata_ec2.go:25] "Retrieving EC2 instance identity metadata" regionFromSession=""
I1024 14:16:49.743118 1 mount_linux.go:517] Disk "/dev/nvme1n1" appears to be unformatted, attempting to format as type: "ext4" with options: [-F -m0 /dev/nvme1n1]
I1024 14:16:50.081231 1 mount_linux.go:528] Disk successfully formatted (mkfs): ext4 - /dev/nvme1n1 /var/lib/kubelet/plugins/kubernetes.io/csi/ebs.csi.aws.com/ad9bcd0a40bcd21382425af4ee754c0bd51e9e1a07000680a9e75a86ab0bb7d5/globalmount
I1024 14:16:50.081317 1 mount_linux.go:245] Detected OS without systemd
These seem to pertain to root volume provisioning (which is working well); my concern is with the mounted volume:
volumeMounts:
  - name: var-lib-docker
    mountPath: /var/lib/docker
...
volumeClaimTemplates:
  - metadata:
      name: var-lib-docker
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 22Gi
      storageClassName: arc-cache-infra-tests
which uses the storage class
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: arc-cache-infra-tests
  labels:
    content: arc-cache-infra-tests
provisioner: ebs.csi.aws.com
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
Expected Behavior
PVs should be reused when a new node starts. New PVs should only be created when existing PVs are unavailable.
Reproduction Steps
How to reproduce it (as minimally and precisely as possible)? You can use the actions-runners, but I have also simulated this with StatefulSets to make it easier to reproduce.
# StorageClass yaml
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: arc-cache-infra-tests
  labels:
    content: arc-cache-infra-tests
provisioner: ebs.csi.aws.com
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
# StatefulSet yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: busybox-statefulset
  namespace: actions-runner-system
spec:
  serviceName: "busybox"
  replicas: 20
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      serviceAccountName: runner-sa
      tolerations:
        - key: purpose
          operator: Equal
          value: github-runner
          effect: NoSchedule
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: purpose
                    operator: In
                    values:
                      - github-runner
      containers:
        - name: busybox
          image: busybox
          command: ["tail", "-f", "/dev/null"]
          resources:
            requests:
              cpu: "1500m"
              memory: "1500Mi"
            limits:
              cpu: "1500m"
              memory: "1500Mi"
          volumeMounts:
            - name: var-lib-docker
              mountPath: /var/lib/docker
  volumeClaimTemplates:
    - metadata:
        name: var-lib-docker
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 22Gi
        storageClassName: arc-cache-infra-tests
# Karpenter Provisioner and AWSNodeTemplate
---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: github-runner-testing-cpu-75-c7a
spec:
  weight: 50
  limits:
    resources:
      cpu: '300'
  providerRef:
    name: github-runner-75
  consolidation:
    enabled: false
  ttlSecondsUntilExpired: 600 # 10 mins
  ttlSecondsAfterEmpty: 600 # 10 mins
  taints:
    - key: purpose
      value: github-runner
      effect: NoSchedule
  labels:
    scheduler: karpenter
    purpose: github-runner
    constraint: cpu # cpu or memory
    size: large
    lifecycle: ephemeral # ephemeral or persistent
    usage: testing
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: [spot]
    - key: karpenter.k8s.aws/instance-family
      operator: In
      values: [c7a]
    - key: karpenter.k8s.aws/instance-size
      operator: In
      values: [xlarge]
    - key: topology.kubernetes.io/zone
      operator: In
      values: [us-west-2a]
    - key: kubernetes.io/os
      operator: In
      values:
        - linux
    - key: kubernetes.io/arch
      operator: In
      values:
        - amd64
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: github-runner-75
spec:
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 75Gi
        volumeType: gp3
        encrypted: true
  subnetSelector:
    karpenter.sh/discovery: primary-cluster
  securityGroupSelector:
    karpenter.sh/discovery: primary-cluster
  instanceProfile: github-instance-profile
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: optional
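With the manifests above applied, the scale cycle from the Additional Testing section can be sketched as a script. This is illustrative only: the resource names come from the manifests above, the sleep duration is a guess based on ttlSecondsAfterEmpty, and a live cluster is required.

```shell
#!/usr/bin/env sh
# Sketch of the repro cycle; assumes the StorageClass, StatefulSet, and
# Karpenter resources above are already applied to the cluster.
set -eu
NS=actions-runner-system

# 1. Scale up; WaitForFirstConsumer defers binding until pods are scheduled.
kubectl -n "$NS" scale statefulset busybox-statefulset --replicas=60
kubectl -n "$NS" rollout status statefulset busybox-statefulset

# 2. Scale to 0 and wait for Karpenter to reclaim the empty nodes
#    (ttlSecondsAfterEmpty is 600 in the Provisioner above).
kubectl -n "$NS" scale statefulset busybox-statefulset --replicas=0
sleep 900

# 3. Scale up again on fresh nodes and compare the PV count; if PVs were
#    reused it should still be 60, but in practice it grows.
kubectl -n "$NS" scale statefulset busybox-statefulset --replicas=60
kubectl -n "$NS" rollout status statefulset busybox-statefulset
kubectl get pv --no-headers | grep -c arc-cache-infra-tests
```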
Environment: AWS EKS
- Kubernetes version (use kubectl version): Client Version: v1.28.1; Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3; Server Version: v1.27.4-eks-2d98532
- Driver version: 1.19
I see the same issue in the following environment: AWS EKS
- Kubernetes version (use kubectl version): Client Version: v1.26.11; Kustomize Version: v4.5.7; Server Version: v1.25.16-eks-8cb36c9
- Driver version: 2.22.0
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale