aws-ebs-csi-driver
Possible race condition in EBS volume creation
Hi 👋🏻
I think I've found a race condition that happens very occasionally when creating a PV for a new PVC. The outcome of the race condition is a PVC that is permanently stuck in Pending and has to be deleted and recreated.
As a ballpark estimate, based on the number of CreateVolume calls per day in CloudTrail and how frequently we see this issue, I would guess it happens on the order of 1 in 100,000 volume creations.
I've substituted the actual PV and EBS volume names below with placeholders like pvc-foo for ease of reading. The identifier of each PVC corresponds directly to the identifier of the EBS volume (i.e. PV pvc-foo corresponds to EBS volume vol-foo).
Summary
Based on CloudTrail events, it seems that aws-ebs-csi-driver sometimes creates and then shortly after deletes an EBS volume to provide a PV for a new PVC.
Once that has happened, the controller repeatedly retries the creation of the EBS volume, but it always fails because of how the API request to AWS is formed. Specifically:
- The name of the PV is generated from the UID of the PVC.
- The clientToken, which is used as an idempotency key in the request to the AWS API, is generated from the name of the PV.
- This means that clientToken will always be the same for a specific PVC, and the retries will fail indefinitely.
From the driver side, every retry results in the following error:
E0228 05:00:15.227309 1 driver.go:155] "GRPC error" err="rpc error: code = AlreadyExists desc = Could not create volume \"pvc-foo\": Parameters on this idempotent request are inconsistent with parameters used in previous request(s)"
and in the CloudTrail event, the error message is:
The client token you have provided is associated with a resource that is already deleted. Please use a different client token.
All of the CloudTrail events have the same userIdentity (the service account assumed by the driver) and source IP address (the driver's pod IP).
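To make the idempotency trap concrete, here's a minimal sketch in Go using the AWS SDK v2. This is not the driver's actual code; the clientTokenFor helper, the availability zone, and the volume parameters are made up for illustration. The point is just that a token derived purely from the PV name can never change between retries, so once EC2 has associated that token with a now-deleted volume, every subsequent CreateVolume call for that PVC is rejected.
```go
package main

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// clientTokenFor is a hypothetical helper: any deterministic function of the
// PV name (a hash, or the name itself) has the property described above --
// every retry for the same PVC produces the same token.
func clientTokenFor(pvName string) string {
	sum := sha256.Sum256([]byte(pvName))
	return hex.EncodeToString(sum[:])
}

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		panic(err)
	}
	client := ec2.NewFromConfig(cfg)

	// The PV name is derived from the PVC UID, so it is stable across retries.
	pvName := "pvc-foo"

	// Every retry replays the same ClientToken. If the volume created by the
	// first call has since been deleted, EC2 rejects all later calls with
	// Client.IdempotentParameterMismatch instead of creating a fresh volume.
	out, err := client.CreateVolume(context.TODO(), &ec2.CreateVolumeInput{
		AvailabilityZone: aws.String("us-east-1a"),
		Size:             aws.Int32(10),
		VolumeType:       types.VolumeTypeGp3,
		ClientToken:      aws.String(clientTokenFor(pvName)),
	})
	if err != nil {
		fmt.Println("CreateVolume failed:", err)
		return
	}
	fmt.Println("created", aws.ToString(out.VolumeId))
}
```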
Possible EBS volume ID confusion
One really weird thing I saw in the driver logs was this:
E0228 05:00:14.031881 1 driver.go:155] "GRPC error" err=<
rpc error: code = Internal desc = Could not create volume "pvc-foo": failed to get an available volume in EC2: InvalidVolume.NotFound: The volume 'vol-bar' does not exist.
status code: 400, request id: [redacted]
>
It seems to suggest that the driver was looking for an EBS volume with the ID vol-bar when creating the PV pvc-foo. That other volume belonged to a PV that was created about 10 minutes earlier and deleted while pvc-foo was in the process of being provisioned (see timeline below).
In our application, due to backup schedules, we do a fair amount of disk creation and deletion on the hour mark. I'm wondering if the increased load at those times is enough to trigger a race condition inside the driver.
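Purely as an illustration of the kind of bug I'm imagining (this is not taken from the driver's source), the sketch below shows how a single piece of unsynchronized shared state between two concurrent volume creations could leave the goroutine provisioning pvc-foo waiting on vol-bar:
```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// lastVolumeID is shared between goroutines without synchronization -- this
// is the hypothetical bug being illustrated, not something taken from the
// driver's source.
var lastVolumeID string

func provision(pvName, createdVolumeID string, wg *sync.WaitGroup) {
	defer wg.Done()

	lastVolumeID = createdVolumeID    // this goroutine records "its" volume...
	time.Sleep(10 * time.Millisecond) // ...but a concurrent creation can overwrite it
	waitingOn := lastVolumeID         // and we end up polling the wrong volume

	if waitingOn != createdVolumeID {
		fmt.Printf("%s: waiting on %s, but we actually created %s\n",
			pvName, waitingOn, createdVolumeID)
	}
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	go provision("pvc-foo", "vol-foo", &wg)
	go provision("pvc-bar", "vol-bar", &wg)
	wg.Wait()
}
```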
Timeline
All times in UTC
- 04:50:03 - CloudTrail - EBS volume vol-bar created for pvc-bar
- 05:00:09 - CloudTrail - EBS volume vol-foo created for pvc-foo
- 05:00:13 - CloudTrail - EBS volume vol-bar deleted
- 05:00:14 - aws-ebs-csi-driver - Could not create PV pvc-foo because it couldn't find EBS volume vol-bar in AWS (see the section "Possible EBS volume ID confusion" above)
- 05:00:14 - CloudTrail - EBS volume vol-foo deleted
- 05:00:15 - CloudTrail - EBS volume vol-foo creation attempted again, but fails with Client.IdempotentParameterMismatch (this repeats indefinitely)
Note: I can't be sure of the order of events between aws-ebs-csi-driver and CloudTrail. In particular, the two events at 05:00:14 may have happened in a different order.
Software versions
Kubernetes: v1.26.12-eks-5e0fdde
aws-ebs-csi-driver: v1.28.0
Potentially related issues
I don't think any of these are the same issue, as they affect every PVC created by the people reporting them. They seem more like configuration or compatibility issues on their end, whereas this seems more like a race condition.
- #1916: Someone seeing the same error message every time they try to create a volume, having just installed aws-ebs-csi-driver
- #1140: Someone who was missing KMS permissions and would deterministically see this error because the volume would fail to encrypt
- #1036: Someone else who was missing KMS permissions
More info
I'm happy to get any more info you need to debug this. I think CloudTrail events stick around for 90 days and our own logs should stick around for a decent amount of time too.
/kind bug
So I just got another two of these today, right at the same time and in the same Kubernetes cluster. What I found really interesting was that the logs showed them both pointing to a single EBS volume that was created for a different PVC:
E0304 11:46:12.516570 1 driver.go:155] "GRPC error" err=<
rpc error: code = Internal desc = Could not create volume "pvc-foo": failed to get an available volume in EC2: InvalidVolume.NotFound: The volume 'vol-baz' does not exist.
status code: 400, request id: [redacted]
>
E0304 11:46:12.525748 1 driver.go:155] "GRPC error" err=<
rpc error: code = Internal desc = Could not create volume "pvc-bar": failed to get an available volume in EC2: InvalidVolume.NotFound: The volume 'vol-baz' does not exist.
status code: 400, request id: [redacted]
>
The volume for that PVC (vol-baz) shows as being deleted at 11:46:11 in CloudTrail, which fits the pattern from last time.
Hi, thanks for the great writeup - we've been vaguely aware of this issue for some time and it's on our backlog to fix. In the meantime, deleting and recreating the PVC/VolumeSnapshot object that is affected should act as a workaround for the rare times you hit this issue.
For us it's happening all the time (maybe 95/100), any updates would be appreciated
We're working on a fix for this issue. However, that failure rate is highly unusual and likely indicates your volume(s) are failing to create because of invalid parameters (for example, an invalid KMS key or one the driver doesn't have access to).