aws-ebs-csi-driver
Possible race condition in EBS volume creation
Hi 👋🏻
I think I've found a race condition that happens very occasionally when creating a PV for a new PVC. The outcome of the race condition is a PVC that is permanently stuck in Pending and has to be deleted and recreated.
As a ballpark estimate, based on the number of CreateVolume calls per day in CloudTrail and how frequently we see this issue, I would guess it happens on the order of 1 in 100,000 volume creations.
I've substituted the actual PV and EBS volume names below with placeholders like pvc-foo for ease of reading. The identifier of each PVC corresponds directly to the identifier of the EBS volume (i.e. PV pvc-foo corresponds to EBS volume vol-foo).
Summary
Based on CloudTrail events, it seems that aws-ebs-csi-driver sometimes creates and then shortly after deletes an EBS volume to provide a PV for a new PVC.
Once that has happened, the controller repeatedly retries the creation of the EBS volume, but it always fails because of how the API request to AWS is formed. Specifically:
- The name of the PV is generated from the UID of the PVC.
- The clientToken, which is used as an idempotency key in the request to the AWS API, is generated from the name of the PV.
- This means that clientToken will always be the same for a specific PVC, and the retries will fail indefinitely.
From the driver side, every retry results in the following error:
E0228 05:00:15.227309 1 driver.go:155] "GRPC error" err="rpc error: code = AlreadyExists desc = Could not create volume \"pvc-foo\": Parameters on this idempotent request are inconsistent with parameters used in previous request(s)"
and in the CloudTrail event, the error message is:
The client token you have provided is associated with a resource that is already deleted. Please use a different client token.
All of the CloudTrail events have the same userIdentity (the service account assumed by the driver) and source IP address (the driver's pod IP).
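To make the idempotency trap concrete, here's a minimal sketch in Go using the AWS SDK v2. This is not the driver's actual code; the clientTokenFor helper, the availability zone, and the volume parameters are made up for illustration. The point is just that a token derived purely from the PV name can never change between retries, so once EC2 has associated that token with a now-deleted volume, every subsequent CreateVolume call for that PVC is rejected.
```go
package main

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// clientTokenFor is a hypothetical helper: any deterministic function of the
// PV name (a hash, or the name itself) has the property described above --
// every retry for the same PVC produces the same token.
func clientTokenFor(pvName string) string {
	sum := sha256.Sum256([]byte(pvName))
	return hex.EncodeToString(sum[:])
}

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		panic(err)
	}
	client := ec2.NewFromConfig(cfg)

	// The PV name is derived from the PVC UID, so it is stable across retries.
	pvName := "pvc-foo"

	// Every retry replays the same ClientToken. If the volume created by the
	// first call has since been deleted, EC2 rejects all later calls with
	// Client.IdempotentParameterMismatch instead of creating a fresh volume.
	out, err := client.CreateVolume(context.TODO(), &ec2.CreateVolumeInput{
		AvailabilityZone: aws.String("us-east-1a"),
		Size:             aws.Int32(10),
		VolumeType:       types.VolumeTypeGp3,
		ClientToken:      aws.String(clientTokenFor(pvName)),
	})
	if err != nil {
		fmt.Println("CreateVolume failed:", err)
		return
	}
	fmt.Println("created", aws.ToString(out.VolumeId))
}
```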
Possible EBS volume ID confusion
One really weird thing I saw in the driver logs was this:
E0228 05:00:14.031881 1 driver.go:155] "GRPC error" err=<
rpc error: code = Internal desc = Could not create volume "pvc-foo": failed to get an available volume in EC2: InvalidVolume.NotFound: The volume 'vol-bar' does not exist.
status code: 400, request id: [redacted]
>
It seems to suggest that the driver was looking for an EBS volume with the ID vol-bar when creating the PV pvc-foo. That other volume belonged to a PV that was created about 10 minutes earlier and deleted while pvc-foo was in the process of being provisioned (see timeline below).
In our application, due to backup schedules, we do a fair amount of disk creation and deletion on the hour mark. I'm wondering if the increased load at those times is enough to trigger a race condition inside the driver.
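Purely as an illustration of the kind of bug I'm imagining (this is not taken from the driver's source), the sketch below shows how a single piece of unsynchronized shared state between two concurrent volume creations could leave the goroutine provisioning pvc-foo waiting on vol-bar:
```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// lastVolumeID is shared between goroutines without synchronization -- this
// is the hypothetical bug being illustrated, not something taken from the
// driver's source.
var lastVolumeID string

func provision(pvName, createdVolumeID string, wg *sync.WaitGroup) {
	defer wg.Done()

	lastVolumeID = createdVolumeID    // this goroutine records "its" volume...
	time.Sleep(10 * time.Millisecond) // ...but a concurrent creation can overwrite it
	waitingOn := lastVolumeID         // and we end up polling the wrong volume

	if waitingOn != createdVolumeID {
		fmt.Printf("%s: waiting on %s, but we actually created %s\n",
			pvName, waitingOn, createdVolumeID)
	}
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	go provision("pvc-foo", "vol-foo", &wg)
	go provision("pvc-bar", "vol-bar", &wg)
	wg.Wait()
}
```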
Timeline
All times in UTC
- 04:50:03 - CloudTrail - EBS volume vol-bar created for pvc-bar
- 05:00:09 - CloudTrail - EBS volume vol-foo created for pvc-foo
- 05:00:13 - CloudTrail - EBS volume vol-bar deleted
- 05:00:14 - aws-ebs-csi-driver - Could not create PV pvc-foo because it couldn't find EBS volume vol-bar in AWS (see the section "Possible EBS volume ID confusion" above)
- 05:00:14 - CloudTrail - EBS volume vol-foo deleted
- 05:00:15 - CloudTrail - EBS volume vol-foo creation attempted again, but fails with Client.IdempotentParameterMismatch (this repeats indefinitely)
Note: I can't be sure of the order of events between aws-ebs-csi-driver and CloudTrail. In particular, the two events at 05:00:14 may have happened in a different order.
Software versions
Kubernetes: v1.26.12-eks-5e0fdde
aws-ebs-csi-driver: v1.28.0
Potentially related issues
I don't think any of these are the same issue, as they affect every PVC created by the people reporting them. They seem more like configuration or compatibility issues on their end, whereas this seems more like a race condition.
- #1916: Someone seeing the same error message every time they try to create a volume, having just installed aws-ebs-csi-driver
- #1140: Someone who was missing KMS permissions and would deterministically see this error because the volume would fail to encrypt
- #1036: Someone else who was missing KMS permissions
More info
I'm happy to get any more info you need to debug this. I think CloudTrail events stick around for 90 days and our own logs should stick around for a decent amount of time too.
/kind bug
So I just got another two of these today, right at the same time and in the same Kubernetes cluster. What I found really interesting was that the logs showed them both pointing to a single EBS volume that was created for a different PVC:
E0304 11:46:12.516570 1 driver.go:155] "GRPC error" err=<
rpc error: code = Internal desc = Could not create volume "pvc-foo": failed to get an available volume in EC2: InvalidVolume.NotFound: The volume 'vol-baz' does not exist.
status code: 400, request id: [redacted]
>
E0304 11:46:12.525748 1 driver.go:155] "GRPC error" err=<
rpc error: code = Internal desc = Could not create volume "pvc-bar": failed to get an available volume in EC2: InvalidVolume.NotFound: The volume 'vol-baz' does not exist.
status code: 400, request id: [redacted]
>
The volume for that PVC (vol-baz) shows as being deleted at 11:46:11 in CloudTrail, which fits the pattern from last time.
Hi, thanks for the great writeup - we've been vaguely aware of this issue for some time and it's on our backlog to fix. In the meantime, deleting and recreating the PVC/VolumeSnapshot object that is affected should act as a workaround for the rare times you hit this issue.
For us it's happening all the time (maybe 95/100), any updates would be appreciated
We're working on a fix for this issue. However, that failure rate is highly unusual and likely indicates your volume(s) are failing to create because of invalid parameters (for example, an invalid KMS key or one the driver doesn't have access to).