iam-manager icon indicating copy to clipboard operation
iam-manager copied to clipboard

Race condition exists on rapid delete/create of Iamroles, or when IAM Role already exists in AWS

Open diranged opened this issue 3 years ago • 0 comments

Is this a BUG REPORT or FEATURE REQUEST?:

This is a bug report, with a fix at https://github.com/keikoproj/iam-manager/pull/82.

What happened:

We began running into a situation where our ServiceAccount annotations would be set with eks.amazonaws.com/role-arn: "". After digging into the code, we discovered that the default case handler in the controller has this if statement that is problematic. I think its intent was to handle an intermittent failure in the READY handler, where the ServiceAccount was already setup properly... but what it fails to do is handle the situation where the IAMRole already exists.

Here is the rundown of events that can cause this failure:

  1. A new Iamrole is created.. and the controller reconciliation starts.
  2. The r.IAMClient.CreateRole function is called.
    1. The function sets roleAlreadyExists=true and logs the error.
    2. Because roleAlreadyExists, the code then skips populating the IAMRoleResponse object.
  3. The handler checks if resp.RoleARN is set. It is not because we got an empty response object.
  4. The handler tries to use the previously known good state in iamRole.Status.RoleARN ... but remember, this is a newly created Iamrole resource, so that is empty too.
  5. The ServiceAccount is populated with an empty string annotation.

What you expected to happen:

I expect the code to realize that the IAM Role in AWS is indeed the one we are trying to use (likely previously created but perhaps the reconciliation loop did not complete, OR a fast create/delete was called on the Iamrole resource by a tool like ArgoCD, or any other number of intermittent situations). Either way, i expect eventual consistency to work out the issue.

How to reproduce it (as minimally and precisely as possible):

Pre-create an IAM Role... then try creating an Iamrole resource.

Anything else we need to know?:

While digging into this, I also noticed that there is no reconciliation that happens for the ServiceAccount annotation. I have a second PR (prepped at https://github.com/Nextdoor/keikoproj-iam-manager/pull/4) that significantly revamps the k8s/rbac.go package to be more testable, and implements regular reconciliation of the ServiceAccount annotation. We've been running the combination of these two in production now for 2 weeks and we've had great success.

diranged avatar Jul 12 '21 14:07 diranged