iam-manager
iam-manager copied to clipboard
Race condition exists on rapid delete/create of Iamroles, or when IAM Role already exists in AWS
Is this a BUG REPORT or FEATURE REQUEST?:
This is a bug report, with a fix at https://github.com/keikoproj/iam-manager/pull/82.
What happened:
We began running into a situation where our ServiceAccount
annotations would be set with eks.amazonaws.com/role-arn: ""
. After digging into the code, we discovered that the default
case handler in the controller has this if
statement that is problematic. I think its intent was to handle an intermittent failure in the READY
handler, where the ServiceAccount
was already setup properly... but what it fails to do is handle the situation where the IAMRole already exists.
Here is the rundown of events that can cause this failure:
- A new
Iamrole
is created.. and the controller reconciliation starts. - The
r.IAMClient.CreateRole
function is called.- The function sets
roleAlreadyExists=true
and logs the error. - Because
roleAlreadyExists
, the code then skips populating theIAMRoleResponse
object.
- The function sets
- The handler checks if
resp.RoleARN
is set. It is not because we got an empty response object. - The handler tries to use the previously known good state in
iamRole.Status.RoleARN
... but remember, this is a newly createdIamrole
resource, so that is empty too. - The
ServiceAccount
is populated with an empty string annotation.
What you expected to happen:
I expect the code to realize that the IAM Role in AWS is indeed the one we are trying to use (likely previously created but perhaps the reconciliation loop did not complete, OR a fast create/delete was called on the Iamrole resource by a tool like ArgoCD, or any other number of intermittent situations). Either way, i expect eventual consistency to work out the issue.
How to reproduce it (as minimally and precisely as possible):
Pre-create an IAM Role... then try creating an Iamrole resource.
Anything else we need to know?:
While digging into this, I also noticed that there is no reconciliation that happens for the ServiceAccount
annotation. I have a second PR (prepped at https://github.com/Nextdoor/keikoproj-iam-manager/pull/4) that significantly revamps the k8s/rbac.go
package to be more testable, and implements regular reconciliation of the ServiceAccount
annotation. We've been running the combination of these two in production now for 2 weeks and we've had great success.