iam-controller is not able to delete created IAM role
Describe the bug
The IAM role isn't deleted after removing the role manifest for iam-controller 1.2.3. Below is the error message:
Message: DeleteConflict: Cannot delete entity, must delete policies first. status code: 409, request id: ebc1f8a1-68ed-4fbc-baed-572c3de80960
Steps to reproduce
- Create an IAM role manifest with inlinePolicies
- Apply the manifest so the role is created
- Delete the role manifest
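A minimal Role manifest along these lines reproduces the setup (the names and policy bodies are illustrative placeholders, not the reporter's actual manifest, and field names should be double-checked against the installed CRD schema):

```yaml
apiVersion: iam.services.k8s.aws/v1alpha1
kind: Role
metadata:
  name: example-role
spec:
  name: example-role
  # Trust policy allowing EC2 to assume the role
  assumeRolePolicyDocument: |
    {
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole"
      }]
    }
  # Inline policies are embedded in the role itself, keyed by policy name
  inlinePolicies:
    example-inline-policy: |
      {
        "Version": "2012-10-17",
        "Statement": [{
          "Effect": "Allow",
          "Action": "s3:ListBucket",
          "Resource": "*"
        }]
      }
```

Deleting this manifest should trigger the controller to delete the inline policies before calling DeleteRole; the reported DeleteConflict suggests that ordering was not happening.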
Expected outcome
The IAM role should be deleted after removing the role manifest.
Environment
- Kubernetes version 1.24
- Using EKS: yes, version eks.10
- AWS service targeted: IAM
Hi @sylin218, can you share an example CR of the resource you're trying to create/delete?
Hi @sylin218 !
Please give a more specific example; I checked multiple times and did not run into any problems with this function...
Hey, since GitHub doesn't support .yaml attachments, I changed the manifest extension to .txt. I also hid some sensitive info.
Hey @gecube @a-hilaly do we have any updates here?
@sylin218 Hi! Please don't be confused :-) I am not a developer of ACK, just an enthusiast :-) I will try to reproduce your issue in the next few days.
Issues go stale after 180d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 60d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Provide feedback via https://github.com/aws-controllers-k8s/community.
/lifecycle stale
/remove-lifecycle stale
@sylin218 @gecube can either of you confirm that indeed this is still a bug with the latest version of the iam-controller?
@gecube is this still a bug?
@michaelhtm Hi! Thanks for the reminder.
So I am checking what's going on.
First observation:
There is one role attached to an EC2 instance profile, and since the EC2 instances still exist, I cannot remove the role:
{"level":"error","ts":"2025-04-25T06:45:00.738Z","msg":"Reconciler error","controller":"role","controllerGroup":"iam.services.k8s.aws","controllerKind":"Role","Role":{"name":"ec2-ledger","namespace":"infra-production"},"namespace":"infra-production","name":"ec2-ledger","reconcileID":"3b40209c-9b62-42a1-8348-aacbaee65044","error":"operation error IAM: DeleteRole, https response error StatusCode: 409, RequestID: eef2fdca-b01b-4568-a7c5-526396d1774b, DeleteConflict: Cannot delete entity, must remove roles from instance profile first.","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:347\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:294\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:255"}
It looks like a limitation of the AWS API, so there is nothing we can do about it; the only option would be to emit a proper metric with an alert.
And yes, I see the proper condition in the status field of Kind: Role
...
- message: 'DeleteConflict: Cannot delete entity, must remove roles from instance profile first.'
  status: 'True'
  type: ACK.Recoverable
...
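This is indeed an IAM API restriction: DeleteRole returns a 409 DeleteConflict while the role is still attached to an instance profile. The manual cleanup looks roughly like this (the instance-profile name is a placeholder; the role name is taken from the log above):

```shell
# List the instance profiles the role is still attached to
aws iam list-instance-profiles-for-role --role-name ec2-ledger

# Detach the role from each profile; after that, the controller's
# DeleteRole call can succeed on the next reconcile
aws iam remove-role-from-instance-profile \
  --instance-profile-name my-instance-profile \
  --role-name ec2-ledger
```

This is a sketch of the usual workflow, not something the controller does automatically.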
Second observation.
It looks like I have an incorrect role referring to a nonexistent policy in the dev environment:
{"level":"error","ts":"2025-04-25T06:47:36.802Z","msg":"Reconciler error","controller":"role","controllerGroup":"iam.services.k8s.aws","controllerKind":"Role","Role":{"name":"teleport-role","namespace":"infra-dev"},"namespace":"infra-dev","name":"teleport-role","reconcileID":"292760ce-e56a-40c0-afb8-9751f8cc89f2","error":"operation error IAM: CreateRole, https response error StatusCode: 404, RequestID: 0a850a3d-31c8-401f-a9b0-a4e1485cd306, api error NoSuchEntity: Scope ARN: arn:aws:iam::474417630776:policy/DatabaseDiscoveryBoundary does not exist or is not attachable.","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:347\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:294\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:255"}
Again - the condition is set properly:
...
conditions:
  - message: 'api error NoSuchEntity: Scope ARN: arn:aws:iam::474417630776:policy/DatabaseDiscoveryBoundary does not exist or is not attachable.'
    status: 'True'
    type: ACK.Recoverable
  - lastTransitionTime: '2025-04-25T06:48:59Z'
    message: Unable to determine if desired resource state matches latest observed state
    reason: 'operation error IAM: CreateRole, https response error StatusCode: 404, RequestID: 9bafa841-084c-4f49-ad0f-2c13ec984b55, api error NoSuchEntity: Scope ARN: arn:aws:iam::474417630776:policy/DatabaseDiscoveryBoundary does not exist or is not attachable.'
    status: Unknown
    type: ACK.ResourceSynced
...
It is definitely my error: when copying manifests between catalogues, I forgot to change the AWS account ID. We should probably find a better way to refer to objects in the same account than rewriting the JSON from scratch. I would be grateful if we could find such an approach. For now I decided just to delete these roles and the associated resources.
Third observation.
Regarding the original topic: I removed around 20 different roles in different accounts, and I can confirm that the original issue now appears to be resolved.
Hey @gecube, from your response I gather that deletion of roles is working as expected.
For the second observation, where you mention that you had to rewrite JSON from scratch and were looking for a better approach: would you mind sharing your manifest? The Policy can be defined as a resource and then referenced. That should eliminate having to rewrite it.
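A sketch of that pattern as I understand it (field names follow the ACK IAM controller's CRDs but should be verified against the installed schema; all names and policy bodies are illustrative):

```yaml
apiVersion: iam.services.k8s.aws/v1alpha1
kind: Policy
metadata:
  name: my-policy
spec:
  name: my-policy
  policyDocument: |
    {
      "Version": "2012-10-17",
      "Statement": [{"Effect": "Allow", "Action": "s3:ListBucket", "Resource": "*"}]
    }
---
apiVersion: iam.services.k8s.aws/v1alpha1
kind: Role
metadata:
  name: my-role
spec:
  name: my-role
  assumeRolePolicyDocument: |
    {
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole"
      }]
    }
  # Reference the Policy CR by name instead of embedding an ARN
  policyRefs:
    - from:
        name: my-policy
```

The controller resolves the reference once the Policy resource is synced, so the Role manifest never needs a hard-coded account ID.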
@rushmash91 Hi! Thanks for your comment. I think that when you define Policy as a separate Kind:
first, it does not solve my issue that I want to feed the policy as a plain JSON file from the k8s repo;
second, it creates extra objects in the AWS account that I then need to manage and care about.
What would really be cool is if I could put the policy as a file into a Secret or ConfigMap and reference it like:
...
policyRef:
  - kind: ConfigMap
    name: my-great-policy
    key: policy.json
...
because it is very easy to deploy plain config files to k8s with kustomize. Otherwise I would need to write a Helm chart and wrap the IAM Role.
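For example, kustomize can ship a plain policy.json as a ConfigMap without any templating (a sketch of standard kustomize usage; note the policyRef field above is a proposed feature, not something the controller supports today):

```yaml
# kustomization.yaml
configMapGenerator:
  - name: my-great-policy
    files:
      - policy.json
```

Running `kubectl apply -k .` would then create the ConfigMap with the raw JSON under the policy.json key.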
Anyway, I don't like a separate Policy because of the delays between reconciliation cycles in the IAM controller and all the other controllers. I observed the following behaviour: if you create an EKS cluster with the EKS controller and some of the referenced resources are not ready, the cluster is not created until all prerequisite resources are ready. And that can take a very long time, during which the controller constantly spams dozens of nonsense messages.