
๐Ÿ› Feature request: improve authentication error diagnostics (IRSA vs Pod Identity)

Open gecube opened this issue 3 months ago • 4 comments

Hello team,

I encountered the following error when running the CloudWatch Logs controller:

{"level":"error","ts":"2025-11-09T13:35:17.149Z","logger":"setup","msg":"Unable to create controller manager","aws.service":"cloudwatchlogs","error":"unable to determine account ID: unable to get caller identity: operation error STS: GetCallerIdentity, get identity: get credentials: failed to refresh cached credentials, failed to load credentials, : [43250e5c-8c9c-4fe8-af41-34d48103435b]: (AccessDeniedException): Unauthorized Exception! EKS does not have permissions to assume the associated role., fault: client","stacktrace":"main.main\n\t/github.com/aws-controllers-k8s/cloudwatchlogs-controller/cmd/controller/main.go:77\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:285"}

What actually happened

The error message is not very informative.
In my case, the ServiceAccount had an IRSA annotation, but the controller was actually using EKS Pod Identity under the hood.
Naturally, the IAM role did not include a trust relationship for Pod Identity, which resulted in the AccessDeniedException.

Because of this, I initially spent time debugging IRSA, while the actual problem was with Pod Identity being used instead.

Why this matters

Debugging such issues (IRSA vs Pod Identity) is quite painful and time-consuming.
The current error message only says "EKS does not have permissions to assume the associated role," but doesn't clarify which authentication mechanism was in use.

Having that context would immediately point users in the right direction.

Expected behavior

It would be great if the controller:

  1. Explicitly logged which authentication mechanism is in use (e.g. Using Pod Identity, Using IRSA (web identity)), ideally at the INFO level during startup.
  2. On AccessDeniedException, enriched the error message with hints on where to look:
    • For Pod Identity: check the ServiceAccount ↔ Pod Identity Association and the IAM role trust policy.
    • For IRSA: check the eks.amazonaws.com/role-arn annotation, the OIDC provider, and the trust policy.

Suggested improvements

  • Add an explicit log message showing which credential provider/mechanism is active (IRSA / Pod Identity / static / env).
  • In the STS GetCallerIdentity error handler, enrich the error text with a note about the active mechanism and recommended checks (trust policy, associations, etc.).
  • (Optional) Expose this information via metrics or /healthz endpoint for quick diagnostics.
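To sketch the first suggestion: EKS injects well-known environment variables for each mechanism (AWS_WEB_IDENTITY_TOKEN_FILE and AWS_ROLE_ARN for IRSA; AWS_CONTAINER_CREDENTIALS_FULL_URI for the Pod Identity agent), so the controller could log a best-effort guess at startup. This is only an illustration, not ACK code; the function name and returned labels are made up.

```go
package main

import (
	"fmt"
	"os"
)

// detectAuthMechanism makes a best-effort guess at the active credential
// mechanism from the environment variables that the EKS IRSA webhook and the
// Pod Identity agent inject. The container-credentials check comes first to
// mirror the behavior described above (Pod Identity winning out over an IRSA
// annotation); the SDK's real precedence is defined by its credential chain.
func detectAuthMechanism(getenv func(string) string) string {
	switch {
	case getenv("AWS_CONTAINER_CREDENTIALS_FULL_URI") != "" ||
		getenv("AWS_CONTAINER_CREDENTIALS_RELATIVE_URI") != "":
		return "Pod Identity / container credentials"
	case getenv("AWS_WEB_IDENTITY_TOKEN_FILE") != "" && getenv("AWS_ROLE_ARN") != "":
		return "IRSA (web identity)"
	case getenv("AWS_ACCESS_KEY_ID") != "":
		return "static / environment credentials"
	default:
		return "default chain (shared config, IMDS, ...)"
	}
}

func main() {
	// Illustrative INFO-level startup message.
	fmt.Println("auth mechanism:", detectAuthMechanism(os.Getenv))
}
```

Logging one such line at startup would have immediately revealed that Pod Identity, not IRSA, was in play.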

Rationale

This would make debugging IAM-related startup issues much faster and prevent users from "chasing" the wrong mechanism, especially when IRSA and Pod Identity configurations coexist.

Thank you!

gecube avatar Nov 09 '25 13:11 gecube

Hello @gecube 👋 Thank you for opening an issue in ACK! A maintainer will triage this issue soon.

We encourage community contributions, so if you're interested in tackling this yourself or suggesting a solution, please check out our Contribution and Code of Conduct guidelines.

You can find more information about ACK on our website.

github-actions[bot] avatar Nov 09 '25 13:11 github-actions[bot]

Removing the Pod Identity association from the AWS console solved my issue, but others may be stuck in the same or a very similar situation.

gecube avatar Nov 09 '25 13:11 gecube

This is functionality specific to the AWS SDK, not anything specific to ACK itself. AWS tooling has standard mechanisms to define credential sources: https://docs.aws.amazon.com/sdkref/latest/guide/standardized-credentials.html#credentialProviderChain

We might be able to introspect the AWS SDK, but I'm not sure this information is even available to us, since this logic is implemented by the AWS SDK tooling itself.
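For what it's worth, the Go SDK v2 does expose one piece of this: the `aws.Credentials` struct returned by `cfg.Credentials.Retrieve(ctx)` carries a `Source` field naming the provider that produced the credentials. A hypothetical error handler could map that string to a hint. The `Source` literals below are my assumptions about the SDK's provider names and would need verifying against the SDK release in use.

```go
package main

import "fmt"

// hintForCredentialSource maps an aws.Credentials.Source string (as reported
// by the AWS SDK for Go v2) to a debugging hint. The case literals are
// assumptions about the SDK's provider names, not verified constants.
func hintForCredentialSource(source string) string {
	switch source {
	case "WebIdentityCredentials": // assumed stscreds web-identity provider name
		return "IRSA: check the eks.amazonaws.com/role-arn annotation, the OIDC provider, and the role trust policy"
	case "CredentialsEndpointProvider": // assumed container-endpoint provider name (Pod Identity)
		return "Pod Identity: check the ServiceAccount's Pod Identity Association and the role trust policy"
	default:
		return fmt.Sprintf("unrecognized credential source %q: see the SDK credential provider chain docs", source)
	}
}

func main() {
	fmt.Println(hintForCredentialSource("WebIdentityCredentials"))
}
```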

cheeseandcereal avatar Nov 10 '25 19:11 cheeseandcereal

@cheeseandcereal Hi! Thanks for the explanation, but then the issue should also be filed against the AWS SDK itself. I'm not in a position to do that, but you might be able to better explain that the lack of debug messages and clarity here is a real PITA.

gecube avatar Nov 10 '25 19:11 gecube