aws-sdk-go-v2
aws-sdk-go-v2 copied to clipboard
[bug] sts get caller identity does not work immediately
Describe the bug
See: https://github.com/aws/aws-sdk-go-v2/discussions/2093
Expected Behavior
I expect the call to sts.GetCallerIdentity() to succeed immediately after authenticating/receiving new sts credentials.
Current Behavior
The calls to sts.GetCallerIdentity() seem to be inconsistently failing.
Reproduction Steps
See: https://github.com/aws/aws-sdk-go-v2/discussions/2093
Possible Solution
See: https://github.com/aws/aws-sdk-go-v2/discussions/2093
Additional Information/Context
No response
AWS Go SDK V2 Module Versions Used
Compiler and Version used
go version go1.21.6 linux/amd64
Operating System and version
ubuntu
@hjkatz --
This is ultimately a fault in server-side behavior. The IAM state change associated with retrieving credentials (via AssumeRole or whatever else) does not propagate immediately after credentials are returned.
Hi @hjkatz ,
Just to chime in as well. This is not an SDK issue and is not unique to SSO. What you are experiencing is a propagation delay that is solved with a retry. This is not an issue with the SDK itself but just a nature of a distributed system.
Thanks, Ran~
This issue is now closed. Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one.
Hi @hjkatz ,
Just to chime in as well. This is not an SDK issue and is not unique to SSO. What you are experiencing is a propagation delay that is solved with a retry. This is not an issue with the SDK itself but just a nature of a distributed system.
Thanks, Ran~
@RanVaknin Kindly review the linked discussion. Retrying 5 times with a gradual backoff from 500ms to 2s still reproduces the issue. (I can provide a mvp example if desired too.)
I understand that a distributed system will require time to propogate the new token. So I'll start with my goal instead: I want to verify that the SSO session via the SDK is valid. How can I test that reliably?
@RanVaknin I realize that the discussion doesn't have as clear of an example as I'm suggesting. I'll upload a clear example later today.
// Wrapper for sts.GetCallerIdentity() that supports retries
//
// Use the context to set a maximum time that retries can be performed.
// When the context is canceled then the final response will be returned.
//
// See: https://github.com/aws/aws-sdk-go-v2/discussions/2093#discussioncomment-8455830
func StsGetCallerIdentity(ctx context.Context, client *sts.Client) (result *sts.GetCallerIdentityOutput, err error) {
if client == nil {
return nil, errs.New("cannot call StsGetCallerIdentity with nil client!")
}
attempt := 0
retries := 5
// internal lib that implements a backoff between start -> end, without jitter (false)
backoff := reliable.NewBackoff(100*time.Millisecond, 1*time.Second, false)
for attempt < retries {
attempt++
select {
case <-ctx.Done():
// ran out of time, return whatever we got last
return
default:
// continue below
}
result, err = client.GetCallerIdentity(ctx, &sts.GetCallerIdentityInput{})
if err == nil {
// success
return // final values
}
// See: https://github.com/aws/aws-sdk-go-v2/discussions/2093
if !strings.Contains(err.Error(), "api error InvalidClientTokenId") {
// non-retryable error
return // last result + error
}
// error, backoff and try again
be := backoff.Wait(ctx) // Wait() == time.Sleep(backoff.NextDuration())
if be != nil {
// context canceled
return // last values
}
}
return // last values
}
I hope this helps reproduce what we're seeing.
I'm also hopeful we can find a way via the SDK to verify the session credentials returned by the SSO credentials provider are valid. Some thoughts to get at this information:
- sts get-caller-identity (as above)
- reading a profile from
~/.aws/configthat depends onsso_session = my-sessionand see if it errors or not (relying on internalawsCLI's boto implementation) - reading refresh token from the SSO provider directly? (not yet tried) see: https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/credentials/ssocreds
- Some new implementation in
ssocredsthat provides this ability
Happy to discuss alternatives too!
@hjkatz On average, how long does it take for GetCallerIdentity to recognize the token?
This really seems like it should be something that sts models as @waitable. That would enable automatic code generation of a waiter API, in every SDK, that polls GetCallerIdentity until a success response is returned. In the absence of that trait, you are forced to re-solve that problem, as you've basically implemented your own waiter above.
In case you're not familiar with waiters -- https://aws.github.io/aws-sdk-go-v2/docs/making-requests/#using-waiters.
The most ubiquitous example in my mind would be dynamodb.TableExistsWaiter, since ddb table creates are technically asynchronous.
tl;dr waiting for async state changes is a problem we've solved at large, if I understand correctly it's just a question of pushing for sts to add some additional modeling to solve this specific case
Followup - what is the delay between you provisioning the token (looks like through sso) and first calling GetCallerIdentity?
If it's on the order of seconds, then what I said above generally stands.
If this is like minutes or hours, that doesn't seem at all like acceptable behavior in the IAM sense and would warrant further investigation.
Followup - what is the delay between you provisioning the token (looks like through sso) and first calling
GetCallerIdentity?
It's on the order of milliseconds.
For context we have a shared developer CLI that everyone uses for various commands/tools/utilities. In that CLI we annotate some commands as needing SSO to work correctly. Many of our commands are annotated and interact with AWS in some required way.
Our goal was to warn the user that they have not started their SSO session for the running command. To do this we need to check if the session is valid (not sure how to do this), and we came up with generating a token then trying to see if it works.
If it's on the order of seconds, then what I said above generally stands.
For our use case I think the order of seconds is too long. It feels like a delay for our users interacting with a CLI. I would prefer milliseconds or some approach that does the bare minimum for testing that the SSO session is valid for generating an STS token or something like that.
Sorry, my last question there was incomplete in wording. I'm trying to understand what the actual delay is you're observing between provisioning the token and then getting a successful call to GetCallerIdentity.
I gotcha. I was writing up a test case to get some real data.
Summary:
1ms: 143323
100us: 856294
10ms: 374
100ms: 9
Here's the summary of 1 million attempts. It's looking much better today than when I originally opened the ticket. For whatever reason I'm not seeing anything take longer than ~100ms but in the past I would feel the delay more in the ~1-2s range, so maybe something's improved. (It could be my network as I'm at my parents' place atm)
How about I run this test again daily and get back to you with more data?
Today's summary also seems fine:
Summary:
10ms: 359
1ms: 141759
100us: 857875
100ms: 7
I'm suspecting that the way I'm using a backoff after getting the credentials should be a naive loop with context.WithTimeout() rather than some arbitrary number of attempts. Like maybe it takes 10ms, but that's 20 attempts, yet I'm exiting after 5 attempts. I'll keep tracking this.
Today seems the same too.
Summary:
10ms: 455
1ms: 160674
100us: 838859
100ms: 12
I'm going to add additional logging into our CLI and see if I can reproduce any inconsistency today, but my suspicion is that 5 attempts isn't enough and we just need to try more times.
My recommendation in general there would not be to fix the number of attempts and instead write your "waiting" construct to accept a timeout.
This is how modeled waiters are written (well, generated, but obviously we wrote the code to do that). The waiter then just retries "infinitely" (with an increasing backoff) until it either hits the success case or exceeds the caller-provided deadline.