[bug] sts get caller identity does not work immediately

Open hjkatz opened this issue 1 year ago • 12 comments

Describe the bug

See: https://github.com/aws/aws-sdk-go-v2/discussions/2093

Expected Behavior

I expect the call to sts.GetCallerIdentity() to succeed immediately after authenticating/receiving new sts credentials.

Current Behavior

The calls to sts.GetCallerIdentity() seem to be inconsistently failing.

Reproduction Steps

See: https://github.com/aws/aws-sdk-go-v2/discussions/2093

Possible Solution

See: https://github.com/aws/aws-sdk-go-v2/discussions/2093

Additional Information/Context

No response

AWS Go SDK V2 Module Versions Used

Compiler and Version used

go version go1.21.6 linux/amd64

Operating System and version

ubuntu

hjkatz • Feb 26 '24 15:02

@hjkatz --

This is ultimately a fault in server-side behavior. The IAM state change associated with retrieving credentials (via AssumeRole or whatever else) does not propagate immediately after credentials are returned.

lucix-aws • Mar 05 '24 16:03

Hi @hjkatz,

Just to chime in as well. This is not an SDK issue and is not unique to SSO. What you are experiencing is a propagation delay that is solved with a retry. This is not an issue with the SDK itself but just the nature of a distributed system.

Thanks, Ran~

RanVaknin • Mar 05 '24 19:03

This issue is now closed. Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one.

github-actions[bot] • Mar 05 '24 19:03

> Hi @hjkatz,
>
> Just to chime in as well. This is not an SDK issue and is not unique to SSO. What you are experiencing is a propagation delay that is solved with a retry. This is not an issue with the SDK itself but just the nature of a distributed system.
>
> Thanks, Ran~

@RanVaknin Kindly review the linked discussion. Retrying 5 times with a gradual backoff from 500ms to 2s still reproduces the issue. (I can also provide an MVP example if desired.)

I understand that a distributed system will require time to propagate the new token. So I'll start with my goal instead: I want to verify that the SSO session via the SDK is valid. How can I test that reliably?

hjkatz • Mar 05 '24 19:03

@RanVaknin I realize that the discussion doesn't have as clear an example as I'm suggesting. I'll upload a clear example later today.

hjkatz • Mar 05 '24 19:03

// Wrapper for sts.GetCallerIdentity() that supports retries
//
// Use the context to set a maximum time that retries can be performed.
// When the context is canceled then the final response will be returned.
//
// See: https://github.com/aws/aws-sdk-go-v2/discussions/2093#discussioncomment-8455830
func StsGetCallerIdentity(ctx context.Context, client *sts.Client) (result *sts.GetCallerIdentityOutput, err error) {
    if client == nil {
        return nil, errs.New("cannot call StsGetCallerIdentity with nil client!")
    }

    attempt := 0
    retries := 5
    // internal lib that implements a backoff between start -> end, without jitter (false)
    backoff := reliable.NewBackoff(100*time.Millisecond, 1*time.Second, false)

    for attempt < retries {
        attempt++

        select {
        case <-ctx.Done():
            // ran out of time, return whatever we got last
            return
        default:
            // continue below
        }

        result, err = client.GetCallerIdentity(ctx, &sts.GetCallerIdentityInput{})
        if err == nil {
            // success
            return // final values
        }

        // See: https://github.com/aws/aws-sdk-go-v2/discussions/2093
        if !strings.Contains(err.Error(), "api error InvalidClientTokenId") {
            // non-retryable error
            return // last result + error
        }

        // error, backoff and try again
        be := backoff.Wait(ctx) // Wait() == time.Sleep(backoff.NextDuration())
        if be != nil {
            // context canceled
            return // last values
        }
    }

    return // last values
}

I hope this helps reproduce what we're seeing.

I'm also hopeful we can find a way via the SDK to verify the session credentials returned by the SSO credentials provider are valid. Some thoughts to get at this information:

  • sts get-caller-identity (as above)
  • reading a profile from ~/.aws/config that depends on sso_session = my-session and seeing whether it errors (relying on the AWS CLI's internal boto implementation)
  • reading refresh token from the SSO provider directly? (not yet tried) see: https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/credentials/ssocreds
  • Some new implementation in ssocreds that provides this ability

Happy to discuss alternatives too!

hjkatz • Mar 06 '24 14:03
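
The wrapper above decides whether to retry by matching on the error string. A minimal sketch of an alternative is to inspect the error code through errors.As. It assumes the SDK error (or something it wraps) exposes an ErrorCode() method, as smithy-go's APIError interface does. The local codedError interface and the fakeAPIError type are hypothetical stand-ins so the example runs without the SDK or network access.

```go
package main

import (
	"errors"
	"fmt"
)

// codedError mirrors the ErrorCode() method of smithy-go's APIError interface.
// Declaring it locally is an illustrative shortcut (an assumption of this
// sketch); real code would normally import github.com/aws/smithy-go and use
// smithy.APIError directly.
type codedError interface {
	error
	ErrorCode() string
}

// isInvalidClientToken reports whether err, or any error it wraps, carries
// the InvalidClientTokenId code discussed in this thread.
func isInvalidClientToken(err error) bool {
	var ce codedError
	return errors.As(err, &ce) && ce.ErrorCode() == "InvalidClientTokenId"
}

// fakeAPIError is a hypothetical stand-in for an SDK error, used only to
// exercise the helper locally.
type fakeAPIError struct{ code string }

func (e fakeAPIError) Error() string     { return "api error " + e.code }
func (e fakeAPIError) ErrorCode() string { return e.code }

func main() {
	// Wrap the error the way an outer layer might, to show that errors.As
	// still finds the code through the wrapping.
	wrapped := fmt.Errorf("operation error STS: GetCallerIdentity: %w",
		fakeAPIError{code: "InvalidClientTokenId"})

	fmt.Println(isInvalidClientToken(wrapped))
	fmt.Println(isInvalidClientToken(fakeAPIError{code: "AccessDenied"}))
	fmt.Println(isInvalidClientToken(errors.New("network unreachable")))
}
```

Matching on the code rather than the message keeps the check working if the error text or the wrapping around it changes.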

@hjkatz On average, how long does it take for GetCallerIdentity to recognize the token?

This really seems like it should be something that sts models as @waitable. That would enable automatic code generation of a waiter API, in every SDK, that polls GetCallerIdentity until a success response is returned. In the absence of that trait, you are forced to re-solve that problem, as you've basically implemented your own waiter above.

In case you're not familiar with waiters -- https://aws.github.io/aws-sdk-go-v2/docs/making-requests/#using-waiters.

The most ubiquitous example in my mind would be dynamodb.TableExistsWaiter, since ddb table creates are technically asynchronous.

tl;dr: waiting for async state changes is a problem we've solved at large. If I understand correctly, it's just a question of pushing for sts to add some additional modeling to solve this specific case.

lucix-aws • Mar 06 '24 15:03

Followup - what is the delay between you provisioning the token (looks like through sso) and first calling GetCallerIdentity?

If it's on the order of seconds, then what I said above generally stands.

If this is like minutes or hours, that doesn't seem at all like acceptable behavior in the IAM sense and would warrant further investigation.

lucix-aws • Mar 06 '24 15:03

> Followup - what is the delay between you provisioning the token (looks like through sso) and first calling GetCallerIdentity?

It's on the order of milliseconds.

For context we have a shared developer CLI that everyone uses for various commands/tools/utilities. In that CLI we annotate some commands as needing SSO to work correctly. Many of our commands are annotated and interact with AWS in some required way.

Our goal was to warn the user that they have not started their SSO session for the running command. To do this we need to check if the session is valid (not sure how to do this), and we came up with generating a token then trying to see if it works.

> If it's on the order of seconds, then what I said above generally stands.

For our use case I think the order of seconds is too long. It feels like a delay for our users interacting with a CLI. I would prefer milliseconds or some approach that does the bare minimum for testing that the SSO session is valid for generating an STS token or something like that.

hjkatz • Mar 06 '24 15:03
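
For the "bare minimum" local check described above, one possible sketch is to read the cached SSO token from disk and look at its expiry, without calling AWS at all. It rests on assumptions not confirmed in this thread: that the token for an sso_session is cached at ~/.aws/sso/cache/<sha1 of the session name>.json, and that the file carries an RFC 3339 expiresAt field. The session name my-session and all function names are illustrative. An unexpired cached token is only a local hint, not proof that the server accepts the session, and an expired access token may still be refreshable.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// cachedTokenPath returns the path where the token for an sso_session is
// assumed to be cached: <home>/.aws/sso/cache/<sha1(session name)>.json.
func cachedTokenPath(homeDir, sessionName string) string {
	sum := sha1.Sum([]byte(sessionName))
	return filepath.Join(homeDir, ".aws", "sso", "cache", hex.EncodeToString(sum[:])+".json")
}

// tokenExpiry extracts the expiresAt timestamp from cached token JSON.
// It assumes an RFC 3339 timestamp; other formats are reported as errors.
func tokenExpiry(data []byte) (time.Time, error) {
	var tok struct {
		ExpiresAt string `json:"expiresAt"`
	}
	if err := json.Unmarshal(data, &tok); err != nil {
		return time.Time{}, err
	}
	return time.Parse(time.RFC3339, tok.ExpiresAt)
}

// sessionLooksValid reports whether the cached token parses and has not
// expired at the given instant. It is a local heuristic only.
func sessionLooksValid(data []byte, now time.Time) bool {
	exp, err := tokenExpiry(data)
	return err == nil && now.Before(exp)
}

func main() {
	home, err := os.UserHomeDir()
	if err != nil {
		fmt.Println("cannot locate home directory:", err)
		return
	}
	data, err := os.ReadFile(cachedTokenPath(home, "my-session"))
	if err != nil {
		fmt.Println("no cached SSO token found:", err)
		return
	}
	fmt.Println("session looks valid:", sessionLooksValid(data, time.Now()))
}
```

Because this never leaves the local machine, it avoids the propagation delay entirely, at the cost of not detecting a session that was revoked server-side.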

Sorry, my last question there was incomplete in wording. I'm trying to understand what the actual delay is you're observing between provisioning the token and then getting a successful call to GetCallerIdentity.

lucix-aws avatar Mar 06 '24 15:03 lucix-aws

I gotcha. I was writing up a test case to get some real data.

Summary:
100us: 856294
1ms: 143323
10ms: 374
100ms: 9

Here's the summary of 1 million attempts. It's looking much better today than when I originally opened the ticket. For whatever reason I'm not seeing anything take longer than ~100ms but in the past I would feel the delay more in the ~1-2s range, so maybe something's improved. (It could be my network as I'm at my parents' place atm)

How about I run this test again daily and get back to you with more data?

hjkatz • Mar 06 '24 16:03
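
The test harness behind the summary above was not posted. As a rough illustration only, a summary of that shape could be produced by sorting each measured delay into the smallest bucket that contains it. The bucket bounds, the ">100ms" overflow label, and the sample durations below are all assumptions, not taken from the original test.

```go
package main

import (
	"fmt"
	"time"
)

// bucket pairs a display label with an assumed inclusive upper bound.
type bucket struct {
	label string
	upper time.Duration
}

// buckets are listed in ascending order; the bounds are illustrative guesses
// based on the labels in the summary, not the original test's definitions.
var buckets = []bucket{
	{"100us", 100 * time.Microsecond},
	{"1ms", time.Millisecond},
	{"10ms", 10 * time.Millisecond},
	{"100ms", 100 * time.Millisecond},
}

// bucketFor returns the label of the smallest bucket whose bound is >= d,
// or ">100ms" when d exceeds every bound.
func bucketFor(d time.Duration) string {
	for _, b := range buckets {
		if d <= b.upper {
			return b.label
		}
	}
	return ">100ms"
}

// summarize counts how many samples fall into each bucket.
func summarize(samples []time.Duration) map[string]int {
	counts := make(map[string]int)
	for _, d := range samples {
		counts[bucketFor(d)]++
	}
	return counts
}

func main() {
	// Hypothetical measurements standing in for real time-to-success data.
	samples := []time.Duration{
		40 * time.Microsecond,
		80 * time.Microsecond,
		700 * time.Microsecond,
		3 * time.Millisecond,
		45 * time.Millisecond,
	}
	counts := summarize(samples)

	fmt.Println("Summary:")
	// Iterate over the ordered slice rather than the map so the output
	// order is stable from run to run.
	for _, b := range buckets {
		fmt.Printf("%s: %d\n", b.label, counts[b.label])
	}
}
```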

Today's summary also seems fine:

Summary:
100us: 857875
1ms: 141759
10ms: 359
100ms: 7

I suspect that, instead of a backoff with some arbitrary number of attempts after getting the credentials, I should use a naive loop bounded by context.WithTimeout(). For example, maybe it takes 10ms, which would be 20 attempts, yet I'm exiting after 5. I'll keep tracking this.

hjkatz • Mar 07 '24 13:03

Today seems the same too.

Summary:
100us: 838859
1ms: 160674
10ms: 455
100ms: 12

I'm going to add additional logging into our CLI and see if I can reproduce any inconsistency today, but my suspicion is that 5 attempts isn't enough and we just need to try more times.

hjkatz • Mar 11 '24 15:03

In general, my recommendation there would be not to fix the number of attempts, and instead to write your "waiting" construct to accept a timeout.

This is how modeled waiters are written (well, generated, but obviously we wrote the code to do that). The waiter then just retries "infinitely" (with an increasing backoff) until it either hits the success case or exceeds the caller-provided deadline.

lucix-aws • Mar 11 '24 16:03
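
The timeout-driven approach described above can be sketched roughly as follows. This is not the SDK's generated waiter code. The waitUntil and simulate names, the doubling backoff, and the delay values are illustrative assumptions. In real use the probe would wrap a call such as client.GetCallerIdentity, returning (false, nil) for the retryable InvalidClientTokenId case and an error for anything else.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitUntil calls probe repeatedly until it reports done, returns an error,
// or ctx is canceled or times out. There is no fixed attempt limit: the
// caller-provided context is the only bound. The delay between attempts
// doubles from minDelay up to maxDelay.
func waitUntil(ctx context.Context, minDelay, maxDelay time.Duration,
	probe func(context.Context) (bool, error)) error {
	delay := minDelay
	for {
		done, err := probe(ctx)
		if err != nil {
			return err // the probe classified this failure as non-retryable
		}
		if done {
			return nil
		}

		// Sleep for the current delay, but wake early if ctx ends.
		timer := time.NewTimer(delay)
		select {
		case <-ctx.Done():
			timer.Stop()
			return ctx.Err()
		case <-timer.C:
		}

		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}

// simulate runs waitUntil against a fake probe that starts succeeding on
// attempt number succeedAt, under the given overall timeout.
func simulate(succeedAt int, timeout time.Duration) (attempts int, err error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	err = waitUntil(ctx, time.Millisecond, 20*time.Millisecond,
		func(context.Context) (bool, error) {
			attempts++
			return attempts >= succeedAt, nil
		})
	return attempts, err
}

func main() {
	attempts, err := simulate(4, 2*time.Second)
	fmt.Println("attempts:", attempts, "err:", err)

	// A probe that never succeeds ends with the context's deadline error.
	_, err = simulate(1<<30, 30*time.Millisecond)
	fmt.Println("err:", err)
}
```

With this shape, a CLI can pick a single user-facing budget (for example a few hundred milliseconds) and let the number of attempts fall out of the backoff schedule instead of guessing it up front.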