[aws-batch] refresh AWS SSO session tokens

Open joverlee521 opened this issue 7 months ago • 7 comments

Context on Slack.

If a user runs an AWS Batch job with a short-lived AWS SSO session token then their job must complete within the session duration. In cases where it's not possible to get long-lived credentials, it would be helpful to be able to refresh the session token that was initially used to run the nextstrain build command.

joverlee521 avatar Jun 16 '25 21:06 joverlee521

boto3 has support for AWS SSO credentials that can be specified via the AWS_PROFILE envvar. However, it seems like you must run the AWS CLI command aws sso login and re-authenticate in order to refresh the session tokens (source).

I don't see how we can support token refresh within an AWS Batch job...

joverlee521 avatar Jun 23 '25 20:06 joverlee521

Yeah, aws sso login requires browser input which prevents automatic refresh. I think it's safe to say that nextstrain build requires credentials to be valid for the full duration of the workflow run. We could make this more explicit by:

  1. Catching the ExpiredToken error and showing a custom message along the lines of:

    Nextstrain CLI cannot renew expired short-term AWS credentials. Please adjust your workflow to complete within the session duration configured by your AWS admin, or consult your AWS admin to provide long-term credentials.

  2. Adding a note in the docs section on AWS credentials:

    Note: If you are using an AWS session token (AWS_SESSION_TOKEN/aws_session_token), workflows must complete within the session duration configured by your AWS admin.

victorlin avatar Jun 23 '25 21:06 victorlin

Yeah, I found AWS docs for authentication without a browser, but that still depends on a person opening a link on a separate device to input a code...

Catching the ExpiredToken error and showing a custom message along the lines of:

👍 -- Briefly looked into this and I'm just realizing the final upload of the workdir is in docker-base/entrypoint-aws-batch

joverlee521 avatar Jun 23 '25 22:06 joverlee521

Sort of a drive-by comment but I found this linked in a search and I've been beating my head against the wall on this topic as well!

Based on this doc: https://docs.aws.amazon.com/singlesignon/latest/userguide/authconcept.html#sessionsconcept

The configured expiration time for the aws sso login token can be up to 90 days, but the "application session" token (which I presume is what you're using in the batch process) only receives tokens that are valid for 1 hour. I assume this is a security measure so the session can be revoked without waiting a full 90 days.

So how do you refresh the tokens? You make a CreateToken call: https://docs.aws.amazon.com/singlesignon/latest/OIDCAPIReference/API_CreateToken.html

Here's the base url: "https://oidc.#{region}.amazonaws.com/token"

And here's an example of the payload (where the sso_cache is from ~/.aws/sso/cache):

payload = %{
  "grantType" => "refresh_token",
  "clientId" => sso_cache["clientId"],
  "clientSecret" => sso_cache["clientSecret"],
  "refreshToken" => sso_cache["refreshToken"]
}
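
For readers working in Python, the same request can be sketched with the standard library alone. This is a minimal sketch, not a tested integration: the function name build_refresh_request, the placeholder cache values, the us-east-1 region, and the JSON Content-Type header are all assumptions made for illustration. Only the URL pattern and the four payload fields come from the comment above. The request is built but deliberately not sent.

```python
import json
import urllib.request


def build_refresh_request(sso_cache: dict, region: str) -> urllib.request.Request:
    """Build (but do not send) a CreateToken request for a refresh_token grant.

    `sso_cache` is assumed to be the parsed JSON of one file from
    ~/.aws/sso/cache, as in the payload above.
    """
    payload = {
        "grantType": "refresh_token",
        "clientId": sso_cache["clientId"],
        "clientSecret": sso_cache["clientSecret"],
        "refreshToken": sso_cache["refreshToken"],
    }
    return urllib.request.Request(
        url=f"https://oidc.{region}.amazonaws.com/token",
        data=json.dumps(payload).encode("utf-8"),
        # Assumption: the endpoint accepts a JSON request body.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values standing in for a real cache file (hypothetical).
example_cache = {
    "clientId": "example-client-id",
    "clientSecret": "example-client-secret",
    "refreshToken": "example-refresh-token",
}
request = build_refresh_request(example_cache, region="us-east-1")
print(request.get_method(), request.full_url)
```

Passing the result to urllib.request.urlopen would actually send it. Keeping construction separate from sending makes the payload easy to inspect first.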

I hope that helps!

What I'm trying to figure out now is whether tools are then expected to update the ~/.aws/sso/cache file with the refreshed token, but I haven't been able to find any documentation or discussion on that yet.

axelson avatar Jul 12 '25 04:07 axelson

Thanks for the info @axelson! Custom API calls to refresh the token won't work in our case because the AWS credentials are not managed by the process that's using them. Steps to reproduce:

  1. Configure AWS credentials in a terminal session through AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY/AWS_SESSION_TOKEN env vars.
  2. Use nextstrain build to start a Batch job using the credentials.
  3. The credentials are automatically passed into the AWS Batch job, which fails to upload the workdir to S3 after the credentials have expired.

While writing this reply, I realized that (3) is the real issue, and there is an easy fix: use ~/.aws/credentials instead of (1) so that the compute environment uses the instance role that is configured for it, which is always available. This is briefly noted in Nextstrain CLI's AWS Batch setup docs:

Note that if you have AWS credentials set in environment variables on the local computer when running nextstrain build --aws-batch, then those will be passed into the job, where they’ll be used instead of the role by most libraries and utilities. If this is undesirable, you can unset the environment variables when launching builds or provision your local credentials via the standard files instead of environment variables.
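
To make the note above concrete, here is a rough sketch of a pre-launch check that reports which credential variables would be passed into the job. The helper names credential_vars_in_env and uses_short_term_credentials are hypothetical and not part of Nextstrain CLI. The only facts relied on are the three variable names from step (1) and that a session token accompanies temporary credentials.

```python
import os

# The three credential variables named in step (1) above.
CREDENTIAL_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN")


def credential_vars_in_env(environ=None):
    """Return the AWS credential variables that are set to non-empty values."""
    environ = os.environ if environ is None else environ
    return [name for name in CREDENTIAL_VARS if environ.get(name)]


def uses_short_term_credentials(environ=None):
    """A session token in the environment implies temporary credentials."""
    return "AWS_SESSION_TOKEN" in credential_vars_in_env(environ)


# Example with an explicit mapping instead of the real environment.
example_env = {
    "AWS_ACCESS_KEY_ID": "example-key-id",
    "AWS_SECRET_ACCESS_KEY": "example-secret",
    "AWS_SESSION_TOKEN": "example-token",
    "HOME": "/home/example",
}
found = credential_vars_in_env(example_env)
if uses_short_term_credentials(example_env):
    print("Short-term credentials would be passed into the job:", ", ".join(found))
```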

Here's an updated plan:

  1. Catch the ExpiredToken error and show a custom message along the lines of:

    The job failed to complete due to expired short-term AWS credentials.
    To avoid this with future jobs, you have a few options.
    
    1. If your workflow does not use AWS credentials, switch from using AWS_*
       environment variables to using an ~/.aws/credentials file.
       This is the preferred method as it ensures that the job is run with a
       role that is properly configured to access the storage provider.
    
    2. If your workflow needs to use the AWS credentials, you have two options:
    
      i. Adjust your workflow to complete within the session duration configured
      by your AWS admin.
    
      ii. Consult your AWS admin to provide long-term credentials.
    
  2. Update the note in the docs section on AWS credentials:

    This method is preferred for workflows that do not need AWS credentials within the job, as it ensures that the job is run with a role that is properly configured to access the storage provider. It is also useful because it does not require you to export the environment variables in every terminal where you want to use AWS.
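
Step 1 of the plan could look roughly like the sketch below. It assumes the failure surfaces as a botocore-style error object whose response["Error"]["Code"] holds the error code. Because the final workdir upload happens in docker-base/entrypoint-aws-batch, the real detection point may differ. The set of codes, the stand-in FakeClientError class, and the function names are illustrative assumptions, and the message is abbreviated from the text proposed above.

```python
# Assumption: error codes that indicate expired temporary credentials.
EXPIRED_CODES = {"ExpiredToken", "ExpiredTokenException"}

EXPIRED_MESSAGE = (
    "The job failed to complete due to expired short-term AWS credentials.\n"
    "To avoid this with future jobs, you have a few options."
)


def is_expired_credentials_error(error) -> bool:
    """Check a botocore-style error for an expired-credentials error code."""
    response = getattr(error, "response", None) or {}
    return response.get("Error", {}).get("Code") in EXPIRED_CODES


def describe_failure(error) -> str:
    """Return the custom message for expired credentials, else a generic one."""
    if is_expired_credentials_error(error):
        return EXPIRED_MESSAGE
    return f"The job failed: {error}"


class FakeClientError(Exception):
    """Hypothetical stand-in that mimics the shape of botocore's ClientError."""

    def __init__(self, code: str):
        super().__init__(code)
        self.response = {"Error": {"Code": code}}


print(describe_failure(FakeClientError("ExpiredToken")))
print(describe_failure(FakeClientError("AccessDenied")))
```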

victorlin avatar Jul 22 '25 17:07 victorlin

there is an easy fix: use ~/.aws/credentials instead of (1) so that the compute environment uses the instance role that is configured for it, which is always available

Ah, I've been stuck in the Nextstrain use case where we do need AWS credentials for the workflow, but this is totally true! 100% agree with your plan.

joverlee521 avatar Jul 23 '25 17:07 joverlee521

there is an easy fix: use ~/.aws/credentials instead

It's also possible to unset the AWS credentials vars on a case-by-case basis using CLI options. For example, by setting them to blank values (which the AWS CLI ignores):

--env=AWS_{ACCESS_KEY_ID,SECRET_ACCESS_KEY,SESSION_TOKEN}=

or using empty files in an envdir to cause the var to be entirely removed from the environment:

$ mkdir /tmp/no-aws
$ touch /tmp/no-aws/AWS_{ACCESS_KEY_ID,SECRET_ACCESS_KEY,SESSION_TOKEN}
$ ls -l /tmp/no-aws
total 0
-rw-rw-r-- 1 tom tom 0 Sep  3 14:14 AWS_ACCESS_KEY_ID
-rw-rw-r-- 1 tom tom 0 Sep  3 14:14 AWS_SECRET_ACCESS_KEY
-rw-rw-r-- 1 tom tom 0 Sep  3 14:14 AWS_SESSION_TOKEN
$ nextstrain build … --envdir /tmp/no-aws …
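
The envdir behavior described above (an empty file removes the variable) can be illustrated with a small Python sketch. This only models the convention as stated in this comment. It is not Nextstrain CLI's implementation, and the apply_envdir helper and the example values are hypothetical.

```python
import tempfile
from pathlib import Path


def apply_envdir(envdir, environ):
    """Return a copy of `environ` with an envdir applied.

    Each file name is a variable name. An empty file removes the variable;
    otherwise the first line of the file becomes its value.
    """
    result = dict(environ)
    for entry in sorted(Path(envdir).iterdir()):
        if not entry.is_file():
            continue
        content = entry.read_text()
        if content == "":
            result.pop(entry.name, None)  # empty file: drop the variable
        else:
            result[entry.name] = content.splitlines()[0]
    return result


with tempfile.TemporaryDirectory() as tmp:
    # Same layout as /tmp/no-aws above: three empty files.
    for name in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN"):
        Path(tmp, name).touch()
    before = {
        "AWS_ACCESS_KEY_ID": "example-key-id",
        "AWS_SESSION_TOKEN": "example-token",
        "HOME": "/home/tom",
    }
    after = apply_envdir(tmp, before)

print(after)  # {'HOME': '/home/tom'}
```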

tsibley avatar Sep 03 '25 21:09 tsibley