aws-sdk-java-v2

InstanceProfileCredentialsProvider unable to refresh/recover from throttling or network problems at time of credential cache refresh

Open steveloughran opened this issue 8 months ago • 4 comments

Describe the bug

The full details and analysis are in HADOOP-19181, "IAMCredentialsProvider throttle failures".

  • the credentials retrieved by IAMCredentialsProvider are set to expire() only 1s before the expiry time returned by the instance metadata service
  • this is not enough time for the CachedSupplier to recover from a failure: on failure it schedules the retry with a jittered delay of ComparableUtils.minimum(Duration.ofMillis(exponentialBackoffMillis), Duration.ofSeconds(10)). That is, the retry can come up to 9 seconds after the credentials expire.
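The arithmetic behind the "up to 9 seconds" figure can be sketched as below. This is a minimal model and not the SDK's code: the class and method names are hypothetical, a plain compareTo stands in for ComparableUtils.minimum, and jitter is ignored so that only the 10 second cap is modelled.

```java
import java.time.Duration;

public class RefreshWindowSketch {
    // Cap quoted in the issue text: a retry is delayed by at most 10 seconds.
    public static final Duration MAX_RETRY_DELAY = Duration.ofSeconds(10);

    // Plain-Java stand-in for ComparableUtils.minimum(...); jitter is ignored (assumption).
    public static Duration retryDelay(long exponentialBackoffMillis) {
        Duration backoff = Duration.ofMillis(exponentialBackoffMillis);
        return backoff.compareTo(MAX_RETRY_DELAY) < 0 ? backoff : MAX_RETRY_DELAY;
    }

    // How long after the real expiry the retry fires, when the first refresh
    // attempt happens marginBeforeExpiry ahead of expiry and then fails.
    public static Duration overshoot(Duration marginBeforeExpiry, Duration delay) {
        Duration late = delay.minus(marginBeforeExpiry);
        return late.isNegative() ? Duration.ZERO : late;
    }

    public static void main(String[] args) {
        Duration delay = retryDelay(60_000); // backoff has already grown past the cap
        System.out.println(overshoot(Duration.ofSeconds(1), delay));  // 1 s margin
        System.out.println(overshoot(Duration.ofMinutes(15), delay)); // 15 min margin
    }
}
```

With a 1 second margin the capped retry lands 9 seconds after expiry; with a 15 minute margin the same retry still falls well inside the validity window.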

It's notable that ContainerCredentialsProvider has more resilience:

  • credentials are set to expire 15 minutes before the credentials become invalid
  • it also retries 5 times with no sleep on a request failure.
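For illustration, an immediate-retry loop of the kind described above might look like the following. It is a rough sketch only: ImmediateRetrySketch and fetchWithRetries are hypothetical names, the attempt count is a parameter because the exact semantics of "5 times" are not spelled out here, and none of this is taken from the ContainerCredentialsProvider source.

```java
import java.util.function.Supplier;

public class ImmediateRetrySketch {
    // Calls fetch up to maxAttempts times with no sleep in between and
    // rethrows the last failure if every attempt fails.
    public static <T> T fetchWithRetries(Supplier<T> fetch, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetch.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and retry immediately
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Hypothetical flaky source: fails twice (e.g. throttled), then succeeds.
        String result = fetchWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("throttled");
            }
            return "credentials";
        }, 5);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

Retrying without a sleep only helps with transient failures; against sustained throttling it is the early expiry margin that gives the provider room to recover.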

Expected Behavior

IAMCredentialsProvider to always issue valid credentials. This should include asynchronous refreshing of credentials when enabled, far enough in advance of their expiry that recovery attempts can be repeated.

Current Behavior

Invalid credentials were returned when requesting credentials to sign a request.

Reproduction Steps

See the Hadoop bug report (HADOOP-19181). The deployment was a single EC2 node running many services, long enough for the EC2 credentials to expire and need refreshing.

Possible Solution

I really don't see what we can do in the Hadoop codebase to recover from this. We could consider extending our own IAMInstanceCredentialsProvider to wrap the retry failures with our own sleep/retry, but as it'd take more than 10 seconds, API calls awaiting signing (e.g. S3 Express CreateSession) will time out.

I'd propose copying ContainerCredentialsProvider, if not with the retries then at least with the expiry time set many minutes ahead of actual credential expiry.
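To make the proposal concrete, here is a small sketch of how the margin translates into a retry budget. The names (StaleTimeSketch, staleTime, retriesBeforeExpiry) are hypothetical, and the budget is a worst-case estimate that assumes every retry waits the full 10 second cap; it does not reflect the SDK's actual scheduling.

```java
import java.time.Duration;
import java.time.Instant;

public class StaleTimeSketch {
    // The point at which cached credentials are treated as expired:
    // the real expiry minus a safety margin.
    public static Instant staleTime(Instant expiry, Duration margin) {
        return expiry.minus(margin);
    }

    // Worst-case number of retries that fit between the stale time and the
    // real expiry, assuming each retry waits the full maxRetryDelay.
    public static long retriesBeforeExpiry(Duration margin, Duration maxRetryDelay) {
        if (maxRetryDelay.isZero() || maxRetryDelay.isNegative()) {
            throw new IllegalArgumentException("maxRetryDelay must be positive");
        }
        return margin.toMillis() / maxRetryDelay.toMillis();
    }

    public static void main(String[] args) {
        Duration cap = Duration.ofSeconds(10);
        Instant expiry = Instant.parse("2024-05-27T18:00:00Z"); // example value only
        System.out.println(staleTime(expiry, Duration.ofMinutes(15)));
        System.out.println(retriesBeforeExpiry(Duration.ofSeconds(1), cap));
        System.out.println(retriesBeforeExpiry(Duration.ofMinutes(15), cap));
    }
}
```

Under these assumptions a 1 second margin leaves room for no full-length retry at all, while a 15 minute margin leaves room for 90.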

Additional Information/Context

No response

AWS Java SDK version used

2.24.6

JDK version used

An OpenJDK Java 8 build

Operating System and version

Linux

steveloughran, May 27 '24 18:05