aws-sdk-rust icon indicating copy to clipboard operation
aws-sdk-rust copied to clipboard

Env vars to configure IMDS retry and timeouts

Open kevinpark1217 opened this issue 3 years ago • 11 comments
trafficstars

Describe the feature

Other official AWS SDK libraries support specifying AWS_METADATA_SERVICE_NUM_ATTEMPTS and AWS_METADATA_SERVICE_TIMEOUT environment variables to automatically retry IMDS requests.

This feature is currently missing in the aws-sdk-rust making it more difficult for applications to handle rare credential failures originating from IMDS requests.

Use Case

When applications are deployed in Kubernetes cluster with KIAM project intercepting and redirecting IMDS requests, it can be flaky.

Applications such as Vector will out right abort and throw-away the current operation when it encounters an IMDS credential error. It would be super beneficial to have the retry abilities built-in to the SDK itself.

Proposed Solution

Implement retry logic into the library with AWS_METADATA_SERVICE_NUM_ATTEMPTS and AWS_METADATA_SERVICE_TIMEOUT environment variables support.

Other Information

No response

Acknowledgements

  • [x] I may be able to implement this feature request
  • [ ] This feature might incur a breaking change

A note for the community

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue, please leave a comment

kevinpark1217 avatar Sep 30 '22 19:09 kevinpark1217

Like @kevinpark1217 mentioned our calls to KIAM are proxied by KIAM, and while KIAM is in maintenance mode we're stuck on it for the time being.

It appears like the responsiveness of IMDS can depend on load, so supporting these env vars could help users increase reliability when something running the SDK is sharing an instance with a caller that is hitting IMDS hard.

I think it's possible to adjust the timeout in code, but supporting the env vars would make it easy for everyone to tweak these settings for all the downstream applications of the SDK.

unkempthenry avatar Sep 30 '22 19:09 unkempthenry

It looks like the Go SDK also has this feature request: https://github.com/aws/aws-sdk-go/issues/3495

Seems like these env vars need to be standardized across the SDKs.

jdisanti avatar Sep 30 '22 20:09 jdisanti

looks like the support across the different SDKs is tracked here https://docs.aws.amazon.com/sdkref/latest/guide/feature-ec2-instance-metadata.html

unkempthenry avatar Sep 30 '22 23:09 unkempthenry

I would be interested in working on this, couple of clarifying questions to the team:

  1. ec2 metadata page linked above defines default max_retries should be 1 but current value is 4, should this be changed to 1 to be consistent cross SDKs?
  2. Does the AWS_METADATA_SERVICE_TIMEOUT represent both connection and read timeout? i.e. if set to 5, connection timeout would be set to 5 and read timeout would be set to 5?
  3. builder already exposes connection_timeout read_timeout functions but does not use those values anywhere, how do you see the single timeout environment configuration interacting with those two configurations?
    • I'm thinking I would take on fixing this in the PR as well.

Sigurthorb avatar Oct 02 '22 14:10 Sigurthorb

@Sigurthorb Oops, I didn't see your comment on taking on this issue. I took a stab at it with #626 PR.

should this be changed to 1 to be consistent cross SDKs?

I think I should probably change this to 1 for consistency. Let me know if anyone else has opinions on this.

My question for the reviewers, what's the standard way of handling parsing error in the AWS sdk?

  1. Fail outright and terminate
  2. Display warning, but use default values
  3. Use default values silently

kevinpark1217 avatar Oct 03 '22 02:10 kevinpark1217

You technically called first dibs with your checkmark to I may be able to implement this feature request and you reported it. Good luck!

Sigurthorb avatar Oct 03 '22 23:10 Sigurthorb

@Sigurthorb Oops, I didn't see your comment on taking on this issue. I took a stab at it with #626 PR.

should this be changed to 1 to be consistent cross SDKs?

I think I should probably change this to 1 for consistency. Let me know if anyone else has opinions on this.

My question for the reviewers, what's the standard way of handling parsing error in the AWS sdk?

1. Fail outright and terminate

2. Display warning, but use default values

3. Use default values silently

When encountering parsing errors, we typically disregard the config source that produced them and move on to other config sources in the provider chain. If all configs are invalid, then we will either set a default or panic depending on the specific config variable.

Velfi avatar Oct 04 '22 17:10 Velfi

@Velfi Sorry for taking a while to follow up on this. I was preparing the slides for the AWS reInvent presentation.

I have opened a new PR in the smithy-rs repo. Based on your error handling suggestion, I simplified the code a little bit. For this initial PR, I left the EndpointSource struct untouched for code readability and initial review.

I think it would be good to implement this feature, since Python SDK already supports this. But also, the IMDS retry isn't well exposed for programmatically configuring by other parts of the Rust SDK.

https://github.com/awslabs/smithy-rs/pull/1821

kevinpark1217 avatar Oct 06 '22 22:10 kevinpark1217

Also, there appears to be a regression where self.read_timeout and self.connect_timeout fields is now unused in the imds/client.rs

Checkout here. (takes a while to auto scroll to the correct position)

Edit: Opened an issue and a PR (https://github.com/awslabs/smithy-rs/issues/1822)

kevinpark1217 avatar Oct 06 '22 22:10 kevinpark1217

I have found another way to implement timeout and retry attempts through code rather than environment variables. Please check out https://github.com/awslabs/smithy-rs/pull/1867

kevinpark1217 avatar Oct 18 '22 02:10 kevinpark1217

@kevinpark1217 sorry for the slowness, we're still working on getting the next version released.

Velfi avatar Oct 19 '22 14:10 Velfi

@kevinpark1217 - Your changes to override the IMDS client have been released with release-2022-10-26.

I'm going to leave this feature request open for now to track interest in this configuration as environment variables.

jdisanti avatar Oct 27 '22 23:10 jdisanti