
Aws::CloudWatch::Client doesn't automatically refresh credentials

Open Roy-Gal-Git opened this issue 11 months ago • 56 comments

Describe the bug

We use the sidekiq-cloudwatchmetrics gem to send Sidekiq-related metrics to CloudWatch. Occasionally, we encounter the following error:

Aws::CloudWatch::Errors::ExpiredToken: The security token included in the request is expired

Upon investigation, we found that the gem initializes an Aws::CloudWatch::Client, which uses Aws::InstanceProfileCredentials. These credentials are supposed to auto-refresh before expiration. However, the error indicates that the token is not being refreshed in time, causing the metrics to fail. Once the error starts, it repeats, and the metrics fail to recover.

Regression Issue

  • [ ] Select this option if this issue appears to be a regression.

Expected Behavior

The Aws::InstanceProfileCredentials should automatically refresh the security token before expiration (approximately 5 minutes before expiry) and ensure that the put_metric_data API call succeeds without errors. Metrics should continue to be sent to CloudWatch without interruption.

Current Behavior

The Aws::CloudWatch::Client raises Aws::CloudWatch::Errors::ExpiredToken errors when attempting to send metrics.

Once this error occurs:

  1. The metrics stop being sent to CloudWatch.
  2. The error persists until the process is restarted or manually handled.
  3. The auto-refresh mechanism in Aws::InstanceProfileCredentials does not appear to trigger correctly in this context.

Reproduction Steps

  1. Configure the sidekiq-cloudwatchmetrics gem in a Sidekiq setup that uses an IAM role (e.g., on EC2 or ECS). You can find the configuration I used here.
  2. Allow the process to run for an extended period, relying on auto-refreshing credentials from Aws::InstanceProfileCredentials.
  3. Observe that, after some time, the following error may appear: Aws::CloudWatch::Errors::ExpiredToken: The security token included in the request is expired
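For reference, a minimal configuration sketch of that kind of setup (illustrative only, not my real initializer; the `enable!` entry point and the region are assumptions based on the gem's documented usage):

```ruby
# config/initializers/sidekiq.rb (sketch)
require "aws-sdk-cloudwatch"
require "sidekiq/cloudwatchmetrics"

# On EC2/ECS with no other credential sources configured, this client
# resolves Aws::InstanceProfileCredentials via the default provider chain.
client = Aws::CloudWatch::Client.new(region: "us-east-1")

Sidekiq::CloudWatchMetrics.enable!(client: client)
```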

Possible Solution

No response

Additional Information/Context

Here are some useful links for more context regarding my problem:

  • The issue I opened in the sidekiq-cloudwatchmetrics repo https://github.com/sj26/sidekiq-cloudwatchmetrics/issues/48
  • The PR I opened to fix the problem in the sidekiq-cloudwatchmetrics repo https://github.com/sj26/sidekiq-cloudwatchmetrics/pull/49/files

Gem name ('aws-sdk', 'aws-sdk-resources' or service gems like 'aws-sdk-s3') and its version

aws-sdk-cloudwatch

Environment details (Version of Ruby, OS environment)

Ruby - 3.2.5, OS - Debian GNU/Linux 12 (bookworm), Docker image - ruby:3.2.5-slim, sidekiq-cloudwatchmetrics - 2.6.0

Roy-Gal-Git avatar Jan 07 '25 10:01 Roy-Gal-Git

You are correct that the InstanceProfileCredentials should be auto-refreshing (if they are within 5 minutes of expiration).

We added "static stability" logic a while back that allows expired credentials to be used after their expiration when the underlying credential service isn't refreshing credentials on the hosts (e.g., an outage). Are you seeing any logs like Attempting credential expiration extension due to a credential service availability issue?

Have you tested your PR that does the explicit credentials refresh - does that fix the issue for you?

alextwoods avatar Jan 07 '25 16:01 alextwoods

Hi @alextwoods,

I’m unable to reproduce the error, so I can’t verify whether the issue has been resolved. Even if my PR makes it to production, I won’t be able to confirm if it fixes the issue for me, as the error occurs randomly across different production environments.

Are you seeing any logs like Attempting credential expiration extension due to a credential service availability issue?

No, I didn't find any logs such as what you suggested. I tried looking for anything containing the words 'credential' and 'attempting' (besides the exact match of the full message you sent) but I didn't find anything either.

Roy-Gal-Git avatar Jan 07 '25 22:01 Roy-Gal-Git

This may be a naive question, but are you certain that the other gem is resolving instance profile credentials, and that credentials are not set globally or in some other way? Reading your other thread, it looks like you tried creating a client on your own and the credentials defaulted to instance profile credentials (which is last in the chain), but it's possible that in your production app they are set to something else. I would actually output an instance of your credentials provider after deploying, somehow.

mullermp avatar Jan 07 '25 22:01 mullermp

Hi @mullermp,

are you certain that the other gem is resolving instance profile credentials, and credentials are not set globally or in some other way?

The other gem's CloudWatch publisher requires a client argument, and users can determine how this client is configured. In my case, I explicitly pass Aws::CloudWatch::Client.new(region: REGION_PLACEHOLDER), which resolves to instance profile credentials in the production environment.

I would actually output an instance of your credentials provider after deploying somehow

I shared this in the other thread as well, but here’s the truncated output of @client.config (where @client is the CloudWatch client I pass to the gem’s publisher) from a production environment:

#<struct
	credentials=
		#<Aws::InstanceProfileCredentials:0x00007f78ee260660
			...
			@credentials=#<Aws::Credentials access_key_id="PLACEHOLDER_ACCESS_KEY_ID">,
			...>
	...
	secret_access_key=nil
	session_token=nil
	...>

This confirms that the client resolves to Aws::InstanceProfileCredentials, as expected in this setup.

Roy-Gal-Git avatar Jan 08 '25 08:01 Roy-Gal-Git

Yes I saw that. Try passing an instance of instance profile credentials explicitly to the client in your production code. Looking at the gem source, I don't think any patch is necessary. The SDK will already refresh and retry if necessary.
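For example (sketch only; the region is a placeholder):

```ruby
require "aws-sdk-cloudwatch"

# Pin the provider explicitly so there is no ambiguity about what the
# default chain resolved.
credentials = Aws::InstanceProfileCredentials.new
client = Aws::CloudWatch::Client.new(region: "us-east-1", credentials: credentials)
```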

mullermp avatar Jan 08 '25 12:01 mullermp

@mullermp How come it doesn't already refresh and retry if it does use an instance of instance profile credentials?

Roy-Gal-Git avatar Jan 08 '25 13:01 Roy-Gal-Git

It does and should. You can verify that by fetching instance credentials from irb or pry on your host and inspecting them before and after expiration. How did you verify the credentials class in production? Using a Rails console or a print statement? Or through local testing? By passing in credentials explicitly, we can be sure that is what is used. What versions of aws-sdk-cloudwatch and aws-sdk-core are you using?
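The refresh window itself can be reasoned about with a small stdlib-only sketch. Here `near_expiration?` is a hypothetical helper, and the 300-second window is my understanding of the SDK's sync refresh threshold, not a quoted constant; the commented lines show how it would be used against real instance credentials:

```ruby
# Hypothetical helper mirroring (as I understand it) the SDK's sync refresh
# window: refreshing credential providers refresh once expiration is within
# roughly five minutes.
def near_expiration?(expiration, now = Time.now, window = 300)
  expiration - now < window
end

# Against real instance credentials (from irb/pry on the host):
#   creds = Aws::InstanceProfileCredentials.new
#   creds.credentials  # force resolution
#   near_expiration?(creds.expiration)
puts near_expiration?(Time.now + 600)  # ten minutes left
puts near_expiration?(Time.now + 120)  # inside the window
```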

mullermp avatar Jan 08 '25 13:01 mullermp

How did you verify the credentials class in production?

Rails console. I initialized an Aws::CloudWatch::Client object the way I do for the sidekiq-cloudwatchmetrics publisher and performed the checks on that object. You're saying the client credentials could differ when tested from a Rails console?

What version of aws-sdk-cloudwatch and aws-sdk-core are you using?

aws-sdk-cloudwatch (1.108.0), aws-sdk-core (3.214.0)

Roy-Gal-Git avatar Jan 08 '25 13:01 Roy-Gal-Git

They shouldn't differ. Trust but verify, though: did you pass the Rails env as production in your console? Are you setting global configuration anywhere, such as Aws.config[key]?

Otherwise I'm not sure what may be going on. I'm inclined to believe it's something with your environment because if those credentials don't refresh then there would be a lot more noise and our internal usage would stop working.

mullermp avatar Jan 08 '25 13:01 mullermp

did you pass rails env as production in your console?

We use a setup where the Rails environment name matches the server name (e.g., us, eu). The correct Rails env is automatically passed to the Rails console as an exported environment variable.

Are you setting global configuration anywhere such as Aws.config[key]?

No, we are not setting Aws.config[key] anywhere in our application.

if those credentials don't refresh then there would be a lot more noise and our internal usage would stop working.

Interestingly, the only component affected is the sidekiq-cloudwatchmetrics publisher. The app continues to function as expected, but metrics stop being sent to CloudWatch, which triggers an alarm after a short while.

Roy-Gal-Git avatar Jan 08 '25 14:01 Roy-Gal-Git

The app continues to function as expected, but metrics stop being sent to CloudWatch, which triggers an alarm after a short while.

Do you use other AWS services in your application? If so, it's interesting that you would get expired token issues only in CloudWatch.

Do you know if you are using IMDS V1 or V2? And do you know if you've disabled V1 through configuration (for example, the AWS_EC2_METADATA_V1_DISABLED env variable)? There are some differences in the refresh process for credentials between V1 and V2.

alextwoods avatar Jan 08 '25 16:01 alextwoods

@alextwoods

Do you use other AWS services in your application?

Yes, S3, SSM, STS...

Do you know if you are using IMDS V1 or V2?

We're using IMDS V2.

And do you know if you've disabled V1 through configuration

Yes, IMDS V1 is disabled.

Roy-Gal-Git avatar Jan 09 '25 08:01 Roy-Gal-Git

I believe the backtrace of the error could provide valuable insights for the investigation:

/gems/aws-sdk-core-3.214.0/lib/seahorse/client/plugins/raise_response_errors.rb:17 in call
/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/checksum_algorithm.rb:111 in call
/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/jsonvalue_converter.rb:16 in call
/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/invocation_id.rb:16 in call
/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/idempotency_token.rb:19 in call
/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/param_converter.rb:26 in call
/gems/aws-sdk-core-3.214.0/lib/seahorse/client/plugins/request_callback.rb:89 in call
/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/response_paging.rb:12 in call
/gems/aws-sdk-core-3.214.0/lib/seahorse/client/plugins/response_target.rb:24 in call
/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/telemetry.rb:39 in block in call
/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/telemetry/no_op.rb:29 in in_span
/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/telemetry.rb:53 in span_wrapper
/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/telemetry.rb:39 in call
/gems/aws-sdk-core-3.214.0/lib/seahorse/client/request.rb:72 in send_request
/gems/aws-sdk-cloudwatch-1.108.0/lib/aws-sdk-cloudwatch/client.rb:3865 in put_metric_data
/gems/sidekiq-cloudwatchmetrics-2.6.0/lib/sidekiq/cloudwatchmetrics.rb:235 in block in publish
/gems/sidekiq-cloudwatchmetrics-2.6.0/lib/sidekiq/cloudwatchmetrics.rb:234 in each
/gems/sidekiq-cloudwatchmetrics-2.6.0/lib/sidekiq/cloudwatchmetrics.rb:234 in each_slice
/gems/sidekiq-cloudwatchmetrics-2.6.0/lib/sidekiq/cloudwatchmetrics.rb:234 in publish
/gems/sidekiq-cloudwatchmetrics-2.6.0/lib/sidekiq/cloudwatchmetrics.rb:77 in run
/gems/sidekiq-6.5.12/lib/sidekiq/component.rb:8 in watchdog
/gems/sidekiq-6.5.12/lib/sidekiq/component.rb:17 in block in safe_thread

Roy-Gal-Git avatar Jan 09 '25 15:01 Roy-Gal-Git

I don't think that backtrace is helpful. One thing you can try is adding http_wire_trace: true to your client to see wire logs, and perhaps also adding http_debug_output to an instance of the instance profile credentials. This might give clues about what requests are being made. Note that these do not redact your credentials, so be careful with logging and sharing.

mullermp avatar Jan 10 '25 00:01 mullermp

Yeah, if possible, adding http_debug_output to an instance of the instance profile credentials would be the most useful, as otherwise there's no logging from the credential refresh. My best guess is that something unusual is going on with the refresh (maybe an error from the instance that's getting ignored?).

If you can't enable http_debug_output on the instance profile credentials that you use to configure your CloudWatch client because of logging concerns, you could at least provide a before_refresh proc:

before_refresh = proc do |creds| # called with self
  # do some logging....
end
credentials = Aws::InstanceProfileCredentials.new(before_refresh: before_refresh)
client = Aws::CloudWatch::Client.new(credentials: credentials)

alextwoods avatar Jan 10 '25 00:01 alextwoods

Are the http_debug_output logs written using the default Rails logger? If so, is there a way for me to send those logs to a specific file?

Roy-Gal-Git avatar Jan 10 '25 12:01 Roy-Gal-Git

You can pass an instance of a logger (or technically anything that responds to <<) as the http_debug_output, e.g.:

credentials = Aws::InstanceProfileCredentials.new(http_debug_output: Rails.logger)

But again as a general warning - this will log the actual requests and responses which include credentials.
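To keep those wire logs out of the main application log, a dedicated file-backed logger works too (stdlib-only sketch; the log path is a placeholder):

```ruby
require "logger"

# Dedicated logger so IMDS wire logs don't mix with the application log.
# Reminder: http_debug_output writes raw requests/responses, credentials included.
imds_logger = Logger.new("imds_debug.log")

# credentials = Aws::InstanceProfileCredentials.new(http_debug_output: imds_logger)
imds_logger.info("IMDS debug logging enabled")
```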

alextwoods avatar Jan 10 '25 16:01 alextwoods

Hey @alextwoods, I had to verify it with our management team, and sadly I can't do anything involving http_debug_output.

It seems like the current direction for resolving this issue is to use the forked version of sidekiq-cloudwatchmetrics that contains my fix.

If I find that it does solve the issue, would that prove the need for a fix on your side?

Is there something else I can do to provide extra information?

Roy-Gal-Git avatar Jan 15 '25 12:01 Roy-Gal-Git

I think we would want proof and a root cause before taking a patch. Your fix in the other gem is a brute-force solution that simply retries after refreshing. I am inclined to believe it is something with your setup or environment.

mullermp avatar Jan 15 '25 14:01 mullermp

@Roy-Gal-Git - yeah, that makes sense. In that case, I'd recommend at least adding a before_refresh proc that logs when refreshes are happening (see above: https://github.com/aws/aws-sdk-ruby/issues/3162#issuecomment-2581516054).
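A stdlib-only sketch of such a logging proc (the destination and message are illustrative; in real use the SDK invokes the proc on each refresh, and here it's called manually just to show the shape):

```ruby
require "logger"
require "stringio"

log_io = StringIO.new          # stand-in for a real log destination
refresh_logger = Logger.new(log_io)

# Logs every refresh attempt; the SDK calls this with the credentials provider.
before_refresh = proc do |provider|
  refresh_logger.info("Refreshing credentials (#{provider.class})")
end

# credentials = Aws::InstanceProfileCredentials.new(before_refresh: before_refresh)
before_refresh.call(:demo)     # manual call just to exercise the proc
puts log_io.string
```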

I do think it will be useful to know whether your sidekiq patch does address the issue.

An additional possible cause: time drift on your instance. If the system clock is off, the refresh of credentials on the instance itself will be incorrect. Are you running anything like chrony, the Amazon Time Sync Service, or some other NTP setup?

alextwoods avatar Jan 15 '25 16:01 alextwoods

@alextwoods

I do think it will be useful to know whether you're sidekiq patch does address the issue.

I see that the current label is response-requested, but it'll probably take more than a week for me to gain new insights from our environments...

Are you running anything like chrony, Amazon time Sync service or some other NTP setup?

No, we don't use those.

Roy-Gal-Git avatar Jan 23 '25 09:01 Roy-Gal-Git

@mullermp @alextwoods I have 2 small updates:

  1. We found two more recurring errors that might share the same root cause:
  • Aws::Errors::MissingCredentialsError: unable to sign request without credentials set. Backtrace:
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/sign.rb:119 in rescue in initialize
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/sign.rb:108 in initialize
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/sign.rb:33 in new
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/sign.rb:33 in signer_for
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/sign.rb:45 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/transfer_encoding.rb:27 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/helpful_socket_errors.rb:12 in call
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/s3_signer.rb:53 in call
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/redirects.rb:20 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/user_agent.rb:69 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/retry_errors.rb:365 in block in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/user_agent.rb:60 in metric
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/retry_errors.rb:385 in with_metric
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/retry_errors.rb:365 in call
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/md5s.rb:32 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/http_checksum.rb:20 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/endpoint_pattern.rb:30 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/checksum_algorithm.rb:137 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/request_compression.rb:94 in block in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/request_compression.rb:104 in with_metric
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/request_compression.rb:94 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/rest/content_type_handler.rb:27 in call
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/express_session_auth.rb:50 in block in call
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/express_session_auth.rb:56 in with_metric
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/express_session_auth.rb:50 in call
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/expect_100_continue.rb:23 in call
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/bucket_name_restrictions.rb:21 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/rest/handler.rb:10 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/recursion_detection.rb:18 in call
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/endpoints.rb:52 in block in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/user_agent.rb:60 in metric
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/endpoints.rb:66 in with_metrics
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/endpoints.rb:52 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/endpoint_discovery.rb:84 in call
/gems/aws-sdk-core-3.212.0/lib/seahorse/client/plugins/endpoint.rb:46 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/param_validator.rb:26 in call
/gems/aws-sdk-core-3.212.0/lib/seahorse/client/plugins/raise_response_errors.rb:16 in call
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/sse_cpk.rb:24 in call
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/dualstack.rb:21 in call
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/accelerate.rb:43 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/checksum_algorithm.rb:111 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/jsonvalue_converter.rb:16 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/invocation_id.rb:16 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/idempotency_token.rb:19 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/param_converter.rb:26 in call
/gems/aws-sdk-core-3.212.0/lib/seahorse/client/plugins/request_callback.rb:89 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/response_paging.rb:12 in call
/gems/aws-sdk-core-3.212.0/lib/seahorse/client/plugins/response_target.rb:24 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/telemetry.rb:39 in block in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/telemetry/no_op.rb:29 in in_span
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/telemetry.rb:53 in span_wrapper
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/telemetry.rb:39 in call
/gems/aws-sdk-core-3.212.0/lib/seahorse/client/request.rb:72 in send_request
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/client.rb:16571 in put_object
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/file_uploader.rb:66 in block in put_object
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/file_uploader.rb:55 in block in open_file
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/file_uploader.rb:55 in open
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/file_uploader.rb:55 in open_file
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/file_uploader.rb:65 in put_object
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/file_uploader.rb:46 in block in upload
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/user_agent.rb:60 in metric
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/file_uploader.rb:40 in upload
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/customizations/object.rb:477 in block in upload_file
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/user_agent.rb:60 in metric
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/customizations/object.rb:476 in upload_file

And

  • Aws::Sigv4::Errors::MissingCredentialsError: missing credentials, provide credentials with one of the following options: - :access_key_id and :secret_access_key - :credentials - :credentials_provider. Backtrace:
/gems/aws-sigv4-1.9.1/lib/aws-sigv4/signer.rb:717 in extract_credentials_provider
/gems/aws-sigv4-1.9.1/lib/aws-sigv4/signer.rb:147 in initialize
/gems/aws-sdk-s3-1.113.2/lib/aws-sdk-s3/plugins/s3_signer.rb:235 in new
/gems/aws-sdk-s3-1.113.2/lib/aws-sdk-s3/plugins/s3_signer.rb:235 in build_v4_signer
/gems/aws-sdk-s3-1.113.2/lib/aws-sdk-s3/plugins/s3_signer.rb:14 in block in <class:S3Signer>
/gems/aws-sdk-core-3.204.0/lib/seahorse/client/configuration.rb:72 in call
/gems/aws-sdk-core-3.204.0/lib/seahorse/client/configuration.rb:215 in block in resolve_defaults
/gems/aws-sdk-core-3.204.0/lib/seahorse/client/configuration.rb:59 in each
/gems/aws-sdk-core-3.204.0/lib/seahorse/client/configuration.rb:59 in each
/gems/aws-sdk-core-3.204.0/lib/seahorse/client/configuration.rb:214 in resolve_defaults
/gems/aws-sdk-core-3.204.0/lib/seahorse/client/configuration.rb:207 in value_at
/gems/aws-sdk-core-3.204.0/lib/seahorse/client/configuration.rb:191 in block in resolve
/usr/local/lib/ruby/3.2.0/set.rb:511 in each_key
/usr/local/lib/ruby/3.2.0/set.rb:511 in each
/gems/aws-sdk-core-3.204.0/lib/seahorse/client/configuration.rb:191 in resolve
/gems/aws-sdk-core-3.204.0/lib/seahorse/client/configuration.rb:179 in apply_defaults
/gems/aws-sdk-core-3.204.0/lib/seahorse/client/configuration.rb:152 in build!
/gems/aws-sdk-core-3.204.0/lib/seahorse/client/base.rb:65 in build_config
/gems/aws-sdk-core-3.204.0/lib/seahorse/client/base.rb:21 in initialize
/gems/aws-sdk-s3-1.113.2/lib/aws-sdk-s3/client.rb:435 in initialize
/gems/aws-sdk-core-3.204.0/lib/seahorse/client/base.rb:102 in new
/gems/aws-sdk-s3-1.113.2/lib/aws-sdk-s3/resource.rb:28 in initialize

Do you think that the 3 errors are related in any kind of way?

  2. My sidekiq-cloudwatchmetrics patch has been running in one of our environments for two weeks now, and the original error hasn't occurred yet. I don't think that's enough time to conclude that the fix works, but it looks like a good direction.

Roy-Gal-Git avatar Feb 05 '25 12:02 Roy-Gal-Git

With the test you are doing with the patch, are you also running the before_refresh logging?

If the patch does work, we'll want to understand if the SDK is trying to refresh credentials already.

Those errors could be related - in general, both are caused by some environment setup issue that results in the SDK being unable to find valid credentials during client initialization. Assuming these errors are happening in code running on EC2/ECS, it would indicate a potential issue with IMDS not having valid credentials. This could happen, for example, if there is a clock skew issue on the host or something else is misconfigured.

alextwoods avatar Feb 05 '25 17:02 alextwoods

@alextwoods

With the test for you are doing with the patch, are you also running with the before_refresh logging?

No, it's just the patch.

Assuming these errors are happening from code running on EC2/ECS it would indicate a potential issue with IMDS not having valid credentials.

It runs on EC2 machines that are managed by an Elastic Beanstalk environment.

Roy-Gal-Git avatar Feb 06 '25 14:02 Roy-Gal-Git

Your previous error involved expired credentials (a valid Ruby object holding merely invalid credentials), but this new error indicates your entire credentials object has been blown away. Here is where that is raised: https://github.com/aws/aws-sdk-ruby/blob/version-3/gems%2Faws-sdk-core%2Flib%2Faws-sdk-core%2Fplugins%2Fsign.rb. I still think it's something with your setup or environment - no SDK code that I'm aware of will remove credentials from configuration.

mullermp avatar Feb 06 '25 14:02 mullermp

@mullermp Interesting!

Is it possible that initializing Aws::S3::Resource objects with access_key_id and secret_access_key set to nil could be causing this issue?

Aws::S3::Resource.new(region: 'REGION_PLACEHOLDER', access_key_id: nil, secret_access_key: nil)

I also noticed that sometimes we pass a hardcoded region, while other times we use a memoized instance of Aws::EC2Metadata to fetch the EC2 instance's region from the metadata API. Could this discrepancy contribute to the credential drifts you mentioned?

For context, we always use the latest AMI for our Elastic Beanstalk environments and don’t make any manual changes to it. In practice, we run everything with Docker on ECS. Could something in this setup be affecting credential persistence?

Roy-Gal-Git avatar Feb 09 '25 09:02 Roy-Gal-Git

Why are you setting a fake region and nil credentials in your client/resource?

mullermp avatar Feb 09 '25 18:02 mullermp

@mullermp I just changed the region to the placeholder. It's 'us-east-1' if it matters. The credentials are nil so they'll be fetched from the default credential provider chain.

Roy-Gal-Git avatar Feb 10 '25 08:02 Roy-Gal-Git

@alextwoods @mullermp

An additional possible cause: time drift on your instance. If the system clock is off, the refresh of credentials on the instance itself will be incorrect. Are you running anything like chrony, the Amazon Time Sync Service, or some other NTP setup?

I was wrong when I said we don't use those services.

The instance is using Amazon Time Sync Service (169.254.169.123) with Chrony. The system time is accurate, with an offset of only a few nanoseconds, and chronyc tracking confirms synchronization is stable. Time drift is not an issue.

Roy-Gal-Git avatar Feb 10 '25 09:02 Roy-Gal-Git

You should not need to set any keys to nil at all for the credential provider chain to work; they can simply be omitted. At some point, your credentials are being blown away. I'm not sure why, but perhaps it's usage in your code base that we cannot see.
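For example (sketch; the region is a placeholder):

```ruby
require "aws-sdk-s3"

# No credential keys at all: the default provider chain (env vars,
# shared config, instance profile, ...) is consulted automatically.
s3 = Aws::S3::Resource.new(region: "us-east-1")
```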

mullermp avatar Feb 10 '25 12:02 mullermp