aws_s3 Source Failing to assume_role After Update to 0.36.0
### A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
### Problem
We just updated our Vector deployment to version 0.36.0, and a pipeline of ours that uses the `aws_s3` source with an `assume_role` granting access to an SQS queue and S3 bucket suddenly stopped being able to receive messages from the SQS queue. This pipeline had been running without issue for four months on previous Vector versions (the latest we used before this was 0.35.0), so it seems like a change introduced in 0.35.1 or later caused the problem.
Our other pipelines that don't rely on an `assume_role` are still working fine on the newest version, which leads us to believe something has changed in how `auth.assume_role` works in the `aws_s3` source. Does anyone have any ideas? We have a hunch it could be related to this change in 0.35.1, but aren't sure: https://github.com/vectordotdev/vector/commit/c2cc94a262ecf39798009d29751d59cc97baa0c5#diff-d6eef19144b594971c40fce6c9e73777c346036e50277c3874a665ce31bccfb8
Debug logs show an issue we've never seen before re: environment variables, which seems like the culprit: `environment variable not set (CredentialsNotLoaded(CredentialsNotLoaded { source: "environment variable not set" }))`
### Configuration
```yaml
our_source_name:
  type: aws_s3
  compression: auto
  region: eu-central-1
  sqs:
    delete_message: true
    queue_url: source_sqs_queue_url
  auth:
    assume_role: assume_role_arn
    region: "us-east-1"
```
### Version
vector 0.36.0
### Debug Output
```text
2024-02-14T21:56:29.202231Z DEBUG assume_role:provide_credentials{provider=default_chain}: aws_config::meta::credentials::chain: loaded credentials provider=WebIdentityToken
2024-02-14T21:56:29.202336Z DEBUG assume_role: aws_config::sts::assume_role: retrieving assumed credentials
2024-02-14T21:56:29.202384Z DEBUG assume_role:provide_credentials{provider=default_chain}: aws_config::meta::credentials::chain: provider in chain did not provide credentials provider=Environment context=the credential provider was not enabled: environment variable not set (CredentialsNotLoaded(CredentialsNotLoaded { source: "environment variable not set" }))
2024-02-14T21:56:29.202401Z DEBUG assume_role:provide_credentials{provider=default_chain}: aws_config::meta::credentials::chain: provider in chain did not provide credentials provider=Profile context=the credential provider was not enabled: No profiles were defined (CredentialsNotLoaded(CredentialsNotLoaded { source: NoProfilesDefined }))
```

### Example Data
No response
### Additional Context
Here are the error logs we're seeing related to this issue:

```text
vector::internal_events::aws_sqs: Failed to fetch SQS events. error=dispatch failure error_code="failed_fetching_sqs_events" error_type="request_failed" stage="receiving" internal_log_rate_limit=true
```
### References
No response
For the time being, I've rolled back to Vector version 0.35.0 for this pipeline, and that has resolved the issue. I would like to be able to use some of the features in 0.36.0, though, so it would be nice to get this problem sorted out. Thanks!
I asked this in Discord, but asking here too in case others stumble upon this issue and can fill in some of the blanks:
- How are you providing the credentials used to assume the role? Environment variables? Instance profile (IMDSv2)? Something else?
- It seems like this is just a snippet of the logs. Would it be possible to provide the full set?
Thanks!
Hey @jszwedko! So our Vector pipelines run in Kubernetes, and each one is deployed with a service account that has been configured to assume a main Vector role in AWS that has access to all of the various AWS resources. However, in the case of this specific pipeline, we must assume a role in another account to access the specific S3 bucket that the logs are in. The main Vector AWS role has been granted access to assume this cross-account role and access the logs. This has worked as expected up until this newest update.
Also re: the debug logs: the full set of logs was just that snippet I pasted repeated over and over again. I can try to update the deployment version again to replicate the log set and post more of them if that's useful though. Thanks!
Gotcha, thanks! Just to confirm are you deploying in EKS and using https://docs.aws.amazon.com/eks/latest/userguide/pod-configuration.html to configure a service role for each pod?
Exactly!
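For anyone else landing here with a similar setup, the wiring described above is roughly the following. This is only a minimal sketch, not the reporter's actual manifests: the names, namespace, and role ARN are placeholders, and the annotation is the standard IRSA mechanism EKS uses to associate an IAM role with a pod's service account.

```yaml
# Sketch of the IRSA setup described above (names and ARNs are placeholders).
# The pod's service account is annotated with the "main" Vector role; the
# aws_s3 source's auth.assume_role then hops to the cross-account role that
# owns the S3 bucket and SQS queue.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: vector
  namespace: logging
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111111111111:role/vector-main-role
```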
+1, this happens to me as well. I can't verify the debug log, but downgrading to 0.35 fixed the issue. The setup is basically the same as gswai123's.
We have the same issue using aws_s3 as a sink, running a Vector pod in EKS where the container assumes a role.
@gswai123 @andibraeu @Stazis555 I believe this issue should be fixed in our nightly builds now.
I wasn't able to reproduce the exact issue you were having (environment variables, etc.), but there was a definite bug in that area, which has now been fixed.
Would it be possible to try the nightly builds to see if the issue is actually fixed for you?
Let me know if you have any questions.
To make the images easier to find, these would be the latest nightly images: https://hub.docker.com/r/timberio/vector/tags?page=1&name=nightly-2024-02-29 . It would be great to have validation that they fix the issue for you before we cut a v0.36.1 release 🙏
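If you deploy with the Vector Helm chart, pinning a nightly for a quick test is just an image override along these lines. This is a sketch under assumptions: the `image.repository`/`image.tag` value names follow the chart's conventions, and the variant suffix on the tag (`-debian` here) is a guess, so pick whichever variant you normally run from the tag list linked above.

```yaml
# Hypothetical Helm values override to test a nightly build; the variant
# suffix on the tag is an assumption -- use whichever image variant you
# normally deploy from the Docker Hub tag list linked above.
image:
  repository: timberio/vector
  tag: nightly-2024-02-29-debian
```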
@jszwedko @StephenWakely thanks for being so quick to release a fix. I'm testing out the latest nightly build now. I'll let you know if it fixes the issue in a bit!
The fix is working for me! We're no longer seeing any errors with the latest nightly build. Thank you!