aws_s3 Source Failing to assume_role After Update to 0.36.0
### A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
### Problem
We just updated our Vector deployment to version 0.36.0, and a pipeline of ours that uses the `aws_s3` source with an `assume_role` granting access to an SQS queue and S3 bucket suddenly stopped being able to receive messages from the SQS queue. This pipeline had been running without issue for four months on previous Vector versions (the latest we used before this was 0.35.0), so it seems like a change introduced in 0.35.1 or later caused the problem.
Our other pipelines that don't rely on an `assume_role` are still working fine on the newest version, which leads us to believe something has changed in how `auth.assume_role` works in the `aws_s3` source. Does anyone have any ideas? We have a hunch it could be related to this change in 0.35.1, but aren't sure: https://github.com/vectordotdev/vector/commit/c2cc94a262ecf39798009d29751d59cc97baa0c5#diff-d6eef19144b594971c40fce6c9e73777c346036e50277c3874a665ce31bccfb8
Debug logs show an issue we've never seen before re: environment variables, which seems like the culprit: `environment variable not set (CredentialsNotLoaded(CredentialsNotLoaded { source: "environment variable not set" }))`
### Configuration
```yaml
our_source_name:
  type: aws_s3
  compression: auto
  region: eu-central-1
  sqs:
    delete_message: true
    queue_url: source_sqs_queue_url
  auth:
    assume_role: assume_role_arn
    region: "us-east-1"
```
### Version
vector 0.36.0
### Debug Output
```text
2024-02-14T21:56:29.202231Z DEBUG assume_role:provide_credentials{provider=default_chain}: aws_config::meta::credentials::chain: loaded credentials provider=WebIdentityToken
2024-02-14T21:56:29.202336Z DEBUG assume_role: aws_config::sts::assume_role: retrieving assumed credentials
2024-02-14T21:56:29.202384Z DEBUG assume_role:provide_credentials{provider=default_chain}: aws_config::meta::credentials::chain: provider in chain did not provide credentials provider=Environment context=the credential provider was not enabled: environment variable not set (CredentialsNotLoaded(CredentialsNotLoaded { source: "environment variable not set" }))
2024-02-14T21:56:29.202401Z DEBUG assume_role:provide_credentials{provider=default_chain}: aws_config::meta::credentials::chain: provider in chain did not provide credentials provider=Profile context=the credential provider was not enabled: No profiles were defined (CredentialsNotLoaded(CredentialsNotLoaded { source: NoProfilesDefined }))
```

### Example Data
No response
### Additional Context
Here are the error logs we're seeing related to this issue:

```text
vector::internal_events::aws_sqs: Failed to fetch SQS events. error=dispatch failure error_code="failed_fetching_sqs_events" error_type="request_failed" stage="receiving" internal_log_rate_limit=true
```
### References
No response
For the time being, I've rolled back to Vector version 0.35.0 for this pipeline, and that has resolved the issue. I would like to be able to use some of the features in 0.36.0, though, so it would be nice to get this problem sorted out. Thanks!
I asked this in Discord, but asking here too in case others stumble upon this issue and can fill in some of the blanks:
- How are you providing the credentials used to assume the role? Environment variables? Instance profile (IMDSv2)? Something else?
- It seems like this is just a snippet of the logs. Would it be possible to provide the full set?
Thanks!
Hey @jszwedko! So our Vector pipelines run in Kubernetes, and each one is deployed with a service account that has been configured to assume a main Vector role in AWS that has access to all of the various AWS resources. However, in the case of this specific pipeline, we must assume a role in another account to access the specific S3 bucket that the logs are in. The main Vector AWS role has been granted access to assume this cross-account role and access the logs. This has worked as expected up until this newest update.
Also re: the debug logs: the full set of logs was just that snippet I pasted repeated over and over again. I can try to update the deployment version again to replicate the log set and post more of them if that's useful though. Thanks!
Gotcha, thanks! Just to confirm are you deploying in EKS and using https://docs.aws.amazon.com/eks/latest/userguide/pod-configuration.html to configure a service role for each pod?
Exactly!
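For anyone else landing here with a similar setup, the wiring described above is roughly the following. This is only a minimal sketch, not the reporter's actual manifests: the names, namespace, and role ARN are placeholders, and the annotation is the standard IRSA mechanism EKS uses to associate an IAM role with a pod's service account.

```yaml
# Sketch of the IRSA setup described above (names and ARNs are placeholders).
# The pod's service account is annotated with the "main" Vector role; the
# aws_s3 source's auth.assume_role then hops to the cross-account role that
# owns the S3 bucket and SQS queue.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: vector
  namespace: logging
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111111111111:role/vector-main-role
```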
+1, this happens to me as well. I can't verify the debug log, but downgrading to 0.35 fixed the issue. The setup is basically the same as gswai123's.
We have the same issue using aws_s3 as a sink, running a Vector pod in EKS where the container assumes a role.
@gswai123 @andibraeu @Stazis555 I believe this issue should be fixed in our nightly builds now.
I wasn't able to reproduce the exact issue you were having (environment variables, etc.), but there was a definite bug in that area, which has now been fixed.
Would it be possible to try the nightly builds to see if the issue is actually fixed for you?
Let me know if you have any questions.
To make the images easier to find, these would be the latest nightly images: https://hub.docker.com/r/timberio/vector/tags?page=1&name=nightly-2024-02-29 . It would be great to have validation that they fix the issue for you before we cut a v0.36.1 release 🙏
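If you deploy with the Vector Helm chart, pinning a nightly for a quick test is just an image override along these lines. This is a sketch under assumptions: the `image.repository`/`image.tag` value names follow the chart's conventions, and the variant suffix on the tag (`-debian` here) is a guess, so pick whichever variant you normally run from the tag list linked above.

```yaml
# Hypothetical Helm values override to test a nightly build; the variant
# suffix on the tag is an assumption -- use whichever image variant you
# normally deploy from the Docker Hub tag list linked above.
image:
  repository: timberio/vector
  tag: nightly-2024-02-29-debian
```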
@jszwedko @StephenWakely thanks for being so quick to release a fix. I'm testing out the latest nightly build now. I'll let you know if it fixes the issue in a bit!
The fix is working for me! We're no longer seeing any errors with the latest nightly build. Thank you!