fluent-bit icon indicating copy to clipboard operation
fluent-bit copied to clipboard

aws_credentials_http: Add support for EKS Pod Identities

Open iandrewt opened this issue 1 year ago • 8 comments

This PR initially includes two commits, as I had to patch flb_utils for IPv6 clusters to work.

This patch rewrites how the HTTP credentials provider works to allow both ECS and EKS identities to work. It is based on the aws-sdk-go-v2 implementation.

It validates that the endpoint is correct if the transport is HTTP, but does not support DNS resolution, however based on how the pod identity agent works today, DNS should not be needed. If the transport is HTTPS, which will not be the case when using EKS Pod Identities today, any endpoint is allowed. This is in line with how the AWS SDK works.

Similarly to the SDK, it also reads the authentication token environment variables, with the file taking precedence over the raw token variable.

This has been tested against an EKS 1.30 cluster with AL2023 nodes, but it would be great if someone else could test this too rather than just taking my word for it :smile:. I did test with an ECS cluster as well to ensure that did not break as a result of this change, but I don't generally use ECS, so if someone else could validate this that would be great.

Fixes #8550


Enter [N/A] in the box, if an item is not applicable to your change.

Testing Before we can approve your change; please submit the following in a comment:

  • [N/A] Example configuration file for the change - installed on EKS with the fluent-bit helm chart with an image override, with an S3 destination, with no annotation on the service account, a pod identity configured, and the node role not having permission to use S3 (I don't think I set the IMDS hop limit to 1 in my testing)
  • [x] Debug log output from testing the change
[2024/06/26 12:50:33] [ info] [output:s3:s3.0] Using upload size 100000000 bytes
[2024/06/26 12:50:33] [debug] [aws_credentials] Initialized Env Provider in standard chain
[2024/06/26 12:50:33] [debug] [aws_credentials] creating profile (null) provider
[2024/06/26 12:50:33] [debug] [aws_credentials] Initialized AWS Profile Provider in standard chain
[2024/06/26 12:50:33] [debug] [aws_credentials] Not initializing EKS provider because AWS_ROLE_ARN was not set
[2024/06/26 12:50:33] [debug] [aws_credentials] Configuring HTTP provider with http://[fd00:ec2::23]/v1/credentials
[2024/06/26 12:50:33] [debug] [aws_credentials] Initialized HTTP Provider in standard chain
[2024/06/26 12:50:33] [debug] [aws_credentials] Initialized EC2 Provider in standard chain
[2024/06/26 12:50:33] [debug] [aws_credentials] Sync called on the http provider
[2024/06/26 12:50:33] [debug] [aws_credentials] Sync called on the EC2 provider
[2024/06/26 12:50:33] [debug] [aws_credentials] Init called on the env provider
[2024/06/26 12:50:33] [debug] [aws_credentials] Init called on the profile provider
[2024/06/26 12:50:33] [debug] [aws_credentials] Reading shared config file.
[2024/06/26 12:50:33] [debug] [aws_credentials] Shared config file /root/.aws/config does not exist
[2024/06/26 12:50:33] [debug] [aws_credentials] Reading shared credentials file.
[2024/06/26 12:50:33] [debug] [aws_credentials] Shared credentials file /root/.aws/credentials does not exist
[2024/06/26 12:50:33] [debug] [aws_credentials] Init called on the http provider
[2024/06/26 12:50:33] [debug] [aws_credentials] loading auth token file
[2024/06/26 12:50:33] [debug] [http_client] not using http_proxy for header
[2024/06/26 12:50:34] [debug] [aws_credentials] upstream_set called on the http provider
[2024/06/26 12:50:34] [debug] [aws_credentials] upstream_set called on the EC2 provider
  • [x] Attached Valgrind output that shows no leaks or memory corruption was found Not sure how I'd test this in-cluster, but the tests for the credential provider should be evidence enough. If more test cases are needed I'm happy to write them
valgrind --leak-check=full ./bin/flb-it-aws_credentials_http
==726645== Memcheck, a memory error detector
==726645== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==726645== Using Valgrind-3.23.0 and LibVEX; rerun with -h for copyright info
==726645== Command: ./bin/flb-it-aws_credentials_http
==726645== 
Test test_http_provider...                      [2024/06/26 22:52:50] [ warn] [aws_credentials] Credential expiration '2025-10-24T23:00:23Z' is greater than 12 hours in the future. This should not be possible.
[2024/06/26 22:52:50] [ warn] [aws_credentials] Credential expiration '2025-10-24T23:00:23Z' is greater than 12 hours in the future. This should not be possible.
==726645== Warning: invalid file descriptor -1 in syscall close()
[ OK ]
==726645== Warning: invalid file descriptor -1 in syscall close()
Test test_http_provider_auth_token...           [2024/06/26 22:52:50] [ warn] [aws_credentials] Credential expiration '2025-10-24T23:00:23Z' is greater than 12 hours in the future. This should not be possible.
[2024/06/26 22:52:50] [ warn] [aws_credentials] Credential expiration '2025-10-24T23:00:23Z' is greater than 12 hours in the future. This should not be possible.
==726645== Warning: invalid file descriptor -1 in syscall close()
[ OK ]
==726645== Warning: invalid file descriptor -1 in syscall close()
Test test_local_http_provider...                [2024/06/26 22:52:50] [ warn] [aws_credentials] Credential expiration '2025-10-24T23:00:23Z' is greater than 12 hours in the future. This should not be possible.
[2024/06/26 22:52:50] [ warn] [aws_credentials] Credential expiration '2025-10-24T23:00:23Z' is greater than 12 hours in the future. This should not be possible.
==726645== Warning: invalid file descriptor -1 in syscall close()
[ OK ]
==726645== Warning: invalid file descriptor -1 in syscall close()
Test test_http_provider_error_case...           [2024/06/26 22:52:50] [ warn] [aws_credentials] No cached credentials are available and a credential refresh is already in progress. The current co-routine will retry.
[2024/06/26 22:52:50] [ warn] [aws_credentials] No cached credentials are available and a credential refresh is already in progress. The current co-routine will retry.
==726645== Warning: invalid file descriptor -1 in syscall close()
[ OK ]
==726645== Warning: invalid file descriptor -1 in syscall close()
Test test_http_provider_malformed_response...   [2024/06/26 22:52:51] [error] [aws_credentials] Could not parse credentials response - invalid JSON.
[2024/06/26 22:52:51] [ warn] [aws_credentials] No cached credentials are available and a credential refresh is already in progress. The current co-routine will retry.
[2024/06/26 22:52:51] [error] [aws_credentials] Could not parse credentials response - invalid JSON.
[2024/06/26 22:52:51] [ warn] [aws_credentials] No cached credentials are available and a credential refresh is already in progress. The current co-routine will retry.
[2024/06/26 22:52:51] [error] [aws_credentials] Could not parse credentials response - invalid JSON.
==726645== Warning: invalid file descriptor -1 in syscall close()
[ OK ]
==726645== Warning: invalid file descriptor -1 in syscall close()
SUCCESS: All unit tests have passed.
==726645== 
==726645== HEAP SUMMARY:
==726645==     in use at exit: 0 bytes in 0 blocks
==726645==   total heap usage: 8,637 allocs, 8,637 frees, 895,672 bytes allocated
==726645== 
==726645== All heap blocks were freed -- no leaks are possible
==726645== 
==726645== For lists of detected and suppressed errors, rerun with: -s
==726645== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • [N/A] Run local packaging test showing all targets (including any new ones) build.
  • [N/A] Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • [x] Documentation required for this feature

https://github.com/fluent/fluent-bit-docs/pull/1400

Backporting

  • [N/A] Backport to latest stable release. I'm fine with this being in a minor release instead of a patch, given the testing it should receive

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

iandrewt avatar Jun 26 '24 13:06 iandrewt

I suspect we probably need some docs updates for this as well, or at least a simple example of how to use it? Can we link a docs PR?

patrick-stephens avatar Jun 27 '24 13:06 patrick-stephens

I found this while searching for docs to update, but I can't see it referenced anywhere on docs.fluentbit.io. Do you want a link added to this page somewhere as well?

https://github.com/fluent/fluent-bit-docs/blob/43c4fe134611da471e706b0edb2f9acd7cdfdbc3/administration/aws-credentials.md

iandrewt avatar Jun 27 '24 21:06 iandrewt

I found this while searching for docs to update, but I can't see it referenced anywhere on docs.fluentbit.io. Do you want a link added to this page somewhere as well?

https://github.com/fluent/fluent-bit-docs/blob/43c4fe134611da471e706b0edb2f9acd7cdfdbc3/administration/aws-credentials.md

I think it was linked from the S3 or similar pages, I'm sure I found it before. It really should be a docs page in the index probably too.

patrick-stephens avatar Jun 28 '24 09:06 patrick-stephens

Added the AWS Credentials docs to the index: https://github.com/fluent/fluent-bit-docs/pull/1400

iandrewt avatar Jun 28 '24 12:06 iandrewt

Looks like @PettitWesley has been working on the same feature for the aws/aws-for-fluent-bit distro: https://github.com/PettitWesley/fluent-bit/pull/32

Happy if you want to wait for that PR instead, it seems to handle some things differently to how I implemented it. For instance, my implementation reads the auth token environment variables once at startup (file path, not the contents) rather than on each refresh, which is in line with how I saw it implemented in the AWS SDKs. However, mine might not handle tokens with \r\n correctly (this should be an error case)

In any case, I definitely learned a lot about how the credential chain works by implementing this.

iandrewt avatar Jun 28 '24 13:06 iandrewt

I've updated my PR with some improved error handling and rewritten tests that validate the unhappy path better, after looking at how my PR and @PettitWesley's branch differs. Running valgrind against the test cases reports no memory leaks, including fixing one that wasn't caught by the original tests I had written.

iandrewt avatar Jun 29 '24 06:06 iandrewt

That's all good, I'm fine with your changes. As I had mentioned earlier, I had made my own implementation of the feature before finding your branch, then updated mine to account the behavioural differences between the two.

Happy for this to be closed in favour of #9206 - I'll give it a whirl in my cluster too.

iandrewt avatar Aug 14 '24 04:08 iandrewt

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar Dec 15 '24 02:12 github-actions[bot]

@singholt should we close this one in favor of https://github.com/fluent/fluent-bit/pull/9206?

agup006 avatar Apr 22 '25 16:04 agup006

@singholt should we close this one in favor of #9206?

Yes, @agup006 !

singholt avatar Apr 29 '25 13:04 singholt