IRSA credentials failing for AWS MSK IAM in Fluent Bit 4.1.0 — STS AssumeRoleWithWebIdentity returns a broken connection (HTTP Status: 0) and Fluent Bit falls back to IMDS
Bug Report
When running Fluent Bit 4.1.0 on Amazon EKS using IRSA (IAM Roles for Service Accounts) to authenticate to MSK IAM, Fluent Bit consistently fails during the call to `sts:AssumeRoleWithWebIdentity`.
The internal AWS credential provider logs always show:
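These are the lines we consistently see (the same pattern is quoted again in the comments below):

```
[error] [http_client] broken connection to sts.us-east-1.amazonaws.com:443
[debug] [aws_client] sts.us-east-1.amazonaws.com: http_do=-1, HTTP Status: 0
[debug] [aws_credentials] STS assume role request failed
```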
After the failure, Fluent Bit incorrectly falls back to IMDSv2, and retrieves credentials from the EC2 node role, not from the pod’s IRSA role.
This results in invalid MSK OAuthBearer tokens and, ultimately, MSK SASL authentication failures (Access denied).
Therefore, IRSA does not work with Fluent Bit for MSK IAM, even though the environment is correctly configured and STS is reachable externally.
🔄 Steps to Reproduce
1. Create Service Account with IRSA
2. Inject AWS environment variables into Fluent Bit pod
3. Configure MSK IAM authentication
4. Enable debug logs (a configuration sketch covering these steps follows below)
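For reference, a minimal sketch of steps 1–2 (namespace, account ID, and role name are placeholders). With the annotated ServiceAccount in place, IRSA injects `AWS_ROLE_ARN` and `AWS_WEB_IDENTITY_TOKEN_FILE` into pods that use it; step 3 is the kafka `[OUTPUT]` block with `aws_msk_iam true` quoted at the end of this report, and step 4 is `log_level debug` under `[SERVICE]`.

```yaml
# Step 1: ServiceAccount annotated for IRSA (role ARN is a placeholder)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: logging
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/<FLUENT_BIT_IRSA_ROLE>
```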
❌ Actual Behavior
1. Fluent Bit fails during STS AssumeRoleWithWebIdentity
2. Fluent Bit incorrectly falls back to IMDSv2
This behavior is incorrect when IRSA is configured.
3. MSK authentication ultimately fails
The OAuth token is being signed with the wrong AWS principal (EC2 node instead of IRSA role).
✔ Expected Behavior
- Fluent Bit should successfully perform AssumeRoleWithWebIdentity using the service account token.
- There should be no fallback to IMDS when IRSA is active.
- MSK IAM authentication should work using pod-level AWS IAM credentials.
🔍 Additional Diagnostics Performed
We performed multiple environment-level tests to rule out network, TLS, DNS and AWS STS connectivity issues.
1️⃣ Successful STS connectivity from test pod using curl
Running inside a pod using the same ServiceAccount:
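The check was a plain request from the test pod, roughly as follows (verbose output is what surfaces the handshake and certificate details listed below):

```sh
curl -v https://sts.us-east-1.amazonaws.com
```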
The output shows:
- TLS handshake works
- STS responds with `HTTP/1.1 302 Found`
- Certificate validation succeeds
This confirms the cluster CAN reach STS successfully.
2️⃣ Successful Fluent Bit HTTP output to STS
We configured Fluent Bit with a temporary HTTP output:
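The temporary output was along these lines (a sketch; the Match pattern and URI are placeholders, the point being only to exercise Fluent Bit's own HTTP/TLS client against the STS endpoint):

```
[OUTPUT]
    Name        http
    Match       *
    Host        sts.us-east-1.amazonaws.com
    Port        443
    URI         /
    tls         On
    tls.verify  On
```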
Result: the request reaches STS and returns the expected HTTP 302, the same response seen in the curl test.
This demonstrates that:
✔ Fluent Bit’s HTTP client works
✔ TLS negotiation works
✔ External connectivity is fine
❗ Only the internal STS client inside `flb_aws_credentials_sts.c` fails to connect.
This strongly points to one of the following (a quick SNI/TLS check is sketched below):

- an internal TLS handling bug
- a keep-alive connection reuse issue
- a missing SNI or TLS configuration problem
- an incorrect interaction with the AWS STS endpoints
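As a quick way to rule the SNI hypothesis in or out from a test pod, an explicit handshake with the server name set can be inspected (a diagnostic sketch, independent of Fluent Bit):

```sh
# With SNI: the handshake should complete and present the STS certificate chain
openssl s_client -connect sts.us-east-1.amazonaws.com:443 \
  -servername sts.us-east-1.amazonaws.com </dev/null
```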
📂 Environment Details

- Fluent Bit: 4.1.0
- Platform: Amazon EKS, authentication via IRSA (IAM Roles for Service Accounts)
- Destination: Amazon MSK with IAM authentication (kafka output with aws_msk_iam)
- Region / STS endpoint: us-east-1 (sts.us-east-1.amazonaws.com)
📎 Recommended Evidence to Attach
We will attach:
- Full Fluent Bit logs (with aws_credentials + http_client debug)
- Curl output showing STS connectivity success
- HTTP output plugin logs showing 302 from STS
- IAM trust policy JSON
- Kubernetes ServiceAccount manifest
- Pod description showing injected AWS env vars
- Screenshots of OIDC provider in AWS IAM console
🚀 Summary for Fluent Bit maintainers
The environment can reach STS with no issues (tested via curl and via Fluent Bit’s own HTTP output plugin).
Only the internal AWS credential provider inside Fluent Bit fails to connect to STS, returning HTTP Status 0. This causes a fallback to IMDS, which breaks MSK IAM authentication.
As a result, IRSA cannot be used with MSK IAM in Fluent Bit 4.1.0.
For reference, the kafka output configuration used:

```
[OUTPUT]
    Name                     kafka
    Match                    #{kafka-match}#
    Brokers                  #{aws-msk-bootstrap-servers}#
    Topics                   #{kafka-topic}#
    Format                   json
    aws_msk_iam              true
    aws_msk_iam_cluster_arn  #{aws-msk-iam-cluster-arn}#
```
@ZokerG thanks for the detailed bug report.
Would you please try the patch provided in https://github.com/fluent/fluent-bit/pull/11256 ?
@edsiper Hi, thank you for pushing the recent fix. We would like to confirm whether this fix is already included in Fluent Bit v4.2.1, because we tested directly with the fluent-bit:4.2.1 image and the issue is still occurring.
Here is what we observed during our tests:
The STS authentication flow inside Fluent Bit still fails, exactly as it did before the fix. Fluent Bit attempts to call sts:AssumeRoleWithWebIdentity, but the connection to sts.us-east-1.amazonaws.com:443 consistently results in:
```
[error] [http_client] broken connection to sts.us-east-1.amazonaws.com:443
[debug] [aws_client] sts.us-east-1.amazonaws.com: http_do=-1, HTTP Status: 0
[debug] [aws_credentials] STS assume role request failed
```
After that, Fluent Bit falls back to IMDS, which should not happen for a pod running under IRSA.
We performed external validation to confirm that network connectivity to STS is working correctly:
From a test pod using the same ServiceAccount, running curl https://sts.us-east-1.amazonaws.com works and returns a 302 as expected.
We also configured an HTTP output in Fluent Bit pointing directly to STS, and that output works normally, also returning a 302.
Based on these tests, it appears that the internal Fluent Bit component responsible for performing the STS request is still experiencing the issue, even in version 4.2.1.
Could you please confirm whether the fix you mentioned has been included in the 4.2.1 build, or if we should test with a specific image or nightly build?
We are ready to test any build or commit you suggest and can provide detailed logs with log_level debug and rdkafka.debug=security,broker,protocol.
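For completeness, a sketch of how we would enable those debug settings (the Match pattern is a placeholder; `rdkafka.*` keys are passed through to librdkafka by the kafka output):

```
[SERVICE]
    log_level debug

[OUTPUT]
    Name           kafka
    Match          *
    rdkafka.debug  security,broker,protocol
```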
@ZokerG you may want to test with my PR #11270. It is confirmed to fix the TLS issue and the MSK auth failure that appeared after a certain time, and it works for both provisioned and serverless MSK. Let me know if you face any issues with that build.
You can find built artifacts in https://github.com/fluent/fluent-bit/actions/runs/20140187371?pr=11270. This page provides RPM/DEB packages with PR 11270 patches for various Linux distributions.
@cosmo0920 @kalavt Hi Fluent Bit maintainers,
We’d like to share an update regarding Issue #11255 and ask about the timeline for an official/public Fluent Bit image that includes the changes from PR #11270. 
Context / What we were seeing
We run Fluent Bit on Amazon EKS using IRSA to authenticate to AWS MSK IAM. In our setup, Fluent Bit failed during STS AssumeRoleWithWebIdentity (showing HTTP Status: 0 / broken connection), then fell back to IMDSv2 and used the EC2 node role, which ultimately caused MSK IAM authentication failures (SASL authentication error: Access denied). 
What we tested
Following guidance from the team, we tested PR #11270 (“aws_msk_iam: add AWS MSK IAM authentication support”). This PR includes broader MSK IAM/OAUTHBEARER improvements and also adds TLS support for AWS credential fetching, including STS, which seemed relevant to our failure mode. 
We:
1. Built a container image from the PR branch/commit (see the build sketch below)
2. Pushed it to our internal registry
3. Deployed it into our application's deployment (same cluster/config)
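A rough sketch of the build step (the Dockerfile path, registry, and tag are assumptions; adjust to your own build pipeline):

```sh
# Fetch the PR head as a local branch and build an image from it
git clone https://github.com/fluent/fluent-bit.git
cd fluent-bit
git fetch origin pull/11270/head:pr-11270
git checkout pr-11270
docker build -f dockerfiles/Dockerfile -t <internal-registry>/fluent-bit:pr-11270 .
docker push <internal-registry>/fluent-bit:pr-11270
```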
Result: With the PR #11270 image, the IRSA/MSK IAM issue stopped reproducing in our environment.
Request: official/public image availability
Due to internal compliance requirements, we can only run official public images (not custom images built from PR branches). Could you please confirm:
- If PR #11270 is planned to be merged, and if so, which release line it is expected to land in (e.g., v4.2.1 or another patch/minor)?
- We noticed Issue #11255 is currently assigned to milestone Fluent Bit v4.2.1.
- Once merged, when would the official public container image be available with these changes?
For reference, we see releases like v4.2.0 (Nov 12, 2025) and v4.1.2 (Dec 10, 2025) already published. 
Thanks again for the work on this; our testing indicates PR #11270 resolves the problem for our use case.
Best regards, Carlos Quintero