
IRSA credentials failing for AWS MSK IAM in Fluent Bit 4.1.0 — STS AssumeRoleWithWebIdentity returns broken connection (HTTP Status: 0) and fallback to IMDS occurs

Open · ZokerG opened this issue 3 weeks ago · 2 comments

Bug Report

When running Fluent Bit 4.1.0 on Amazon EKS using IRSA (IAM Roles for Service Accounts) to authenticate to MSK IAM, Fluent Bit consistently fails during the call to:

STS AssumeRoleWithWebIdentity

The internal AWS credential provider logs always show:

broken connection to sts.us-east-1.amazonaws.com:443 (HTTP Status: 0) STS assume role request failed

After the failure, Fluent Bit incorrectly falls back to IMDSv2, and retrieves credentials from the EC2 node role, not from the pod’s IRSA role.

This results in invalid MSK OAuthBearer tokens and ultimately:

SASL authentication error: Access denied

Therefore, IRSA does not work with Fluent Bit for MSK IAM, even though the environment is correctly configured and STS is reachable externally.


🔄 Steps to Reproduce

1. Create Service Account with IRSA

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit-irsa-serviceaccount
  namespace: poc-ciam
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/kubernetes-pod-test
```

2. Inject AWS environment variables into Fluent Bit pod

```yaml
env:
  - name: AWS_ROLE_ARN
    value: arn:aws:iam::<ACCOUNT_ID>:role/kubernetes-pod-test
  - name: AWS_WEB_IDENTITY_TOKEN_FILE
    value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
  - name: AWS_REGION
    value: us-east-1
```
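When injecting these variables manually (the EKS pod identity webhook normally does this automatically), the projected token volume must also be mounted at the path referenced by `AWS_WEB_IDENTITY_TOKEN_FILE`. A sketch of what that volume looks like; the audience and expiry values here are assumptions based on the standard IRSA setup, not taken from our manifests:

```yaml
# Projected service account token volume (normally injected by the EKS
# pod identity webhook; shown here for a manual reproduction)
volumes:
  - name: aws-iam-token
    projected:
      sources:
        - serviceAccountToken:
            audience: sts.amazonaws.com
            expirationSeconds: 86400
            path: token
```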

3. Configure MSK IAM authentication

```
[OUTPUT]
    Name     kafka
    Brokers  b-1.example.amazonaws.com:9098,b-2.example.amazonaws.com:9098
    Topics   audit-filtered
    msk_iam  yes
```

4. Enable debug logs

log_level debug
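Putting the pieces together, the full reproduction configuration is roughly the following sketch (broker addresses and topic are the placeholders from the snippets above; the `Match` pattern is an assumption):

```
[SERVICE]
    log_level  debug

[OUTPUT]
    Name     kafka
    Match    *
    Brokers  b-1.example.amazonaws.com:9098,b-2.example.amazonaws.com:9098
    Topics   audit-filtered
    msk_iam  yes
```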

❌ Actual Behavior

1. Fluent Bit fails during STS AssumeRoleWithWebIdentity

```
[debug] [aws_credentials] Calling STS..
[debug] [http_client] not using http_proxy for header
[error] [http_client] broken connection to sts.us-east-1.amazonaws.com:443 ?
[debug] [aws_client] sts.us-east-1.amazonaws.com: http_do=-1, HTTP Status: 0
[debug] [aws_credentials] STS assume role request failed
```

2. Fluent Bit incorrectly falls back to IMDSv2

```
[debug] [aws_credentials] Init called on the EC2 IMDS provider
[debug] [aws_credentials] Requesting credentials for instance role eksctl-nodegroup-NodeInstanceRole
```

This behavior is incorrect when IRSA is configured.

3. MSK authentication ultimately fails

SASL authentication error: Access denied (state AUTH_REQ)

The OAuth token is being signed with the wrong AWS principal (EC2 node instead of IRSA role).


✔ Expected Behavior

  • Fluent Bit should successfully perform AssumeRoleWithWebIdentity using the service account token.

  • There should be no fallback to IMDS when IRSA is active.

  • MSK IAM authentication should work using pod-level AWS IAM credentials.


🔍 Additional Diagnostics Performed

We performed multiple environment-level tests to rule out network, TLS, DNS, and AWS STS connectivity issues.

1️⃣ Successful STS connectivity from test pod using curl

Running inside a pod using the same ServiceAccount:

curl -v https://sts.us-east-1.amazonaws.com/

The output shows:

  • TLS handshake works

  • STS responds with HTTP/1.1 302 Found

  • Certificate validation succeeds

This confirms the cluster CAN reach STS successfully.
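To also rule out a malformed web identity token, the projected token's claims (`aud`, `iss`, expiry) can be inspected without any AWS call. A minimal sketch; `decode_jwt_payload` is a hypothetical helper, the token path is the standard IRSA projection, and no signature verification is performed:

```shell
# decode_jwt_payload: print the JSON payload of a JWT without verifying it.
# Useful for checking the projected IRSA token's "aud" and "iss" claims.
decode_jwt_payload() {
    payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
    # base64url carries no padding; re-pad to a multiple of 4 before decoding
    while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
    printf '%s' "$payload" | base64 -d
}

# Typical usage inside the pod (standard IRSA token path):
# decode_jwt_payload "$(cat /var/run/secrets/eks.amazonaws.com/serviceaccount/token)"
```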


2️⃣ Successful Fluent Bit HTTP output to STS

We configured Fluent Bit with a temporary HTTP output:

```
[OUTPUT]
    Name        http
    Host        sts.us-east-1.amazonaws.com
    Port        443
    URI         /
    tls         on
    tls.verify  off
    Format      json
```

Result:

HTTP/1.1 302 Found

This demonstrates that:

✔ Fluent Bit’s HTTP client works

✔ TLS negotiation works

✔ External connectivity is fine

❗ Only the internal STS client inside `flb_aws_credentials_sts.c` fails to connect.

This strongly points to one of the following:

  • internal TLS handling bug

  • keep-alive connection reuse issue

  • missing SNI or TLS config problem

  • incorrect interaction with AWS STS endpoints


📂 Environment Details

| Component | Version |
| -- | -- |
| Fluent Bit | 4.1.0 |
| AWS EKS | 1.29 |
| MSK | IAM authentication enabled |
| IRSA | correctly configured and verified |
| curl test to STS | success |
| Fluent Bit HTTP output to STS | success |
| Internal STS AssumeRoleWithWebIdentity | fails |

📎 Recommended Evidence to Attach

We will attach:

  • Full Fluent Bit logs (with aws_credentials + http_client debug)

  • Curl output showing STS connectivity success

  • HTTP output plugin logs showing 302 from STS

  • IAM trust policy JSON

  • Kubernetes ServiceAccount manifest

  • Pod description showing injected AWS env vars

  • Screenshots of OIDC provider in AWS IAM console


🚀 Summary for Fluent Bit maintainers

The environment can reach STS with no issues (tested via curl and via Fluent Bit’s own HTTP output plugin).

Only the internal AWS credential provider inside Fluent Bit fails to connect to STS, returning HTTP Status 0.

This causes a fallback to IMDS, which breaks MSK IAM authentication.

As a result, IRSA cannot be used with MSK IAM in Fluent Bit 4.1.0.

For reference, our actual (templated) kafka output configuration:

```
[OUTPUT]
    Name                     kafka
    Match                    #{kafka-match}#
    Brokers                  #{aws-msk-bootstrap-servers}#
    Topics                   #{kafka-topic}#
    Format                   json
    aws_msk_iam              true
    aws_msk_iam_cluster_arn  #{aws-msk-iam-cluster-arn}#
```


ZokerG avatar Dec 04 '25 21:12 ZokerG

@ZokerG thanks for the detailed bug report.

Would you please try the patch provided in https://github.com/fluent/fluent-bit/pull/11256 ?

edsiper avatar Dec 04 '25 23:12 edsiper

@edsiper Hi, thank you for pushing the recent fix. We would like to confirm whether this fix is already included in Fluent Bit v4.2.1, because we tested directly with the fluent-bit:4.2.1 image and the issue is still occurring.

Here is what we observed during our tests:

The STS authentication flow inside Fluent Bit still fails, exactly as it did before the fix. Fluent Bit attempts to call sts:AssumeRoleWithWebIdentity, but the connection to sts.us-east-1.amazonaws.com:443 consistently results in:

```
[error] [http_client] broken connection to sts.us-east-1.amazonaws.com:443 ?
[debug] [aws_client] sts.us-east-1.amazonaws.com: http_do=-1, HTTP Status: 0
[debug] [aws_credentials] STS assume role request failed
```

After that, Fluent Bit falls back to IMDS, which should not happen for a pod running under IRSA.

We performed external validation to confirm that network connectivity to STS is working correctly:

From a test pod using the same ServiceAccount, running curl https://sts.us-east-1.amazonaws.com works and returns a 302 as expected.

We also configured an HTTP output in Fluent Bit pointing directly to STS, and that output works normally, also returning a 302.

Based on these tests, it appears that the internal Fluent Bit component responsible for performing the STS request is still experiencing the issue, even in version 4.2.1.

Could you please confirm whether the fix you mentioned has been included in the 4.2.1 build, or if we should test with a specific image or nightly build?

We are ready to test any build or commit you suggest and can provide detailed logs with log_level debug and rdkafka.debug=security,broker,protocol.
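For reference, the extra librdkafka debug output mentioned above can be enabled directly on the kafka output via librdkafka property passthrough. A sketch only; the broker and topic values are placeholders:

```
[OUTPUT]
    Name           kafka
    Match          *
    Brokers        b-1.example.amazonaws.com:9098
    Topics         audit-filtered
    msk_iam        yes
    rdkafka.debug  security,broker,protocol
```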

ZokerG avatar Dec 05 '25 14:12 ZokerG

@ZokerG you may want to test with my PR #11270. It fixes the confirmed TLS issue and the MSK auth failure that appeared after some time, and works for both provisioned and serverless MSK. Let me know if you face any issue with that build.

kalavt avatar Dec 12 '25 04:12 kalavt

You can find built artifacts in https://github.com/fluent/fluent-bit/actions/runs/20140187371?pr=11270. This page provides RPM/DEB packages with PR 11270 patches for various Linux distributions.

cosmo0920 avatar Dec 12 '25 05:12 cosmo0920

@cosmo0920 @kalavt Hi Fluent Bit maintainers,

We’d like to share an update regarding Issue #11255 and ask about the timeline for an official/public Fluent Bit image that includes the changes from PR #11270. 

Context / What we were seeing

We run Fluent Bit on Amazon EKS using IRSA to authenticate to AWS MSK IAM. In our setup, Fluent Bit failed during STS AssumeRoleWithWebIdentity (showing HTTP Status: 0 / broken connection), then fell back to IMDSv2 and used the EC2 node role, which ultimately caused MSK IAM authentication failures (SASL authentication error: Access denied). 

What we tested

Following guidance from the team, we tested PR #11270 (“aws_msk_iam: add AWS MSK IAM authentication support”). This PR includes broader MSK IAM/OAUTHBEARER improvements and also adds TLS support for AWS credential fetching, including STS, which seemed relevant to our failure mode. 

We:

1. Built a container image from the PR branch/commit,
2. Pushed it to our internal registry,
3. Deployed it into our application’s deployment (same cluster/config).

Result: With the PR #11270 image, the IRSA/MSK IAM issue stopped reproducing in our environment.

Request: official/public image availability

Due to internal compliance requirements, we can only run official public images (not custom images built from PR branches). Could you please confirm:

  • Whether PR #11270 is planned to be merged, and if so, which release line it is expected to land in (e.g., v4.2.1 or another patch/minor)? We noticed Issue #11255 is currently assigned to milestone Fluent Bit v4.2.1.

  • Once merged, when the official public container image with these changes will be available?

For reference, we see releases like v4.2.0 (Nov 12, 2025) and v4.1.2 (Dec 10, 2025) already published. 

Thanks again for the work on this; our testing indicates PR #11270 resolves the problem for our use case.

Best regards, Carlos Quintero

ZokerG avatar Dec 12 '25 05:12 ZokerG