
Frequent 'kubelet upstream connection errors' during startup

Open rtalipov opened this issue 1 year ago • 1 comments

Bug Report

Fluent-bit is configured to use kubelet to get metadata

When a new node starts and the kubelet is not yet ready to accept connections, fluent-bit frequently logs the following errors:

[error] [tls] error: unexpected EOF
[error] [filter:kubernetes:kubernetes.1] kubelet upstream connection error

To Reproduce

Example 1: fluent-bit is scheduled on a new node and tries to connect to the kubelet while the CNI is not yet ready. Over 13 seconds, the 'kubelet upstream connection error' and '[tls] error: unexpected EOF' logs are generated ~7,000 times:

[2024/05/24 07:34:21] [error] [tls] error: unexpected EOF
[2024/05/24 07:34:21] [error] [filter:kubernetes:kubernetes.1] kubelet upstream connection error
[...]
[2024/05/24 07:34:33] [error] [filter:kubernetes:kubernetes.1] kubelet upstream connection error
[2024/05/24 07:34:34] [error] [tls] error: unexpected EOF 

Example 2: a new node is starting, and fluent-bit tries to connect to the kubelet before its serving certificate has been issued. For each connection attempt the kubelet logs the error 'no serving certificate available for the kubelet':

Jun  4 04:01:41 ip-A-B-C-D.eu-central-1.compute.internal kernel: process '/fluent-bit/bin/fluent-bit' started with executable stack
Jun  4 04:01:41 ip-A-B-C-D.eu-central-1.compute.internal kubelet: I0604 04:01:41.752427    3438 log.go:194] http: TLS handshake error from 127.0.0.1:53014: no serving certificate available for the kubelet
[...]
Jun  4 04:01:42 ip-A-B-C-D.eu-central-1.compute.internal kubelet: I0604 04:01:42.983070    3438 csr.go:261] certificate signing request csr-b7bnj is approved, waiting to be issued
[...]
Jun  4 04:01:43 ip-A-B-C-D.eu-central-1.compute.internal kubelet: I0604 04:01:43.041195    3438 log.go:194] http: TLS handshake error from 127.0.0.1:58166: no serving certificate available for the kubelet
Jun  4 04:01:43 ip-A-B-C-D.eu-central-1.compute.internal kubelet: I0604 04:01:43.096856    3438 csr.go:257] certificate signing request csr-b7bnj is issued

As the kubelet logs above show, it took about 2 seconds to approve the CSR and issue the kubelet certificate. In that window, 577 'kubelet upstream connection error' logs were generated.

Expected behavior: fluent-bit should not retry the kubelet connection so aggressively and generate so many error logs. After an unsuccessful attempt it should delay the next connection by 1 second, giving the kubelet and CNI time to become ready.

Your Environment

  • Version used: v3.0.1, v2.2.2
  • Configuration: Fluent-bit is configured to use kubelet to get metadata
  • Environment name and version: EKS v1.26
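
For reference, the kubelet-based metadata lookup described above is enabled via the kubernetes filter roughly like this (values are illustrative, not the reporter's exact config; `Use_Kubelet` and `Kubelet_Port` are documented options of the filter):

```
[FILTER]
    Name          kubernetes
    Match         kube.*
    Use_Kubelet   On
    Kubelet_Port  10250
```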

Additional context: these error logs are forwarded to the logging server and consume considerable space in large, dynamic clusters.

rtalipov avatar Jun 05 '24 09:06 rtalipov

I see this error with fluent-bit 3.0.7 but not with 3.0.4. My configuration is identical across the two versions.

pallasathena92 avatar Jun 28 '24 22:06 pallasathena92

Also experiencing this issue with fluent-bit v3.1.4 and EKS v1.29 when Use_Kubelet is set to On in the kubernetes filter.

headj-origami avatar Aug 20 '24 14:08 headj-origami

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] avatar Nov 19 '24 02:11 github-actions[bot]

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar Nov 24 '24 02:11 github-actions[bot]

Bump, looks like this is still happening with AWS for Fluent Bit Container Image Version 2.32.4 (Fluent Bit v1.9.10) and default fluent-bit config from https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-logs-FluentBit.html

alexivanov avatar Feb 14 '25 14:02 alexivanov

Still an error with fluent-bit version 4.0.1.

discostur avatar May 22 '25 17:05 discostur

This actually generates an enormous log volume because there is a feedback loop: fluent-bit fails to reach the kubelet, logs an error about it, then ingests and attempts to parse that error log, which triggers another error. Essentially thousands of logs are generated at node start-up.

We're going to attempt to resolve this with an init container on the fluent-bit pod and some form of the following:

TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -k -H "Authorization: Bearer $TOKEN" https://127.0.0.1:10250/healthz
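
A sketch of what that init-container entrypoint might look like with a retry loop, so fluent-bit only starts once the kubelet answers /healthz. The function name, attempt limit, and 1-second delay are assumptions for illustration; the token path and port 10250 are the conventional defaults and may differ per distribution:

```shell
#!/bin/sh
# Hypothetical init-container script: poll the kubelet health endpoint
# with a 1s delay between attempts instead of hammering it.
TOKEN_FILE=/var/run/secrets/kubernetes.io/serviceaccount/token
KUBELET_HEALTHZ=https://127.0.0.1:10250/healthz

# wait_for_kubelet PROBE_CMD MAX_ATTEMPTS
# Runs PROBE_CMD once per second until it succeeds or MAX_ATTEMPTS is reached.
wait_for_kubelet() {
    probe=$1
    max=$2
    i=0
    until $probe; do
        i=$((i + 1))
        if [ "$i" -ge "$max" ]; then
            return 1        # kubelet never became ready; fail the init container
        fi
        sleep 1             # back off between attempts
    done
    return 0
}

# Real probe: succeeds only when the kubelet serves /healthz over TLS.
probe_kubelet() {
    curl -sfk -H "Authorization: Bearer $(cat "$TOKEN_FILE")" "$KUBELET_HEALTHZ" >/dev/null
}

# In the init container, something like:
# wait_for_kubelet probe_kubelet 60
```

The loop bounds the wait (here ~60s) so a permanently broken kubelet still surfaces as a failed init container rather than an infinite hang.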

zrice57 avatar Sep 03 '25 21:09 zrice57