Frequent 'kubelet upstream connection errors' during startup
Bug Report
Fluent Bit is configured to use the kubelet to get metadata.
When a new node starts and the kubelet is not yet ready to accept connections, Fluent Bit logs the following errors at a very high rate:
[error] [tls] error: unexpected EOF
[error] [filter:kubernetes:kubernetes.1] kubelet upstream connection error
To Reproduce
Example 1
Fluent Bit is scheduled on a new node and tries to connect to the kubelet before the CNI is ready. Over 13 seconds, the 'kubelet upstream connection error' and '[tls] error: unexpected EOF' messages are logged about 7,000 times:
[2024/05/24 07:34:21] [error] [tls] error: unexpected EOF
[2024/05/24 07:34:21] [error] [filter:kubernetes:kubernetes.1] kubelet upstream connection error
[...]
[2024/05/24 07:34:33] [error] [filter:kubernetes:kubernetes.1] kubelet upstream connection error
[2024/05/24 07:34:34] [error] [tls] error: unexpected EOF
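The rate above can be checked against a saved Fluent Bit log. This is a minimal sketch, assuming every log line starts with a bracketed `[YYYY/MM/DD HH:MM:SS]` timestamp as in the excerpt. The function name `count_kubelet_errors_per_second` is illustrative and not part of Fluent Bit.

```shell
# Count 'kubelet upstream connection error' lines per second.
# Reads Fluent Bit log lines on stdin; assumes each line begins with
# a "[YYYY/MM/DD HH:MM:SS]" timestamp, as in the excerpt above.
count_kubelet_errors_per_second() {
    awk '
        /kubelet upstream connection error/ {
            ts = substr($0, 2, 19)   # the 19 characters after the leading "["
            count[ts]++
        }
        END {
            # One output line per second: "<timestamp> <number of errors>"
            for (ts in count) print ts, count[ts]
        }
    ' | sort
}
```

It could be run as `count_kubelet_errors_per_second < fluent-bit.log`, where `fluent-bit.log` is a hypothetical file holding the captured output.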
Example 2
A new node is starting, and Fluent Bit tries to connect to the kubelet before its serving certificate has been issued.
For each connection attempt, the kubelet logs the error 'no serving certificate available for the kubelet':
Jun 4 04:01:41 ip-A-B-C-D.eu-central-1.compute.internal kernel: process '/fluent-bit/bin/fluent-bit' started with executable stack
Jun 4 04:01:41 ip-A-B-C-D.eu-central-1.compute.internal kubelet: I0604 04:01:41.752427 3438 log.go:194] http: TLS handshake error from 127.0.0.1:53014: no serving certificate available for the kubelet
[...]
Jun 4 04:01:42 ip-A-B-C-D.eu-central-1.compute.internal kubelet: I0604 04:01:42.983070 3438 csr.go:261] certificate signing request csr-b7bnj is approved, waiting to be issued
[...]
Jun 4 04:01:43 ip-A-B-C-D.eu-central-1.compute.internal kubelet: I0604 04:01:43.041195 3438 log.go:194] http: TLS handshake error from 127.0.0.1:58166: no serving certificate available for the kubelet
Jun 4 04:01:43 ip-A-B-C-D.eu-central-1.compute.internal kubelet: I0604 04:01:43.096856 3438 csr.go:257] certificate signing request csr-b7bnj is issued
According to the kubelet logs above, it takes about 2 seconds to approve the CSR and issue the kubelet certificate. During that time, 577 'kubelet upstream connection error' logs were generated.
Expected behavior
Fluent Bit should not retry the kubelet connection so aggressively or generate so many error logs. After an unsuccessful attempt, it should wait 1 second before reconnecting, to give the kubelet and the CNI time to become ready.
Your Environment
- Version used: v3.0.1, v2.2.2
- Configuration: Fluent-bit is configured to use kubelet to get metadata
- Environment name and version: EKS v1.26
Additional context
These error logs are forwarded to the logging server and take up a lot of space in large, dynamic clusters.
I get this error with Fluent Bit 3.0.7, but not with Fluent Bit 3.0.4. My config is identical for both versions.
Also experiencing this issue with Fluent Bit v3.1.4 and EKS v1.29 when Use_Kubelet is set to On in the kubernetes filter.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.
Bump, looks like this is still happening with AWS for Fluent Bit Container Image Version 2.32.4 (Fluent Bit v1.9.10) and default fluent-bit config from https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-logs-FluentBit.html
Still seeing this error with Fluent Bit 4.0.1.
This actually generates an enormous log volume because of a feedback loop: Fluent Bit fails to reach the kubelet, logs an error about not being able to reach it, then ingests and attempts to parse that error, which triggers another error. As a result, thousands of logs are generated at node start-up.
We're going to attempt to resolve this with an init container on fluent bit and some form of the following:
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -k -H "Authorization: Bearer $TOKEN" https://127.0.0.1:10250/healthz
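One form this could take is a retry loop around the same curl probe, so the init container only exits once the kubelet answers. This is a minimal sketch, not a tested fix. The function names and the `KUBELET_URL`, `TOKEN_FILE`, `MAX_ATTEMPTS` and `RETRY_DELAY` variables are hypothetical and are not Fluent Bit settings. The 1-second default delay mirrors the delay suggested in the report.

```shell
#!/bin/sh
# Hypothetical init-container helper: block until the kubelet answers /healthz.
# All variable names below are illustrative defaults, not Fluent Bit options.
KUBELET_URL="${KUBELET_URL:-https://127.0.0.1:10250/healthz}"
TOKEN_FILE="${TOKEN_FILE:-/var/run/secrets/kubernetes.io/serviceaccount/token}"
MAX_ATTEMPTS="${MAX_ATTEMPTS:-60}"
RETRY_DELAY="${RETRY_DELAY:-1}"

# Single probe: succeeds only if the TLS handshake completes and the
# kubelet replies with a non-error HTTP status (-f makes curl fail on >= 400).
probe_kubelet() {
    token=$(cat "$TOKEN_FILE") || return 1
    curl -k -s -f -o /dev/null --max-time 2 \
        -H "Authorization: Bearer $token" "$KUBELET_URL"
}

# Retry the probe with a fixed delay between attempts.
# Returns 0 once the probe succeeds, 1 after MAX_ATTEMPTS failures.
wait_for_kubelet() {
    attempt=1
    while [ "$attempt" -le "$MAX_ATTEMPTS" ]; do
        if probe_kubelet; then
            echo "kubelet ready after $attempt attempt(s)"
            return 0
        fi
        sleep "$RETRY_DELAY"
        attempt=$((attempt + 1))
    done
    echo "kubelet not ready after $MAX_ATTEMPTS attempts" >&2
    return 1
}
```

The init container's command would then source this script and run `wait_for_kubelet || exit 1`, so the Fluent Bit container starts only after the probe has succeeded. The probe assumes the pod can reach the kubelet on 127.0.0.1 (for example with host networking) and that its service account is allowed to query the kubelet.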