helm-charts [fluent-bit] Liveness/Readiness probe fails when running on AWS EKS 1.20 with Bottlerocket OS

Open z0rc opened this issue 3 years ago • 5 comments

Cross reference to https://github.com/fluent/fluent-bit/issues/3521

Pods deployed by this chart fail their liveness probe when running on AWS EKS 1.20 with Bottlerocket OS. As far as I was able to understand, fluent-bit's http_server stops listening almost immediately after the pod starts, which leads to failed readiness/liveness probes. This happens only on nodes with Bottlerocket OS; regular nodes with Amazon Linux 2 run fluent-bit pods just fine.
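To see when the http_server stops answering, independently of the kubelet probes, something like the following could be used. This is only a rough sketch: it assumes the pod's HTTP port has been forwarded to localhost (for example with `kubectl port-forward`), and the port 2020 and path `/` are assumptions based on Fluent Bit's default http_server settings, so adjust them to your values. `probe_once` and `first_failure` are hypothetical helper names, not part of any chart or library.

```python
import http.client
import time
import urllib.error
import urllib.request


def probe_once(url, timeout=1.0):
    """Return True if an HTTP GET answers with a status in [200, 400),
    which is the range a kubelet HTTP probe treats as success."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, 4xx/5xx, or a broken response.
        return False


def first_failure(check, attempts, interval=1.0, sleep=time.sleep):
    """Call check() up to `attempts` times and return the 0-based index of
    the first failed attempt, or None if every attempt succeeded."""
    for i in range(attempts):
        if not check():
            return i
        if i + 1 < attempts:
            sleep(interval)
    return None


if __name__ == "__main__":
    # Assumption: port 2020 of the pod is forwarded to 127.0.0.1.
    url = "http://127.0.0.1:2020/"
    failed_at = first_failure(lambda: probe_once(url), attempts=60)
    if failed_at is None:
        print("http_server answered every request")
    else:
        print(f"http_server stopped answering at attempt {failed_at}")
```

Running it against a pod on a Bottlerocket node and against one on an Amazon Linux 2 node should show whether the listener disappears only on the former.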

To reproduce:

1. Run AWS EKS 1.20 with Bottlerocket nodes (https://docs.aws.amazon.com/eks/latest/userguide/launch-node-bottlerocket.html).
2. Deploy the fluent-bit helm chart with default values, except for a dummy input, a null output, and empty filters.
3. Observe how the fluent-bit pods go into CrashLoopBackOff.

May 20 '21 14:05 z0rc

As a workaround, you should be able to disable the probes until https://github.com/fluent/fluent-bit/issues/3521 is resolved.

https://github.com/bottlerocket-os/bottlerocket/issues/1628 is tracking a workaround on the Bottlerocket side. Downgrading to the Kubernetes 1.19 variant of Bottlerocket is also supposed to fix the issue.

Jun 30 '21 16:06 gabegorelick

Fixed with https://github.com/bottlerocket-os/bottlerocket/releases/tag/v1.1.3

But this is kind of a workaround: it makes kubelet's cpuManagerPolicy: none the default. With cpuManagerPolicy: static, the issue still persists.
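To check which policy a given node is actually running with, the kubelet's live configuration can be read through the API server, e.g. `kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz"`, and the field pulled out of the returned JSON. A minimal sketch, assuming the output of that command is passed in as a string; `cpu_manager_policy` is a hypothetical helper name:

```python
import json


def cpu_manager_policy(configz_json):
    """Extract cpuManagerPolicy from a kubelet /configz JSON document.

    Returns None if the field is missing rather than guessing a default.
    """
    doc = json.loads(configz_json)
    return doc.get("kubeletconfig", {}).get("cpuManagerPolicy")


if __name__ == "__main__":
    import sys

    # Usage (assumed): pipe the kubectl output into this script.
    policy = cpu_manager_policy(sys.stdin.read())
    print(f"cpuManagerPolicy: {policy}")
    if policy == "static":
        print("this node uses the static policy, where the issue is reported to persist")
```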

Jul 13 '21 10:07 z0rc

How do I disable the probes? "enabled: false" isn't an accepted value in the chart.

Nov 18 '21 16:11 richardFontaine

Fixed with https://github.com/bottlerocket-os/bottlerocket/releases/tag/v1.1.3

But this is kind of a workaround: it makes kubelet's cpuManagerPolicy: none the default. With cpuManagerPolicy: static, the issue still persists.

Agreed. I verified that this issue persists with Fluent Bit 1.1 and k8s 1.22 when cpuManagerPolicy is set to "static". This is just a temporary solution, not a permanent fix for the underlying issue.

Mar 24 '23 06:03 gengwg

I got the same issue (not on all nodes, but on some of them) running EKS 1.24 with the Bottlerocket AMI and Fluent Bit 2.0.1.

Oct 11 '23 17:10 trombini77
