[fluent-bit] Liveness/Readiness probe fails when running on AWS EKS 1.20 with Bottlerocket OS

Open z0rc opened this issue 3 years ago • 5 comments

Cross reference to https://github.com/fluent/fluent-bit/issues/3521

Pods deployed by this chart fail the liveness probe when running on AWS EKS 1.20 with Bottlerocket OS. As far as I was able to understand, fluent-bit's http_server stops listening almost immediately after pod start, which leads to failed readiness/liveness probes. This happens only on nodes with Bottlerocket OS; regular nodes with Amazon Linux 2 run fluent-bit pods just fine.

To reproduce:

  • Run AWS EKS 1.20 with Bottlerocket nodes https://docs.aws.amazon.com/eks/latest/userguide/launch-node-bottlerocket.html
  • Deploy the fluent-bit helm chart with default values, except for a dummy input, a null output, and empty filters
  • Observe how fluent-bit pods go into CrashLoopBackOff
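
For reference, the values override described in the second step might look roughly like the sketch below. It assumes the chart exposes its pipeline configuration under `config.inputs`, `config.filters`, and `config.outputs` as multi-line strings; check the key names against the `values.yaml` of the chart version in use. Everything else, including the `[SERVICE]` section that enables the HTTP server, is left at the chart defaults.

```yaml
# Hypothetical values override for the reproduction (key names assumed
# from the chart's values.yaml; verify against your chart version).
config:
  # Generate synthetic records instead of tailing container logs.
  inputs: |
    [INPUT]
        Name dummy

  # No filters at all.
  filters: ""

  # Discard every record.
  outputs: |
    [OUTPUT]
        Name  null
        Match *
```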

z0rc avatar May 20 '21 14:05 z0rc

As a workaround, you should be able to disable the probes until https://github.com/fluent/fluent-bit/issues/3521 is resolved.

https://github.com/bottlerocket-os/bottlerocket/issues/1628 is tracking a workaround on the Bottlerocket side. Downgrading to Bottlerocket 1.19 is also supposed to fix the issue.

gabegorelick avatar Jun 30 '21 16:06 gabegorelick

Fixed with https://github.com/bottlerocket-os/bottlerocket/releases/tag/v1.1.3

But this is kind of a workaround: it makes kubelet's cpuManagerPolicy: none the default. With cpuManagerPolicy: static, the issue still persists.
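
For context, cpuManagerPolicy is a field of the kubelet configuration. In a plain KubeletConfiguration file, the setting that still triggers the issue would look like the fragment below. This is only a sketch of the upstream kubelet format; Bottlerocket manages kubelet settings through its own settings mechanism, so the place where you set this on a Bottlerocket node will differ.

```yaml
# Fragment of a kubelet configuration file (upstream Kubernetes format).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# "none" is the kubelet default; per this thread, the probes
# still fail when the policy is "static".
cpuManagerPolicy: static
```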

z0rc avatar Jul 13 '21 10:07 z0rc

How do I disable the probes? "enabled: false" isn't an accepted value in the chart

richardFontaine avatar Nov 18 '21 16:11 richardFontaine

> Fixed with https://github.com/bottlerocket-os/bottlerocket/releases/tag/v1.1.3
>
> But this is kinda workaround, by making kubelet's cpuManagerPolicy: none as default. With cpuManagerPolicy: static, the issue still persists.

Agreed. I verified that this issue persists with fluentbit 1.1 and k8s 1.22 when cpuManagerPolicy is set to "static". The Bottlerocket change is just a temporary solution, not a permanent fix for the underlying issue.

gengwg avatar Mar 24 '23 06:03 gengwg

I got the same issue (not on all nodes, but on some of them) running EKS 1.24 with the Bottlerocket AMI and Fluentbit 2.0.1.

trombini77 avatar Oct 11 '23 17:10 trombini77