Observability into FluentBit Failures
Is your feature request related to a problem? Please describe.
Say you're running a FluentBit version prior to v1.9.3 and you use `-` in certain key names. This causes a segmentation fault (SIGSEGV) in the FluentBit process, and the process restarts. How do I get visibility into failures like these? Since the wrapper restarts the FluentBit process itself, the Kubernetes pod never restarts, so the crash is invisible at the pod level.
Describe the solution you'd like
Some way to surface the issue so I can alert on it. In addition, I would like Kubernetes-native rollout behavior, so that if a new DaemonSet version is being rolled out and the first pod of the new version is failing, the rollout doesn't tear down the old pods.
Describe alternatives you've considered
I'm actually not sure how to work around this.
Additional context
No response
A potential solution is to add logic to the wrapper that:
1. Catches FluentBit failures and increments a Prometheus metric (see the sketch after this list)
2. Runs a sidecar that implements the pod's readiness/liveness probes
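A minimal sketch of what the first part could look like in the wrapper, assuming a supervision loop around the FluentBit binary; the metric name, port, and binary/config paths are illustrative and not existing fluent-operator names:

```go
package main

import (
	"log"
	"net/http"
	"os/exec"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// fluentBitRestarts counts how many times the wrapped FluentBit process
// exited and was restarted by the wrapper. The metric name is hypothetical.
var fluentBitRestarts = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "fluentbit_watcher_restarts_total",
	Help: "Number of times the FluentBit child process exited and was restarted.",
})

func main() {
	prometheus.MustRegister(fluentBitRestarts)

	// Expose /metrics so the counter can be scraped and alerted on.
	// The port is an assumption, chosen to avoid FluentBit's own HTTP port.
	go func() {
		http.Handle("/metrics", promhttp.Handler())
		log.Fatal(http.ListenAndServe(":2021", nil))
	}()

	// Supervise FluentBit: every time the child exits (e.g. on a SIGSEGV),
	// record the failure and start it again.
	for {
		cmd := exec.Command("/fluent-bit/bin/fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.conf")
		if err := cmd.Run(); err != nil {
			log.Printf("fluent-bit exited: %v", err)
		}
		fluentBitRestarts.Inc()
	}
}
```

An alert on `increase(fluentbit_watcher_restarts_total[5m]) > 0` (or similar) would then surface crash loops that today stay hidden behind the wrapper's restarts.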
@frankgreco This is a very good requirement.
Maybe we can add readiness/liveness probes to the watcher itself, and of course we can add Prometheus metrics to it as well.
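A rough sketch of how the watcher could expose that state to Kubernetes probes, assuming the supervision loop records crash times; the `/healthz` path, the port, and the one-minute crash window are assumptions, not existing watcher behavior:

```go
package main

import (
	"net/http"
	"sync/atomic"
	"time"
)

// lastCrash holds the Unix time of the most recent FluentBit crash observed
// by the watcher; zero means no crash has been seen yet. The variable and
// the /healthz path are hypothetical names.
var lastCrash atomic.Int64

// markCrash would be called from the supervision loop whenever FluentBit exits.
func markCrash() { lastCrash.Store(time.Now().Unix()) }

// healthz reports unhealthy if FluentBit crashed within the last minute, so
// liveness/readiness probes (and DaemonSet rollout logic) can react instead
// of the failure staying hidden inside the pod.
func healthz(w http.ResponseWriter, _ *http.Request) {
	if t := lastCrash.Load(); t != 0 && time.Since(time.Unix(t, 0)) < time.Minute {
		http.Error(w, "fluent-bit is crash-looping", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/healthz", healthz)
	http.ListenAndServe(":2021", nil)
}
```

With an endpoint like this, a failing pod of a new DaemonSet version never becomes ready, so the rollout stalls instead of replacing the working old pods.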
@benjaminhuo could you point me to where the watcher code lives?
The code of the fluentbit watcher is here: https://github.com/fluent/fluent-operator/tree/master/cmd/fluent-watcher/fluentbit