Observability into FluentBit Failures
Is your feature request related to a problem? Please describe.
Say you're running a FluentBit version prior to v1.9.3 and you use `-` in certain key names. This causes a segmentation fault (SIGSEGV) in the FluentBit process, and the process restarts. How do I get visibility into failures like these? Since the wrapper restarts the FluentBit process itself, the Kubernetes pod never restarts, so the crash is invisible at the pod level.
Describe the solution you'd like
Some way to surface the issue so I can alert on it. In addition, I would like Kubernetes-native rollout behavior, so that if a new DaemonSet version is being rolled out and the first pod of the new version is failing, the rollout doesn't tear down the old pods.
Describe alternatives you've considered
I'm actually not sure how to work around this.
Additional context
No response
A potential solution is to add logic to the wrapper that:
1. Catches FluentBit failures and increments a Prometheus metric (see the sketch after this list)
2. Runs a sidecar that implements the pod's readiness/liveness probes
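A minimal sketch of what the first part could look like in the wrapper, assuming a supervision loop around the FluentBit binary; the metric name, port, and binary/config paths are illustrative and not existing fluent-operator names:

```go
package main

import (
	"log"
	"net/http"
	"os/exec"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// fluentBitRestarts counts how many times the wrapped FluentBit process
// exited and was restarted by the wrapper. The metric name is hypothetical.
var fluentBitRestarts = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "fluentbit_watcher_restarts_total",
	Help: "Number of times the FluentBit child process exited and was restarted.",
})

func main() {
	prometheus.MustRegister(fluentBitRestarts)

	// Expose /metrics so the counter can be scraped and alerted on.
	// The port is an assumption, chosen to avoid FluentBit's own HTTP port.
	go func() {
		http.Handle("/metrics", promhttp.Handler())
		log.Fatal(http.ListenAndServe(":2021", nil))
	}()

	// Supervise FluentBit: every time the child exits (e.g. on a SIGSEGV),
	// record the failure and start it again.
	for {
		cmd := exec.Command("/fluent-bit/bin/fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.conf")
		if err := cmd.Run(); err != nil {
			log.Printf("fluent-bit exited: %v", err)
		}
		fluentBitRestarts.Inc()
	}
}
```

An alert on `increase(fluentbit_watcher_restarts_total[5m]) > 0` (or similar) would then surface crash loops that today stay hidden behind the wrapper's restarts.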
@frankgreco This is a very good requirement.
Maybe we can add readiness/liveness probes to the watcher itself, and of course we can add Prometheus metrics to it as well.
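A rough sketch of how the watcher could expose that state to Kubernetes probes, assuming the supervision loop records crash times; the `/healthz` path, the port, and the one-minute crash window are assumptions, not existing watcher behavior:

```go
package main

import (
	"net/http"
	"sync/atomic"
	"time"
)

// lastCrash holds the Unix time of the most recent FluentBit crash observed
// by the watcher; zero means no crash has been seen yet. The variable and
// the /healthz path are hypothetical names.
var lastCrash atomic.Int64

// markCrash would be called from the supervision loop whenever FluentBit exits.
func markCrash() { lastCrash.Store(time.Now().Unix()) }

// healthz reports unhealthy if FluentBit crashed within the last minute, so
// liveness/readiness probes (and DaemonSet rollout logic) can react instead
// of the failure staying hidden inside the pod.
func healthz(w http.ResponseWriter, _ *http.Request) {
	if t := lastCrash.Load(); t != 0 && time.Since(time.Unix(t, 0)) < time.Minute {
		http.Error(w, "fluent-bit is crash-looping", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/healthz", healthz)
	http.ListenAndServe(":2021", nil)
}
```

With an endpoint like this, a failing pod of a new DaemonSet version never becomes ready, so the rollout stalls instead of replacing the working old pods.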
@benjaminhuo could you point me to where the watcher code lives?
The code of the fluentbit watcher is here: https://github.com/fluent/fluent-operator/tree/master/cmd/fluent-watcher/fluentbit