bug: fluent-bit process once again hangs sometimes after being restarted
Describe the issue
Some time ago the fluentbit-watcher was reworked to use the hot-reload feature: https://github.com/fluent/fluent-operator/commit/90d364bfc1d42187eaddc600b1a6077fa1cfa4f5
That change also removed the SIGKILL call for when the process is hanging, so the issue that I initially reported in #510 has been reintroduced.
This is something that ideally would be fixed in fluent-bit itself (and I will report it there as well once I investigate this problem more in-depth and can reproduce it consistently...), but in the meantime I think it would be great to have handling for these situations reintroduced in fluent-operator.
To Reproduce
No clear steps to reproduce. It seems to happen when fluent-bit is restarted many times in a row, but not always.
Expected behavior
Fluent-bit is restarted and works
Your Environment
- Fluent Operator version:
- Container Runtime:
- Operating system:
- Kernel version:
How did you install fluent operator?
No response
Additional context
Keeping this as somewhat of a reminder to get back to this after 18.11 or so.
I think this issue might be due to Fluent Bit. I will try to reproduce and test it.
Sure, I've been testing livenessProbe as a workaround to restart the pod when it happens, not sure if it works yet. Here is a log from when the issue happens:
level=info time=2024-10-07T14:23:52Z msg="Config file changed, reloading..."
[2024/10/07 14:23:52] [engine] caught signal (SIGHUP)
level=info time=2024-10-07T14:23:52Z msg="Config file changed, reloading..."
level=info time=2024-10-07T14:23:52Z msg="Config file changed, reloading..."
[2024/10/07 14:23:52] [engine] caught signal (SIGHUP)
[2024/10/07 14:23:52] [2024/10/07 14:23:52] [error] reloading in progress, aborting.
[engine] caught signal (SIGHUP)
[2024/10/07 14:23:52] [error] reloading in progress, aborting.
[2024/10/07 14:23:52] [error] reloading in progress, aborting.
level=info time=2024-10-07T15:35:46Z msg="Config file changed, reloading..."
[2024/10/07 15:35:46] [engine] caught signal (SIGHUP)
[2024/10/07 15:35:46] [error] reloading in progress, aborting.
level=info time=2024-10-07T15:35:46Z msg="Config file changed, reloading..."
level=info time=2024-10-07T15:35:46Z msg="Config file changed, reloading..."
[2024/10/07 15:35:46] [engine] caught signal (SIGHUP)
[2024/10/07 15:35:46] [error] reloading in progress, aborting.
level=info time=2024-10-07T16:35:34Z msg="Config file changed, reloading..."
[2024/10/07 16:35:34] [engine] caught signal (SIGHUP)
[2024/10/07 16:35:34] [error] reloading in progress, aborting.
level=info time=2024-10-07T16:35:34Z msg="Config file changed, reloading..."
level=info time=2024-10-07T16:35:34Z msg="Config file changed, reloading..."
[2024/10/07 16:35:34] [engine] caught signal (SIGHUP)
[2024/10/07 16:35:34] [2024/10/07 16:35:34] [error] reloading in progress, aborting.
[engine] caught signal (SIGHUP)
[2024/10/07 16:35:34] [error] reloading in progress, aborting.
And nothing happens after that. The process is still running in the pod, but logs are not collected. I did not think to check whether the server is responsive, but I will if I see it happen again.
Ultimately, I think this issue could be closed and moved to fluent-bit's repo. Perhaps this shouldn't be fixed in fluent-operator, as any "fix" would be just a workaround.
I've added some more info on an existing fluent-bit issue: https://github.com/fluent/fluent-bit/issues/9354#issuecomment-2493656577
@wenchajun @benjaminhuo - what is your opinion on this? Is a workaround for this problem something that should once again be added to fluent-operator, or should we wait until it is resolved in fluent-bit (uncertain when)?