bug: fluent-bit process once again hangs sometimes after being restarted

Open jjsiv opened this issue 1 year ago • 3 comments

Describe the issue

Some time ago the fluentbit-watcher was reworked to use the hot-reload feature: https://github.com/fluent/fluent-operator/commit/90d364bfc1d42187eaddc600b1a6077fa1cfa4f5

That rework also removed the SIGKILL call that was made when the process was hanging, so the issue I initially reported in #510 has been reintroduced.

This is something that ideally would be fixed in fluent-bit itself (and I will report it there as well once I investigate this problem more in-depth and can reproduce it consistently...), but in the meantime I think it would be great to have handling for these situations reintroduced in fluent-operator.

To Reproduce

No clear steps to reproduce. It seems to happen when fluent-bit is restarted many times in a row, but not every time.

Expected behavior

Fluent-bit is restarted and works

Your Environment

- Fluent Operator version:
- Container Runtime:
- Operating system:
- Kernel version:

How did you install fluent operator?

No response

Additional context

Keeping this as somewhat of a reminder to get back to this after 18.11 or so.

jjsiv avatar Nov 08 '24 21:11 jjsiv

I think this issue might be due to Fluent Bit. I will try to reproduce and test it.

wenchajun avatar Nov 18 '24 02:11 wenchajun

Sure. I've been testing a livenessProbe as a workaround to restart the pod when it happens, but I'm not sure yet whether it works. Here is a log from when the issue happens:

level=info time=2024-10-07T14:23:52Z msg="Config file changed, reloading..."
[2024/10/07 14:23:52] [engine] caught signal (SIGHUP)
level=info time=2024-10-07T14:23:52Z msg="Config file changed, reloading..."
level=info time=2024-10-07T14:23:52Z msg="Config file changed, reloading..."
[2024/10/07 14:23:52] [engine] caught signal (SIGHUP)
[2024/10/07 14:23:52] [2024/10/07 14:23:52] [error] reloading in progress, aborting.
[engine] caught signal (SIGHUP)
[2024/10/07 14:23:52] [error] reloading in progress, aborting.
[2024/10/07 14:23:52] [error] reloading in progress, aborting.
level=info time=2024-10-07T15:35:46Z msg="Config file changed, reloading..."
[2024/10/07 15:35:46] [engine] caught signal (SIGHUP)
[2024/10/07 15:35:46] [error] reloading in progress, aborting.
level=info time=2024-10-07T15:35:46Z msg="Config file changed, reloading..."
level=info time=2024-10-07T15:35:46Z msg="Config file changed, reloading..."
[2024/10/07 15:35:46] [engine] caught signal (SIGHUP)
[2024/10/07 15:35:46] [error] reloading in progress, aborting.
level=info time=2024-10-07T16:35:34Z msg="Config file changed, reloading..."
[2024/10/07 16:35:34] [engine] caught signal (SIGHUP)
[2024/10/07 16:35:34] [error] reloading in progress, aborting.
level=info time=2024-10-07T16:35:34Z msg="Config file changed, reloading..."
level=info time=2024-10-07T16:35:34Z msg="Config file changed, reloading..."
[2024/10/07 16:35:34] [engine] caught signal (SIGHUP)
[2024/10/07 16:35:34] [2024/10/07 16:35:34] [error] reloading in progress, aborting.
[engine] caught signal (SIGHUP)
[2024/10/07 16:35:34] [error] reloading in progress, aborting.

And nothing happens after that. The process is still running in the pod, but logs are not collected. I did not think to check whether the server is responsive, but I will if I see it happen again.

Ultimately I think this issue could be closed and moved to fluent-bit's repo. Perhaps this shouldn't be fixed in fluent-operator, as any "fix" there would be just a workaround.

jjsiv avatar Nov 18 '24 13:11 jjsiv

I've added some more info on an existing fluent-bit issue: https://github.com/fluent/fluent-bit/issues/9354#issuecomment-2493656577

@wenchajun @benjaminhuo - what is your opinion on this? Should a workaround for this problem be added to fluent-operator once again, or should we wait until the problem is resolved in fluent-bit (uncertain when)?

jjsiv avatar Nov 22 '24 12:11 jjsiv