Hot reload issues

Open chrono2002 opened this issue 1 year ago • 4 comments

Describe the issue

We have a CI pipeline that deploys filters, parsers and outputs into several namespaces. Before each deployment, it deletes everything in the namespace.

Starting from version 2.7.0, we get the following errors:

[pod/fluent-bit-v4rzh/fluent-bit] level=info time=2024-07-19T14:05:03Z msg="Config file changed, reloading..."
[pod/fluent-bit-v4rzh/fluent-bit] level=info time=2024-07-19T14:05:03Z msg="Config file changed, reloading..."
[pod/fluent-bit-v4rzh/fluent-bit] level=info time=2024-07-19T14:05:03Z msg="Config file changed, reloading..."
[pod/fluent-bit-ctq8k/fluent-bit] level=info time=2024-07-19T14:05:06Z msg="Config file changed, reloading..."
[pod/fluent-bit-ctq8k/fluent-bit] level=info time=2024-07-19T14:05:06Z msg="Config file changed, reloading..."
[pod/fluent-bit-ctq8k/fluent-bit] level=info time=2024-07-19T14:05:06Z msg="Config file changed, reloading..."
[pod/fluent-bit-p9ftx/fluent-bit] level=info time=2024-07-19T14:05:17Z msg="Config file changed, reloading..."
[pod/fluent-bit-p9ftx/fluent-bit] level=info time=2024-07-19T14:05:17Z msg="Config file changed, reloading..."
[pod/fluent-bit-p9ftx/fluent-bit] level=info time=2024-07-19T14:05:17Z msg="Config file changed, reloading..."

It looks like fluent-bit reloads on every object deletion. When parsers are deleted before the filters that reference them, the reload fails and fluent-bit exits. It then restarts normally.

[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-playtest-ppp-rewrite] initializing
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-playtest-ppp-rewrite] storage_strategy='filesystem' (memory + filesystem)
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-dev04-rewrite] initializing
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-dev04-rewrite] storage_strategy='filesystem' (memory + filesystem)
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa16-rewrite] initializing
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa16-rewrite] storage_strategy='filesystem' (memory + filesystem)
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa18-rewrite] initializing
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa18-rewrite] storage_strategy='filesystem' (memory + filesystem)
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa10-rewrite] initializing
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa10-rewrite] storage_strategy='filesystem' (memory + filesystem)
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-dev03-rewrite] initializing
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-dev03-rewrite] storage_strategy='filesystem' (memory + filesystem)
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa19-rewrite] initializing
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa19-rewrite] storage_strategy='filesystem' (memory + filesystem)
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa20-rewrite] initializing
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa20-rewrite] storage_strategy='filesystem' (memory + filesystem)
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa17-rewrite] initializing
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa17-rewrite] storage_strategy='filesystem' (memory + filesystem)
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa11-rewrite] initializing
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa11-rewrite] storage_strategy='filesystem' (memory + filesystem)
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qc-rewrite] initializing
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qc-rewrite] storage_strategy='filesystem' (memory + filesystem)
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-mainline-qa-rewrite] initializing
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-mainline-qa-rewrite] storage_strategy='filesystem' (memory + filesystem)
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [error] [filter:parser:parser.323] requested parser 'cw-meta-meta-server-json-message-time-field-60d52537bbd89f341cbf30ffd3c7677d' not found
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [error] [filter:parser:parser.323] Invalid 'parser'
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [error] Failed initialize filter parser.323
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [error] [engine] filter initialization failed
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing tail.0
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing storage_backlog.1
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-playtest-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa09-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-xxx01-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-gd01-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa03-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-playtest-ppp-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-dev04-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa16-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa18-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa10-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-dev03-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa19-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa20-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa17-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa11-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qc-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-mainline-qa-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa15-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa04-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa02-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-mainline-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-xxx02-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa01-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-ld01-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-dev02-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa07-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa12-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-consoles-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-dev01-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa14-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa05-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-xxx03-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa06-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa13-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa08-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-playtest-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa09-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-xxx01-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-gd01-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa03-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-playtest-ppp-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-dev04-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa16-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa18-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa10-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-dev03-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa19-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa20-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa17-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa11-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qc-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-mainline-qa-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [error] [reload] loaded configuration contains error(s). Reloading is aborted
[pod/fluent-bit-xp5lp/fluent-bit] reloading is aborted and exit
[pod/fluent-bit-xp5lp/fluent-bit] level=error time=2024-07-19T13:57:40Z msg="Failure during the run time of fluent-bit" error="failed to run fluent-bit: exit status 255"

To Reproduce

  • create several namespaces
  • create several parsers and filters in every namespace
  • delete then redeploy namespace

Expected behavior

  • reload only once, after all operations in the namespace have finished
  • check the config before reloading, and delay the reload if it is not valid
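The second point could be sketched as a pre-reload check for dangling parser references, which is the failure shown in the log above (`requested parser '...' not found`). This is only an illustration: `ParserFilter`, `missing_parsers` and `safe_to_reload` are hypothetical names, not part of fluent-operator or fluent-bit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParserFilter:
    """Hypothetical stand-in for a filter that references a parser by name."""
    name: str
    parser: str


def missing_parsers(filters, known_parsers):
    """Return (filter name, parser name) pairs whose parser is not defined."""
    known = set(known_parsers)
    return [(f.name, f.parser) for f in filters if f.parser not in known]


def safe_to_reload(filters, known_parsers):
    """Allow a reload only when every referenced parser exists."""
    return not missing_parsers(filters, known_parsers)
```

With a check like this, a reload triggered while a namespace is half-deleted would be skipped or postponed rather than attempted with an invalid config.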

Your Environment

- Fluent Operator version: >2.7.0
- Container Runtime: containerd
- Operating system: ubuntu
- Kernel version:

How did you install fluent operator?

helm

Additional context

No response

chrono2002, Jul 19 '24 14:07

> reload only once when all operations in ns finishes

How does fluent-operator know when your operations are done?

In my opinion, you can control the create/delete order in your CI system, and this problem will be resolved.

cw-Guo, Jul 20 '24 19:07

> > reload only once when all operations in ns finishes
>
> how does fluent-operator know when your operations are done?
>
> in my opinion, you can control the create/delete orders in your CI system and this problem will be resolved.

How exactly are you suggesting we control it? We have a Helm chart that simply installs parsers, filters and outputs. We've tried placing the parsers section before the filters section, and the filters section before the parsers section, with no luck.

chrono2002, Jul 22 '24 08:07

@cw-Guo we use GitOps to deploy, and when we deploy a bigger application, many fluent-operator CRs get created, which seems to trigger many reloads on the fluent-bit pods.

This causes trouble for us, as fluent-bit starts hanging from time to time (https://github.com/fluent/fluent-operator/issues/1332).

It seems fluent-bit has some issues with hot reload: https://github.com/fluent/fluent-bit/issues/9354

While these are most probably fluent-bit bugs, being a bit more "kind" with the reload requests could help.

How about a solution where, instead of reloading immediately on every CR change, fluent-operator "collects" the changes for some configurable period (like 1 minute) and triggers a single reload if any change happened during that period?
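A minimal sketch of that "collect, then reload once" idea, assuming a periodic `tick()` driven by whatever loop watches for changes. `ReloadBatcher` and its parameters are hypothetical and do not reflect how fluent-operator is implemented.

```python
import time


class ReloadBatcher:
    """Coalesce config-change notifications into at most one reload per window."""

    def __init__(self, reload, window_seconds=60.0, clock=time.monotonic):
        self._reload = reload          # callable that triggers the actual reload
        self._window = window_seconds  # how long to collect changes
        self._clock = clock            # injectable so the logic can be tested
        self._deadline = None          # None means no change is pending

    def notify_change(self):
        """Record a change; the first change in a batch starts the window."""
        if self._deadline is None:
            self._deadline = self._clock() + self._window

    def tick(self):
        """Call periodically; fires one reload once the window has elapsed."""
        if self._deadline is not None and self._clock() >= self._deadline:
            self._deadline = None
            self._reload()
            return True
        return False
```

Here the window starts at the first change and is not extended by later ones, so a steady stream of changes still yields a reload once per window instead of postponing it indefinitely.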

ping @markusthoemmes

Cajga, Sep 10 '24 13:09

I'm not really active in this project right now, but I did eventually solve this internally. Essentially, I've created a script that gets the current reload count (GET "http://0.0.0.0:2020/api/v2/reload") and then runs a hot reload. Afterwards, it gets the reload count again. If it is the same as before, the script retries the reload. The need for that was supposed to be fixed via https://github.com/fluent/fluent-bit/issues/8457 though, so now we should be able to handle the return value of the reload and retry on error.
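The script described above might look roughly like the following. This is a reconstruction from the description, not the actual internal script. The `hot_reload_count` field name and the POST with an empty JSON body are assumptions about the fluent-bit HTTP API, and `reload_with_retry` with its parameters is a hypothetical helper.

```python
import json
import time
import urllib.request

RELOAD_URL = "http://0.0.0.0:2020/api/v2/reload"  # URL quoted in the comment above


def get_reload_count(url=RELOAD_URL):
    """Fetch the reload counter; the 'hot_reload_count' field is an assumption."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)["hot_reload_count"]


def trigger_reload(url=RELOAD_URL):
    """Request a hot reload; POST with an empty JSON body is an assumption."""
    req = urllib.request.Request(url, data=b"{}", method="POST")
    with urllib.request.urlopen(req, timeout=5) as resp:
        resp.read()


def reload_with_retry(get_count=get_reload_count, trigger=trigger_reload,
                      attempts=5, delay_seconds=1.0, sleep=time.sleep):
    """Retry the reload until the counter changes; return True on success."""
    for _ in range(attempts):
        before = get_count()
        trigger()
        sleep(delay_seconds)  # give fluent-bit time to apply the reload
        if get_count() != before:
            return True
    return False
```

The HTTP helpers are passed in as arguments so the retry loop can be exercised without a running fluent-bit. Errors raised by the HTTP calls are not caught in this sketch and propagate to the caller.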

markusthoemmes, Sep 10 '24 14:09