fluent-bit: Hot Reload _sometimes_ does not work

Hey All,

We've been putting together a POC that uses the hot reload feature of fluent-bit. The general idea is that a custom CRD will be installed, and a custom k8s operator of ours will watch for the CRD, read its values, and add a file to the fluent-bit configmap that defines a new output. Our operator then sends an HTTP request to fluent-bit so that it reloads itself and picks up the newly added configmap file. Our fluent-bit.conf file looks like so:

[SERVICE]
    Flush 60
    Log_Level debug
    Parsers_File /fluent-bit/etc/parsers.conf
    Parsers_File /fluent-bit/etc/conf/custom_parsers.conf
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port 2020
    Health_Check On
    Hot_Reload On

@INCLUDE *_inputs.conf
@INCLUDE *_filters.conf
@INCLUDE *_outputs.conf

When we go to add a new file to the configmap we'll name it something like hello-world_outputs.conf, then POST to the "reload endpoint", and expect those logs to start flowing...
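For reference, here is a minimal sketch of what "POST to the reload endpoint" could look like from an operator. It assumes the `/api/v2/reload` path and the empty JSON body shown in the Fluent Bit hot reload docs (neither is spelled out in this thread), plus the `HTTP_Port 2020` from the config above. The function names are made up for illustration.

```python
import urllib.request


def build_reload_request(host: str = "127.0.0.1", port: int = 2020) -> urllib.request.Request:
    """Build the POST request for the hot reload endpoint (path is an assumption)."""
    url = f"http://{host}:{port}/api/v2/reload"
    # An empty JSON object is sent as the body, mirroring `curl -X POST -d '{}'`.
    return urllib.request.Request(
        url,
        data=b"{}",
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def trigger_reload(host: str = "127.0.0.1", port: int = 2020, timeout: float = 5.0) -> str:
    """Send the reload request and return the raw response body as text."""
    request = build_reload_request(host, port)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `trigger_reload("10.0.0.5")` would hit a single pod. Nothing here waits for the configmap change to reach the container first, so on its own this may have the same timing problem as any other caller.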

50% of the time this works 100% of the time :)

Whenever we hit the reload endpoint we always see the relevant logging about catching the sighup signal and see fluent-bit recycle itself from within, so that part seems to work. What doesn't always work is fluent-bit actually sending those logs to the new OUTPUT destination. We don't see anything in the logs themselves to suggest things are amiss, even with debug enabled. Fluent-bit is clearly picking things up, as we see logs related to the store_dir value (this is an S3 plugin), but logs never make it to the bucket. We've got the upload_timeout set to 60s and the total_file_size set to 5M. We've waited hours and still nothing starts streaming, even though we have test pods that are dumping loads of logs per second.

This is somewhat reproducible for us: the very first time we add the new file to the existing configmap and hit the reload endpoint, things don't work. If we delete the file from the configmap, hit the reload endpoint, add that file back to the configmap, and then hit the reload endpoint again, things seem to start working, as odd as that may seem.

We tried latest fluent-bit on the 2.x series, as well as the 3.x series, and still the same behavior.

What does work for us though is if we add the noted file to the existing configmap, then effectively do a rolling restart, fluent-bit will 100% of the time pick up the new file and start sending logs as expected.

Any ideas on what might be going on? Is this a potential bug? Anything else I can give you all to help diagnose the issue? Again ... this doesn't seem to happen all the time but something like 50% of the time if I had to guess. It's really weird and odd behavior and feels like we either may just be getting lucky when it does work, or there is some cache issue at play, or something else.

It should also be noted we have, by default, about 10 or so outputs defined. I don't know if that matters one way or another but just putting it out there in case this is somehow related to load or too many outputs or whatever. For testing purposes we did trim the "default outputs" down to just 1 but that didn't seem to help at all.

Any help or pointers would be greatly appreciated as we would love to be able to go to production with just hitting the "reload endpoint" and not have to call k8s api to recycle our fluent-bit pods manually.

EDIT: something else worth noting ... whenever we hit the GET endpoint directly after a reload it's always empty. No json or anything.
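A small sketch for checking the GET response, in case it helps when comparing results. It assumes the documented response shape, a JSON object with a `hot_reload_count` field, which is not confirmed anywhere in this thread, and it treats an empty body like the one described above as "no count available". The helper name is hypothetical.

```python
import json
from typing import Optional


def parse_reload_count(body: str) -> Optional[int]:
    """Return hot_reload_count from a GET response body, or None if it is missing."""
    if not body.strip():
        # Empty body, as reported above: nothing to parse.
        return None
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    count = payload.get("hot_reload_count")
    return count if isinstance(count, int) else None
```

Comparing the value before and after a POST would show whether a reload was actually counted, as opposed to only being requested.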

Thanks, Chris

Jun 05 '24 17:06 cdancy

tagging @edsiper @PettitWesley as you all have helped us before and are very knowledgeable in this area.

Jun 05 '24 17:06 cdancy

@patrick-stephens we're seeing this too.

Jul 08 '24 14:07 stevehipwell

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

Oct 07 '24 02:10 github-actions[bot]

@aydosman did we get to the bottom of why we thought this was happening? I vaguely remember something about the reload failing when something else in the configuration was erroring (I think we were seeing this when using reload to switch backends for the forward output).

Oct 07 '24 08:10 stevehipwell

I think @pwhelan has been working on some fixes around this as well, which we likely need to push upstream @stevehipwell.

Oct 07 '24 08:10 patrick-stephens

We needed to ensure two things: first, that a service reload actually worked (in our case, this involved restarting the pod, with the issue being the correct referencing of configuration files using the absolute path); and second, if Fluent Bit wasn’t in a good state (e.g., the reloader receiving a -2 response), we made sure that when certain configurations were updated (like an output host), the pod would be recreated instead of performing a reload.
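Roughly, the decision described above could be sketched like this. It is an illustration only: `reload` and `restart` are hypothetical callables supplied by the caller, and treating a status of 0 as the only "good" result is an assumption, not something confirmed in this thread.

```python
from typing import Callable, Optional


def apply_config_change(
    reload: Callable[[], int],
    restart: Callable[[], None],
    force_restart: bool = False,
) -> str:
    """Hot reload when possible, otherwise fall back to recreating the pod."""
    if force_restart:
        # Some changes (for example an output host) skip the reload entirely.
        restart()
        return "restarted"
    status: Optional[int]
    try:
        status = reload()
    except OSError:
        # The endpoint could not be reached at all.
        status = None
    if status == 0:
        return "reloaded"
    # Any other status, including -2, is treated as "not in a good state".
    restart()
    return "restarted"
```

The caller decides what `restart` means, for example deleting the pod so that its controller recreates it.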

Oct 07 '24 13:10 aydosman

@stevehipwell @patrick-stephens @aydosman for us, and as has been mentioned and documented elsewhere (though not officially in the fluent-bit docs), the problem seems to come from the fact that dynamic configuration files are not supported for hot reload.

Oct 07 '24 13:10 cdancy

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

Jan 07 '25 02:01 github-actions[bot]

/not-stale

Jan 07 '25 09:01 stevehipwell

@cosmo0920 any chance you could have a quick look at this?

Jan 07 '25 14:01 patrick-stephens

This is still an open issue for us. Hot reload does not work with glob paths in the fluent-bit config. We have an operator we run off to the side that has to rollout/restart our fluentbit deployment each time we make a change which is not ideal especially within bigger environments.

Jan 08 '25 15:01 cdancy

@cdancy When you say dynamic paths do you mean paths with file globs?

Apr 16 '25 14:04 pwhelan

@cdancy are you seeing the reload fail or just not be triggered?

Apr 16 '25 15:04 stevehipwell

Apologies for the late response as I've been out on vacation...

> When you say dynamic paths do you mean paths with file globs?

@pwhelan exactly. Updated title of issue to convey as much.

> are you seeing the reload fail or just not be triggered?

@stevehipwell just not triggered. Hitting the endpoint basically creates a no-op. I confirmed this is true for fluentbit 4.0.1 as well. We'd love to get this worked on and fixed as currently we have an operator that has to rollout restart any time we make a change.

EDIT: if someone has a dev image with a fix they propose I can certainly try that out if and when it's made available. Just scream in my direction.

Apr 29 '25 14:04 cdancy

> @stevehipwell just not triggered. Hitting the endpoint basically creates a no-op. I confirmed this is true for fluentbit 4.0.1 as well. We'd love to get this worked on and fixed as currently we have an operator that has to rollout restart any time we make a change.

I'd say this is a third case. "Not triggered" would mean no webhook fired, which is generally caused by an incorrect mounting of the ConfigMap.

Apr 29 '25 14:04 stevehipwell

@stevehipwell is there anything we can do on our end? When we do a rolling restart with our changes applied everything works as expected with the new configs, and config map files, picked up when using glob patterns. Here is what our default config-map looks like:

      [SERVICE]
          Flush 1
          Log_Level info
          Parsers_File /fluent-bit/etc/parsers.conf
          Parsers_File /fluent-bit/etc/conf/custom_parsers.conf
          HTTP_Server On
          HTTP_Listen 0.0.0.0
          HTTP_Port 2020
          Health_Check On
          Hot_Reload On
          storage.path /var/log/flb-storage/
          storage.sync full
          storage.checksum off
          storage.max_chunks_up 64

      @INCLUDE *_inputs.conf
      @INCLUDE *_filters.conf
      @INCLUDE *_outputs.conf

Our fluentbit deployment is very dynamic as we add/configure new outputs/etc at runtime. If there is some way we can configure this through the helm chart to get things picked up please let us know.

FYI: The absolute path of the glob patterns won't be known ahead of time and will be unique when they are added/applied at runtime.
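As a quick sanity check, a sketch like the following can confirm that a new file name would be matched by the include patterns above. It uses Python's `fnmatch`, which only approximates whatever matching Fluent Bit performs for `@INCLUDE`, so it can rule out a naming mistake but cannot prove that the reload itself re-evaluates the globs. The helper name is hypothetical.

```python
import fnmatch
from typing import Optional, Sequence

# The three patterns from the @INCLUDE lines in the config above.
INCLUDE_PATTERNS = ("*_inputs.conf", "*_filters.conf", "*_outputs.conf")


def matching_pattern(filename: str, patterns: Sequence[str] = INCLUDE_PATTERNS) -> Optional[str]:
    """Return the first include pattern that matches filename, or None."""
    for pattern in patterns:
        # fnmatchcase keeps the comparison case-sensitive on every platform.
        if fnmatch.fnmatchcase(filename, pattern):
            return pattern
    return None
```

For example, `matching_pattern("hello-world_outputs.conf")` returns `"*_outputs.conf"`, while a name such as `fluent-bit.conf` matches none of the patterns.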

Apr 29 '25 15:04 cdancy

@cdancy have you run a test to prove that the ConfigMap changes can be propagated to a running container by K8s?

Apr 29 '25 17:04 stevehipwell

@stevehipwell depends on what you mean by a "running container". We've been using this model in production for over a year now: to get things to stick we effectively do a rolling restart of the fluentbit daemonset to get the changes picked up. Here is the order of operations:

1. Update fluentbit ConfigMap with new uniquely named config files
2. Do a rolling restart on the daemonset to get the changes picked up by fluentbit

This works as expected ... but I'd like to not have to do step #2 and have fluentbit recognize that a new config map file has been added/updated/removed/whatever to the default config-map

Apr 29 '25 21:04 cdancy

@cdancy my question is have you validated that when you modify a ConfigMap a running pod with the same mount pattern can see the change? If your changes are not making it into the pod that'd explain why the rollout works but the reload endpoint doesn't.

Apr 29 '25 21:04 stevehipwell

@stevehipwell oh yes I can absolutely confirm that: just not sure how to get around it or even if it’s possible.

Apr 29 '25 23:04 cdancy

> @stevehipwell oh yes I can absolutely confirm that: just not sure how to get around it or even if it’s possible.

@cdancy I just ran my own test and it takes on the order of 30s for the change to the ConfigMap to be visible in the container filesystem. If you're using the built-in hot reload sidecar implementation this should work correctly, but if you're calling the endpoint from another system you may have a timing issue?

Apr 30 '25 11:04 stevehipwell

@stevehipwell how did you test things? Did you use glob patterns like I did and then add another unique file to the config map? Any other config you applied? What version of fluentbit?

I can give things another go this morning and see if waiting longer helps.

If we have the hot reload configured correctly do we also need to hit the endpoint? It sounds like you’re saying that is not necessary?

Apr 30 '25 12:04 cdancy

@cdancy I just tested that FB & the reloader can see new files added to the ConfigMap, I didn't test your configuration. The hot reload functionality hits the endpoint for you so you shouldn't need to do anything extra.

Apr 30 '25 12:04 stevehipwell

@stevehipwell I can confirm issue does indeed exist in our environment even with latest-and-greatest. We're not relying on the "configmap reloader" sidecar but instead relying on fluentbit to reload itself when the configmap, or any file within, changes.

May 02 '25 16:05 cdancy

@cdancy AFAIK and based on the docs Fluent Bit can't reload itself. We ship the sidecar in the official Helm chart for this purpose. As I mentioned above, if you're manually hitting the HTTP endpoint (or sending a signal) when the configmap changes you may well be triggering this before the changes are available in the pod.

May 02 '25 16:05 stevehipwell

> As I mentioned above, if you're manually hitting the HTTP endpoint

@stevehipwell this is exactly what we're doing. I know you mentioned that it sometimes takes up to a half hour to get picked up? This would be way too long for us as we're expecting, once we hit that endpoint, that fluentbit will reload itself with the new config we previously applied. Is there any way to configure fluentbit to not wait a half hour? Maybe poll for configmap changes sooner rather than later?

Even when hitting that endpoint, and waiting for half an hour with our glob paths, we're still not seeing the changes getting picked up.

May 02 '25 16:05 cdancy

@cdancy I didn't say half an hour, I said I'd seen a 30s delay between the API server accepting the modification to the ConfigMap and it propagating to the pod. If you call the reload endpoint before the propagation has happened it'll be at best a no-op, and at worst it may cause issues if the files change while being reloaded. The config map reloader sidecar solves this problem as it only triggers when the propagation is complete. If you want to manually call the endpoint you'll need to figure out the max propagation delay and wait accordingly.
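One way to "wait accordingly" without guessing a fixed delay is to poll for the new file before triggering the reload. The sketch below is only that, a sketch: it has to run somewhere that can see the mounted config directory (for example a sidecar in the same pod), and the function name, default timeout, and polling interval are all assumptions.

```python
import os
import time
from typing import Callable


def wait_for_file(
    path: str,
    timeout: float = 120.0,
    interval: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until path exists; return False if the timeout expires first."""
    deadline = clock() + timeout
    while True:
        if os.path.exists(path):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Only once this returns `True` for the newly added file would the reload endpoint be called. Note that Kubernetes does not propagate ConfigMap updates to volumes mounted with `subPath`, in which case no amount of waiting will help.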

May 02 '25 17:05 stevehipwell

> I didn't say half an hour, I said I'd seen a 30s delay between the API server accepting the modification to the ConfigMap and it propagating to the pod.

Got it ... is there no way to increase that time period within the fluentbit code?

May 02 '25 17:05 cdancy

@cdancy that's Kubernetes, it's nothing to do with Fluent Bit.

I'm not sure what you think is happening when you modify the ConfigMap and call an HTTP endpoint? They are two separate systems. Could I also check that you're hitting all pods directly?
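To illustrate what "hitting all pods directly" could mean: each pod runs its own Fluent Bit process, so a request sent through a Service would typically reach only one of them. The sketch below takes a list of pod IPs (how they are discovered is left out) and a caller-supplied `post` function; both names, and the `/api/v2/reload` path, are assumptions.

```python
from typing import Callable, Dict, Iterable


def reload_all(
    pod_ips: Iterable[str],
    post: Callable[[str], None],
    port: int = 2020,
) -> Dict[str, str]:
    """POST the reload request to every pod; return failures keyed by pod IP."""
    failures: Dict[str, str] = {}
    for ip in pod_ips:
        url = f"http://{ip}:{port}/api/v2/reload"
        try:
            post(url)
        except Exception as exc:  # keep going so one bad pod does not block the rest
            failures[ip] = str(exc)
    return failures
```

An empty result means every pod accepted the request, which is still not the same as every pod having loaded the new files.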

But just use the hot reload capability, that's what it's there for.

May 02 '25 18:05 stevehipwell
