
Kubernetes container logs missing after update to the 3.1.6 docker image

Open diresqrl opened this issue 1 year ago • 2 comments

Bug Report

Describe the bug After updating the Docker image used for fluent-bit in our k8s clusters to 3.1.6, with no other configuration changes, our tail input for k8s container logs stopped functioning. A second tail input in the same deployed config, reading other logs on the same mount/PVC, continues to work properly. Rolling back to 3.1.4 fixes the issue; both 3.1.5 and 3.1.6 exhibit the same problem.

Debug logs show that the log files for input_containers are identified on the filesystem and watched, but log lines are never processed. Exported Prometheus metrics also show that, going from 3.1.4 to 3.1.6, the input records total for that one tail input drops to 0, while the files-closed and files-rotated metrics continue to tick up as expected. Setting Inotify_Watcher to false has no impact.

To rule out environment-specific oddities in our k8s clusters, I reproduced this in a local k8s setup using kind. I also systematically removed all extraneous inputs, filters, and outputs from our fluent-bit config to isolate the problem.

To Reproduce

  • Steps to reproduce the problem:
    1. Deploy fluent-bit with the tail input config below (as a DaemonSet or Deployment) into a k8s cluster using version 3.1.4, and confirm container logs appear in the output and in the metrics.
    2. Update the fluent-bit image to 3.1.6 (or 3.1.5) and verify whether container logs still appear in the output and in the metrics.
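
The steps above can be sketched with kubectl; the namespace (`logging`) and DaemonSet name (`fluent-bit`) are illustrative, not taken from the report:

```shell
# Switch the DaemonSet to the suspect image and wait for the rollout:
kubectl -n logging set image daemonset/fluent-bit \
    fluent-bit=fluent/fluent-bit:3.1.6
kubectl -n logging rollout status daemonset/fluent-bit

# Watch fluent-bit's own logs; with the bug present, the tail input's
# files are opened and watched but no records are ever emitted:
kubectl -n logging logs ds/fluent-bit --tail=50

# Roll back to the last known-good release:
kubectl -n logging set image daemonset/fluent-bit \
    fluent-bit=fluent/fluent-bit:3.1.4
```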

Expected behavior Container logs continue to process through fluent-bit without changing configuration.

Your Environment

  • Version used: 3.1.6 (and 3.1.5)
  • Configuration:
[SERVICE]
    daemon Off
    flush 1
    log_level info
    health_check On
    storage_metrics On
[INPUT]
    name fluentbit_metrics
    alias input_prometheus
    tag metrics.internal
    scrape_interval 10
[INPUT]
    name tail
    alias input_containers
    path /var/log/containers/*.log
    multiline.parser docker, cri
    tag kube.*
    mem_buf_limit 5MB
    skip_long_lines On
[OUTPUT]
    name prometheus_exporter
    alias output_prometheus
    match metrics.*
    host 0.0.0.0
    port 2020
[OUTPUT]
    name splunk
    alias output_splunk_containers
    match kube.*
    host ${SPLUNK_HEC_HOST}
    port 443
    splunk_token ${SPLUNK_HEC_TOKEN}
    event_index default_platform_index
    event_host ${K8S_POD_NAME}
    TLS on
    TLS.verify on
  • Environment name and version (e.g. Kubernetes? What version?):
    • EKS; Kubernetes 1.28
    • kind, Kubernetes 1.31
  • Server type and version:
  • Operating System and version:
    • EKS on Amazon Linux 2, amd64
    • kind on macOS Sonoma 14.5 (M3)
  • Filters and plugins: tail, fluentbit_metrics, prometheus_exporter, splunk (local testing using stdout)

Additional context This started as routine maintenance to get us onto the latest release series (we were previously on the 2.1 series). Updating to 3.1.4 "checks the box", but we're now in an unfortunate predicament for future updates. I was convinced it was an environment-specific issue, but now that I can consistently reproduce it with kind, it felt appropriate to escalate in case others are facing the same problem.

diresqrl avatar Aug 26 '24 19:08 diresqrl

@diresqrl thanks for reporting the problem

I am looking at the changes to the components in v3.1.5, where it's not working for you, and we have this:

release v3.1.5

ff651842b in_tail: fix double-free on exception (CID 507963)

^ this is just a small fix for when a memory allocation fails; it should not be related.

Other unrelated changes:

bec603400 log_event_decoder: updated code to use aligned memory reads
7f037486e core: added aligned memory read functions
f2f6b1d80 core: added a byte order detection abstraction macro
f9e6def57 build: added an option to enforce memory alignment
68931d121 in_exec_wasi: Provide configurable stack and heap sizes for Wasm
03423cfa0 filter_wasm: Provide configurable heap and stack sizes for Wasm
3045fd6be wasm: Make configurable heap and stack sizes with a struct type
f54b370cd out_stackdriver: fix leak on exception (CID 508239)
e84d9ff94 out_kafka: add missing initialization (CID 507783)
54de999c6 in_forward: fix leak on exception (CID 507786)
1fc6eedfd in_emitter: fix use-after-free on exception (CID 507860)
b26d85b13 in_forward: fix leak on exception (CID 508219)
f5794a417 out_prometheus_exporter: Handle multiply concatenated metrics type of events (#9122)
13ea609e5 cmake: windows: Enable Kafka plugins on Windows
a6980efcf appveyor: Use vcpkg to install the latest OpenSSL
725640616 workflows: add sanity check for compilation using system libraries to pr-compile-check.yaml
999e9b837 build: use the system provided LuaJIT if found
c683d8e3c out_opensearch: fixed wrong payload buffer usage for traces
30b6522b1 restore --staged tests/internal/aws_util.c
9392bc112 lib: ctraces: upgrade to v0.5.4
c147d452b lib: cmetrics: upgrade to v0.9.3
56ff251d3 in_mqtt: added buffer size setting and fixed a leak (#9163)
a86dceed2 lib: cfl: upgrade to v0.5.2
d58f2336b workflows: update unstable nightly builds for 3.0 (#9168)
e19b2ab14 out_oracle_log_analytics: set NULL to prevent double free
3a37eb8f6 out_oracle_log_analytics: fix mk_list cleanup function
7ce4aa6a0 out_oracle_log_analytics: add flb_sds_destroy for key
209095d69 out_oracle_log_analytics: remove flb_errno that checks NULL
c15ca2fba test: internal: gzip: Add testcases for payloads of concatenated gzip
a4956ccba in_forward: Use extracted function for processing concatenated gzip
87ea26d65 gzip: Extract and unify code for concatenated gzip payloads
a6aac459d in_node_exporter_metrics: Align the collecting metrics of unit statuses (#9134)
3d4ad3173 filter_log_to_metrics: add new option discard_logs and code cleanup
8f0317f77 workflows: Fix CentOS7 build failure for EPEL (#9157)
bc0768600 in_kubernetes_events: add chunked streaming test
2929a3d46 in_kubernetes_events: fix end of chunked stream deadlock
57c04a5e0 build: libraries: update path to c-ares
6a293f7e7 lib: c-ares: ugprade to v1.32.3
f1d99ca27 out_s3: Plug memory leaks on gzipped buffer during the swapping contents
6df1f2bf7 workflows: bump ossf/scorecard-action from 2.3.3 to 2.4.0 (#9137)

Hmm, I don't have a clue what the problem could be. Are you able to provide Fluent Bit logs?

edsiper avatar Aug 27 '24 15:08 edsiper

I can confirm we have the same problem after upgrading from fluent-bit 3.1.2 to 3.1.6. I can also confirm that the issue started in 3.1.5.

If I remove the log_to_metrics filter it works again.

    [FILTER]
        name               log_to_metrics
        match              kube.*
        tag                log_counter_metric
        metric_mode        counter
        metric_name        kubernetes_messages
        metric_description This metric counts Kubernetes messages
        kubernetes_mode    true

@diresqrl do you also use this filter?

Maybe it has something to do with this new feature: [Log_To_Metrics (Filter)] Add new option discard_logs, https://github.com/fluent/fluent-bit/pull/9150/files. I even tried explicitly setting discard_logs to false, but without success.

reneeckstein avatar Aug 28 '24 19:08 reneeckstein

@reneeckstein After reviewing the way our configuration is being built, yes, we do have the log_to_metrics filter. I apparently missed that in my previous testing. I was also able to confirm that removing that filter does make log processing function again both in the kind/local environment and in our live k8s clusters.

diresqrl avatar Aug 29 '24 15:08 diresqrl

We experience the same. We were upgrading from 3.0.7 to 3.1.6, hoping it would resolve a small memory leak that caused our fluent-bit pods to crash over time. See this: https://github.com/fluent/fluent-bit/issues/9189

However, no logs were forwarded anymore from the tail input plugin; 3.1.4 does not have this forwarding issue. (We use the forward output plugin.) Note that we also use the log_to_metrics filter.

aligthart avatar Sep 04 '24 15:09 aligthart

As far as I can see (on one cluster), the issue seems to be solved in fluent-bit v3.1.7. Probably related to this one -> https://github.com/fluent/fluent-bit/pull/9252

This PR addresses an issue where an incorrect data type was used for the mode option, which caused the config map handler to overwrite adjacent fields (discard_logs at least) on 64-bit systems.

reneeckstein avatar Sep 04 '24 15:09 reneeckstein

I can confirm that the 3.1.7 update does fix this problem for me, at least in immediate testing.

diresqrl avatar Sep 04 '24 16:09 diresqrl

same here. 3.1.7 forwards logs again.

aligthart avatar Sep 04 '24 17:09 aligthart

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] avatar Jan 22 '25 02:01 github-actions[bot]

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar Feb 01 '25 02:02 github-actions[bot]