
Reading from Filesystem buffer not working as expected

Open ShahroZafar opened this issue 1 year ago • 3 comments

Bug Report

I have tail as input and kafka as output, with no filters in between, and filesystem buffering enabled in Fluent Bit. To test the buffering, I point the Kafka brokers endpoint at a non-existent URL, and the chunks are stored on the filesystem as expected. However, when I revert the broker endpoint to the correct one, the chunks are not read until Fluent Bit is restarted. Moreover, as soon as the broker endpoint is reverted, the input plugin goes into the paused state with no errors.

Some Fluent Bit pods restart automatically due to [engine] caught signal (SIGSEGV), and after the restart they start sending the chunks buffered on the filesystem.

To Reproduce

  • Create the Fluent Bit configuration file as follows:

    [SERVICE]
        Daemon Off
        Flush {{ .Values.flush }}
        Log_Level {{ .Values.logLevel }}
        Parsers_File /fluent-bit/etc/parsers.conf
        Parsers_File /fluent-bit/etc/conf/custom_parsers.conf
        HTTP_Server On
        HTTP_Listen 0.0.0.0
        HTTP_Port {{ .Values.metricsPort }}
        Health_Check On
        storage.path              /var/log/flb-storage/
        storage.sync              normal
        storage.checksum          off
        storage.max_chunks_up     20
        storage.backlog.mem_limit 100M

    [INPUT]
        Name tail
        Tag logs.*
        Path /var/log/containers/test-*.log
        DB /var/log/flb-storage/test.db
        DB.locking true
        Buffer_Chunk_Size 1M
        Buffer_Max_Size 1M
        Refresh_Interval 2
        Rotate_Wait 10
        multiline.parser cri
        storage.type filesystem

    [OUTPUT]
        Name kafka
        Match logs.*
        Brokers brokers.svc.cluster.local:9092
        Topics fluentbit
        Retry_Limit False
        queue_full_retries 0
        rdkafka.client.id fluentbit
        rdkafka.compression.type gzip
        rdkafka.request.required.acks 1
        rdkafka.batch.num.messages 8192
        rdkafka.linger.ms 100
        rdkafka.batch.size 1000000
  • Set the memory and CPU limits as follows:
resources:
  requests:
    cpu: 200m
    memory: 300Mi
  limits:
    memory: 300Mi
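To check whether chunks are actually accumulating in the filesystem buffer during the test, the built-in HTTP server can be queried. This is a quick sketch, assuming storage.metrics is turned on in the [SERVICE] section and the metrics port resolves to the default 2020:

    # storage layer metrics: chunks up/down and filesystem chunk counts per input
    curl -s http://127.0.0.1:2020/api/v1/storage

    # general pipeline metrics: records/bytes processed, retries, errors per plugin
    curl -s http://127.0.0.1:2020/api/v1/metrics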

Logs after reverting to the correct broker endpoint

[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff016ffdca0 1293 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff0288b0298 3 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff0288ac030 514 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff016ffbfe0 1109 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff0288ad4a8 643 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff0288b3ee8 332 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff0288aaeb0 446 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff0288b0950 44 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff0288ac648 551 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff016ffd688 1256 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff0288aaaa0 962 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff0288b0f90 84 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff0288b0928 45 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff0288ae290 734 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff016ffa140 915 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff0288afe38 832 in the next 1 seconds
[2024/04/16 06:41:50] [debug] [input:tail:tail.0] scanning path /var/log/containers/test-*.log
[2024/04/16 06:41:50] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/test-799f75dc8c-hj9c2_istio-ingress_istio-proxy-21ab260e422835d4edf770cbd611f8260aa889bcbfd711e95c7533fe9afcfdc3.log, inode 7428934
[2024/04/16 06:41:50] [debug] [input:tail:tail.0] 0 new files found on path '/var/log/containers/test-*.log'
[2024/04/16 06:41:50] [ info] [input] pausing tail.0
[2024/04/16 06:41:50] [ info] [input] pausing storage_backlog.1
[2024/04/16 06:41:50] [debug] [out flush] cb_destroy coro_id=8447
[2024/04/16 06:41:50] [debug] [out flush] cb_destroy coro_id=8448
[2024/04/16 06:41:50] [debug] [retry] re-using retry for task_id=1287 attempts=63
[2024/04/16 06:41:50] [ warn] [engine] failed to flush chunk '1-1713247149.336603650.flb', retry in 1 seconds: task_id=1287, input=tail.0 > output=kafka.0 (out_id=0)
[2024/04/16 06:41:50] [debug] [out flush] cb_destroy coro_id=8449
[2024/04/16 06:41:50] [debug] [retry] re-using retry for task_id=934 attempts=66
[2024/04/16 06:41:50] [ warn] [engine] failed to flush chunk '1-1713246912.337083377.flb', retry in 1 seconds: task_id=934, input=tail.0 > output=kafka.0 (out_id=0)
[2024/04/16 06:41:50] [debug] [out flush] cb_destroy coro_id=8450
[2024/04/16 06:41:50] [debug] [retry] re-using retry for task_id=1124 attempts=73
[2024/04/16 06:41:50] [ warn] [engine] failed to flush chunk '1-1713247068.243738741.flb', retry in 1 seconds: task_id=1124, input=tail.0 > output=kafka.0 (out_id=0)
[2024/04/16 06:41:50] [debug] [out flush] cb_destroy coro_id=8451
[2024/04/16 06:41:50] [debug] [out flush] cb_destroy coro_id=8452

Logs for pods that restarted automatically after reverting to the correct broker endpoint

[2024/04/16 06:00:43] [2024/04/16 06:00:43] [ warn] [output:kafka:kafka.0] message delivery failed: Local: Message timed out
[engine] caught signal (SIGSEGV)
[2024/04/16 06:00:43] [ warn] [output:kafka:kafka.0] message delivery failed: Local: Message timed out
[2024/04/16 06:00:43] [ warn] [output:kafka:kafka.0] message delivery failed: Local: Message timed out
[2024/04/16 06:00:43] [ warn] [output:kafka:kafka.0] message delivery failed: Local: Message timed out
[2024/04/16 06:00:43] [ warn] [output:kafka:kafka.0] message delivery failed: Local: Message timed out
[2024/04/16 06:00:43] [ warn] [output:kafka:kafka.0] message delivery failed: Local: Message timed out
[2024/04/16 06:00:43] [ warn] [output:kafka:kafka.0] message delivery failed: Local: Message timed out
[2024/04/16 06:00:43] [debug] [output:kafka:kafka.0] enqueued message (1491 bytes) for topic 'fluentbit-manual-ingress-general-dev-01'
#0  0x55ec729ef047      in  msgpack_unpack_next() at lib/msgpack-c/src/unpack.c:677
#1  0x55ec72540998      in  flb_log_event_decoder_next() at src/flb_log_event_decoder.c:298
#2  0x55ec725da132      in  cb_kafka_flush() at plugins/out_kafka/kafka.c:496
#3  0x55ec72a09d26      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117

Expected behavior

Fluent Bit should not pause the input plugins; it should read the buffered chunks from the filesystem and push them to the output plugin.

Screenshots: five images attached to the original issue.

Your Environment

  • Version used: 3.0.2
  • Environment name and version: Kubernetes v1.23.17-eks-508b6b3
  • Filters and plugins: tail input and kafka output

Additional context: we are trying to push Istio ingress logs, written to a file, to Kafka using the tail input of Fluent Bit.

ShahroZafar avatar Apr 16 '24 06:04 ShahroZafar

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] avatar Jul 16 '24 01:07 github-actions[bot]

Hi, we experienced the same issue. My understanding is this: your storage.max_chunks_up is quite low. Chunks are 2 MB each, so 20 chunks up means at most 40 MB of chunk memory. You defined storage.backlog.mem_limit 100M, so the backlog will only start to flush when 100 MB of that 40 MB budget is free, which will obviously never happen. Please try decreasing storage.backlog.mem_limit to 5M (the default) or increasing storage.max_chunks_up to allow Fluent Bit to use more memory.
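For illustration only (values are placeholders, not a tested config), either of these [SERVICE] adjustments keeps the backlog limit consistent with what storage.max_chunks_up allows:

    [SERVICE]
        # Option A: drop the backlog limit back to the default
        storage.max_chunks_up     20
        storage.backlog.mem_limit 5M

        # Option B: raise max_chunks_up so a 100M backlog budget can actually be used
        # storage.max_chunks_up     128
        # storage.backlog.mem_limit 100M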

Let me know if this helps.

I would like to mention the issue we are facing as well: https://github.com/fluent/fluent-bit/issues/9115

RicharddeJong avatar Jul 20 '24 06:07 RicharddeJong

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] avatar Oct 20 '24 02:10 github-actions[bot]

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar Oct 26 '24 02:10 github-actions[bot]