Reading from Filesystem buffer not working as expected
Bug Report
I have tail as input and Kafka as output, with no filters in between. Fluent Bit has filesystem buffering enabled. To test the buffering, I update the Kafka brokers endpoint to a non-existent URL, and the chunks are stored on the filesystem as expected. However, when I revert the broker endpoint to the correct one, the chunks are not read until Fluent Bit is restarted. Moreover, as soon as the broker endpoint is reverted, the input plugin goes into a paused state with no errors.
Some Fluent Bit pods get restarted automatically due to [engine] caught signal (SIGSEGV), and after the restart they start sending the chunks stored on the filesystem.
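For clarity, the broker flip used in the test is just a change to the Brokers value of the Kafka output (the full configuration is in the reproduction steps below); the unreachable address here is only an illustration, not the real endpoint:

[OUTPUT]
    Name    kafka
    Match   logs.*
    # hypothetical unreachable endpoint, used only to force chunks onto the
    # filesystem buffer; reverted to the real brokers afterwards
    Brokers unreachable-brokers.invalid:9092
    Topics  fluentbit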
To Reproduce
Create the Fluent Bit configuration file as follows:
[SERVICE]
    Daemon Off
    Flush {{ .Values.flush }}
    Log_Level {{ .Values.logLevel }}
    Parsers_File /fluent-bit/etc/parsers.conf
    Parsers_File /fluent-bit/etc/conf/custom_parsers.conf
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port {{ .Values.metricsPort }}
    Health_Check On
    storage.path /var/log/flb-storage/
    storage.sync normal
    storage.checksum off
    storage.max_chunks_up 20
    storage.backlog.mem_limit 100M

[INPUT]
    Name tail
    Tag logs.*
    Path /var/log/containers/test-*.log
    DB /var/log/flb-storage/test.db
    DB.locking true
    Buffer_Chunk_Size 1M
    Buffer_Max_Size 1M
    Refresh_Interval 2
    Rotate_Wait 10
    multiline.parser cri
    storage.type filesystem

[OUTPUT]
    Name kafka
    Match logs.*
    Brokers brokers.svc.cluster.local:9092
    Topics fluentbit
    Retry_Limit False
    queue_full_retries 0
    rdkafka.client.id fluentbit
    rdkafka.compression.type gzip
    rdkafka.request.required.acks 1
    rdkafka.batch.num.messages 8192
    rdkafka.linger.ms 100
    rdkafka.batch.size 1000000
- Set the memory and CPU limits as follows:
resources:
  requests:
    cpu: 200m
    memory: 300Mi
  limits:
    memory: 300Mi
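As a side note beyond the original report: with HTTP_Server already on, enabling storage metrics makes it possible to watch how many chunks are backed up on the filesystem while the broker endpoint is broken and after it is reverted. A minimal sketch, assuming the SERVICE section above (storage.metrics is off by default):

[SERVICE]
    # exposes the storage-layer chunk counters via GET /api/v1/storage
    # on the HTTP_Port configured above
    storage.metrics on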
Logs after reverting to the correct broker endpoint
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff016ffdca0 1293 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff0288b0298 3 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff0288ac030 514 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff016ffbfe0 1109 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff0288ad4a8 643 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff0288b3ee8 332 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff0288aaeb0 446 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff0288b0950 44 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff0288ac648 551 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff016ffd688 1256 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff0288aaaa0 962 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff0288b0f90 84 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff0288b0928 45 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff0288ae290 734 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff016ffa140 915 in the next 1 seconds
[2024/04/16 06:41:50] [ info] [task] re-schedule retry=0x7ff0288afe38 832 in the next 1 seconds
[2024/04/16 06:41:50] [debug] [input:tail:tail.0] scanning path /var/log/containers/test-*.log
[2024/04/16 06:41:50] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/test-799f75dc8c-hj9c2_istio-ingress_istio-proxy-21ab260e422835d4edf770cbd611f8260aa889bcbfd711e95c7533fe9afcfdc3.log, inode 7428934
[2024/04/16 06:41:50] [debug] [input:tail:tail.0] 0 new files found on path '/var/log/containers/test-*.log'
[2024/04/16 06:41:50] [ info] [input] pausing tail.0
[2024/04/16 06:41:50] [ info] [input] pausing storage_backlog.1
[2024/04/16 06:41:50] [debug] [out flush] cb_destroy coro_id=8447
[2024/04/16 06:41:50] [debug] [out flush] cb_destroy coro_id=8448
[2024/04/16 06:41:50] [debug] [retry] re-using retry for task_id=1287 attempts=63
[2024/04/16 06:41:50] [ warn] [engine] failed to flush chunk '1-1713247149.336603650.flb', retry in 1 seconds: task_id=1287, input=tail.0 > output=kafka.0 (out_id=0)
[2024/04/16 06:41:50] [debug] [out flush] cb_destroy coro_id=8449
[2024/04/16 06:41:50] [debug] [retry] re-using retry for task_id=934 attempts=66
[2024/04/16 06:41:50] [ warn] [engine] failed to flush chunk '1-1713246912.337083377.flb', retry in 1 seconds: task_id=934, input=tail.0 > output=kafka.0 (out_id=0)
[2024/04/16 06:41:50] [debug] [out flush] cb_destroy coro_id=8450
[2024/04/16 06:41:50] [debug] [retry] re-using retry for task_id=1124 attempts=73
[2024/04/16 06:41:50] [ warn] [engine] failed to flush chunk '1-1713247068.243738741.flb', retry in 1 seconds: task_id=1124, input=tail.0 > output=kafka.0 (out_id=0)
[2024/04/16 06:41:50] [debug] [out flush] cb_destroy coro_id=8451
[2024/04/16 06:41:50] [debug] [out flush] cb_destroy coro_id=8452
Logs for pods that restarted automatically after reverting to the correct broker endpoint
[2024/04/16 06:00:43] [2024/04/16 06:00:43] [ warn] [output:kafka:kafka.0] message delivery failed: Local: Message timed out
[engine] caught signal (SIGSEGV)
[2024/04/16 06:00:43] [ warn] [output:kafka:kafka.0] message delivery failed: Local: Message timed out
[2024/04/16 06:00:43] [ warn] [output:kafka:kafka.0] message delivery failed: Local: Message timed out
[2024/04/16 06:00:43] [ warn] [output:kafka:kafka.0] message delivery failed: Local: Message timed out
[2024/04/16 06:00:43] [ warn] [output:kafka:kafka.0] message delivery failed: Local: Message timed out
[2024/04/16 06:00:43] [ warn] [output:kafka:kafka.0] message delivery failed: Local: Message timed out
[2024/04/16 06:00:43] [ warn] [output:kafka:kafka.0] message delivery failed: Local: Message timed out
[2024/04/16 06:00:43] [debug] [output:kafka:kafka.0] enqueued message (1491 bytes) for topic 'fluentbit-manual-ingress-general-dev-01'
#0 0x55ec729ef047 in msgpack_unpack_next() at lib/msgpack-c/src/unpack.c:677
#1 0x55ec72540998 in flb_log_event_decoder_next() at src/flb_log_event_decoder.c:298
#2 0x55ec725da132 in cb_kafka_flush() at plugins/out_kafka/kafka.c:496
#3 0x55ec72a09d26 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
Expected behavior
Fluent Bit should not pause the input plugins; it should read the chunks from the filesystem and push them to the output plugin.
Your Environment
- Version used: 3.0.2
- Environment name and version: Kubernetes v1.23.17-eks-508b6b3
- Filters and plugins: Tail input and kafka output
Additional context
We are trying to push Istio ingress logs, which are written to a file, to Kafka using the tail input of Fluent Bit.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
Hi, we experienced the same issue. My understanding is this: your storage.max_chunks_up is quite low. Chunks are 2 MB, so 20 chunks up is at most 40 MB of memory. You defined storage.backlog.mem_limit 100M, so the backlog will only start to flush when 100 MB of that 40 MB is free, which will obviously never happen. Please try decreasing storage.backlog.mem_limit to 5M (the default) or increasing storage.max_chunks_up to allow Fluent Bit to use more memory.
Let me know if this helps.
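A minimal sketch of the adjustment suggested above, with illustrative values rather than a tested recommendation:

[SERVICE]
    # either lower the backlog limit so it fits inside the ~40 MB that
    # 20 chunks x 2 MB can occupy in memory...
    storage.backlog.mem_limit 5M
    storage.max_chunks_up     20

    # ...or keep the 100M limit and raise max_chunks_up enough that more
    # than 100 MB of chunks can be up at once (mind the 300Mi pod limit):
    # storage.backlog.mem_limit 100M
    # storage.max_chunks_up     100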
I would like to mention the issue we are facing as well: https://github.com/fluent/fluent-bit/issues/9115
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.