
Losing logs from fluent-bit to fluentd cluster during brief outages on the fluentd cluster

Open dashe-ops opened this issue 1 year ago • 2 comments

Describe the bug

Hi,

As a log-shipping solution, we are using fluent-bit on client VMs and sending logs to a 3-node fluentd cluster.

If I stop all 3 nodes in the fluentd cluster at the same time, for example for 2 minutes, and then restart fluentd on all 3 nodes, roughly 60 seconds of logs from the 2-minute offline period are missing when I check the shipped logs.

To Reproduce

Write a simple test log, printing the date every second:

while sleep 1; do date; done > /tmp/test.log

1. On the fluentd cluster, stop all 3 nodes at the same time.
2. Untar the shipped logfile and note the timestamp of the last log line.
3. Wait 2 minutes, then restart fluentd on all 3 nodes.
4. Wait for a new logfile from the client to appear, then untar it and read the first few lines.

If the buffers worked as we expect, there should be no lost data. In practice, data is lost every time:

tail -5 ie1-abc01b-nxt.nxt.test-test_20230323_02a.log
Thu Mar 23 15:56:51 UTC 2023
Thu Mar 23 15:56:52 UTC 2023
Thu Mar 23 15:56:53 UTC 2023
Thu Mar 23 15:56:54 UTC 2023
Thu Mar 23 15:56:55 UTC 2023

head ie1-abc01b-nxt.nxt.test-test_20230323_03a.log
Thu Mar 23 15:57:57 UTC 2023
Thu Mar 23 15:57:58 UTC 2023
Thu Mar 23 15:57:59 UTC 2023
Thu Mar 23 15:58:00 UTC 2023
Thu Mar 23 15:58:01 UTC 2023

In the above example we've lost just over a minute of data (the gap runs from 15:56:55 to 15:57:57).
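The size of the gap can be checked mechanically rather than by eye. A minimal sketch, assuming GNU date and the two timestamps shown in the tail/head output above:

```shell
#!/bin/sh
# Last line of the file written before the outage and first line of the
# file written after the restart (copied from the output above).
last="Thu Mar 23 15:56:55 UTC 2023"
first="Thu Mar 23 15:57:57 UTC 2023"

# The test log emits one line per second, so any gap noticeably larger
# than 1 second means lines were lost.
gap=$(( $(date -ud "$first" +%s) - $(date -ud "$last" +%s) ))
echo "gap: ${gap}s"
# -> gap: 62s
```

Since the generator writes one line per second, a 62-second gap corresponds to roughly 60 lost log lines.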

Expected behavior

If the buffers worked as we expect, there should be no lost data.

Your Environment

- Fluentd version: 4.4.2
- TD Agent version: 2.0.6
- Operating system: CentOS 7
- Kernel version: 3.10.0-1160.83.1.el7.x86_64

Your Configuration

client fluent-bit configuration:
logship-fluent-bit.conf

[SERVICE]
    # Flush
    # =====
    # Set an interval in seconds before flushing records to a destination
    flush        30

[INPUT]
    name            tail
    path            /tmp/test.log
    path_key        log_file
    tag             i2.2y.default.sgb.${HOSTNAME}.<filename>
    tag_regex       (\/.*\/)(?<filename>.+)
    storage.type    memory
    DB              /var/log/logship/buffer/tail-0.db
    DB.locking      true
    DB.journal_mode WAL

[OUTPUT]
    name     forward
    match    *
    host     ie1-logship-nxt.nxt.endpoint
    port     80
    compress gzip

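For reference, the config above keeps pending data in memory and relies on fluent-bit's default retry behavior. A hedged sketch of a hardened variant (the `storage.path` location is an assumption; `storage.type filesystem`, `require_ack_response`, and `retry_limit` are standard fluent-bit options):

```
[SERVICE]
    flush        30
    # Persist buffered chunks to disk so they survive restarts
    # (path is an assumption; any writable directory works)
    storage.path /var/log/logship/storage

[INPUT]
    name         tail
    path         /tmp/test.log
    # filesystem instead of memory, so pending data is not RAM-only
    storage.type filesystem

[OUTPUT]
    name                 forward
    match                *
    host                 ie1-logship-nxt.nxt.endpoint
    port                 80
    compress             gzip
    # Wait for fluentd's ack before treating a chunk as delivered,
    # so chunks are retried instead of silently dropped
    require_ack_response true
    # retry forever instead of discarding after the default attempts
    retry_limit          false
```

Whether this changes the observed loss window would need to be verified against the reproduction steps above.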

And on the fluentd cluster, the config:
<system>
    workers 1
    rpc_endpoint 0.0.0.0:24724
</system>

<source>
    @type forward
    port 24224
    @id forward
</source>

<match i2**>
    @type file
    @id file
    compress gzip
    path /data/${tag[1]}/%Y/%m/%d/${tag[2]}/${tag[3]}/${tag[4]}.${tag[5]}.${tag[6]}-${tag[7]}_%Y%m%d_03a
    append
    <buffer tag,time>
        @type memory
        flush_thread_count 8
        chunk_limit_size 8M
        queue_limit_length 64
        retry_max_interval 30
        retry_max_times 1000
        flush_mode interval
        flush_interval 30s
    </buffer>
    <format>
        @type single_value
        message_key log
    </format>
</match>

<source>
    @type monitor_agent
    bind 0.0.0.0
    port 24220
    @id monitor_agent
</source>
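Note that the `<buffer>` section in the `<match>` block above uses `@type memory`, so buffered chunks live only in fluentd's RAM. For comparison, a hedged sketch of a file-backed variant of just that buffer section (the buffer `path` below is an assumption):

```
<buffer tag,time>
    # file instead of memory: queued chunks are written to disk and
    # survive a fluentd restart (path is an assumption)
    @type file
    path /data/fluentd-buffer
    flush_thread_count 8
    chunk_limit_size 8M
    queue_limit_length 64
    retry_max_interval 30
    retry_max_times 1000
    flush_mode interval
    flush_interval 30s
</buffer>
```

This only protects data that has already reached fluentd; data in flight or still buffered on the fluent-bit side is a separate concern.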

Your Error Log

No errors in the logs, just missing data.

Additional context

No response

dashe-ops · Mar 23 '23 16:03