
filter: aws metadata group retries

Open · mwarzynski opened this pull request 2 years ago • 16 comments

This PR adds handling of errors received while fetching metadata groups in the AWS filter. It also allows defining the retry interval for when such a scenario occurs.

retry_interval_s   Defines the minimum duration (in seconds) between retries for fetching metadata groups
                   > default: 300, type: integer

Our current behaviour is to not inject any AWS metadata if fetching even one piece of information fails. This might not be the best approach, as most likely we would like to inject as much as we can. In other words, it is better to have some additional AWS labels, if we can fetch them, rather than nothing.

For example, on current master, if the Instance Metadata Service misbehaves and our implementation fails to fetch the EC2 tags, then the output logs won't contain any AWS metadata labels, not even basic ones like the EC2 instance ID. Logs pushed in such a way may be effectively useless, as the system won't be able to attribute them to a particular instance.

The changes proposed in this PR affect how a failure to fetch any of the metadata groups is handled. For example, if fetching the EC2 tags fails, the other AWS metadata is still injected into the logs, provided the AWS filter is able to fetch it without errors. Retries for fetching the failed metadata groups happen at an interval no smaller than the specified retry_interval_s.
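For illustration only, the gating can be thought of as a small time check per metadata group; the names below (metadata_group, should_retry) are hypothetical sketches, not the PR's actual code:

/* Hypothetical sketch: retry a failed metadata group only after
 * retry_interval_s seconds have passed since the last attempt. */
#include <time.h>

struct metadata_group {
    int    done;          /* fetched successfully (or skipped, e.g. on 404) */
    time_t last_attempt;  /* unix time of the last fetch attempt */
};

static int should_retry(struct metadata_group *group, int retry_interval_s)
{
    time_t now = time(NULL);

    if (group->done) {
        return 0;   /* nothing left to fetch */
    }
    if (now - group->last_attempt < retry_interval_s) {
        return 0;   /* too soon; wait until the interval has elapsed */
    }
    group->last_attempt = now;
    return 1;       /* attempt the fetch again */
}

Since the filter only runs when records pass through it, the effective gap between retries can be longer than retry_interval_s if no records arrive; the option defines a minimum, not an exact schedule.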

PR is related to the following comment: https://github.com/fluent/fluent-bit/pull/7129#issuecomment-1500736721.


IIRC from my testing, you already handle the tag server 404 (tags not enabled) gracefully? But not any other case.

Yes, we handle 404 errors gracefully. A 404 has no effect on the other AWS metadata: in that case, we just mark the tags as fetched and proceed (i.e. it is treated the same as an instance with 0 tags).
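In sketch form (illustrative names only, not the filter's actual code), the status handling amounts to:

/* Illustrative sketch: a 404 means tagging is disabled or the instance
 * has zero tags, so tags are marked as fetched and never retried, while
 * other failures (e.g. a 500) are retried after retry_interval_s. */
enum tag_fetch_result { TAGS_OK, TAGS_NONE, TAGS_RETRY };

static enum tag_fetch_result classify_tag_response(int http_status)
{
    if (http_status == 200) {
        return TAGS_OK;      /* parse and inject the tags */
    }
    if (http_status == 404) {
        return TAGS_NONE;    /* treat as zero tags; do not retry */
    }
    return TAGS_RETRY;       /* retryable error; other metadata unaffected */
}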


Example configuration

[SERVICE]
    flush        1
    daemon       Off
    log_level    debug


[INPUT]
    name stdin

[FILTER]
    Name aws
    imds_version v1
    Match *
    ec2_instance_id true
    tags_enabled true
    retry_interval_s 5

[OUTPUT]
    name  stdout
    match *

Debug log output

Fluent Bit v2.1.5
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2023/06/20 23:11:23] [ info] Configuration:
[2023/06/20 23:11:23] [ info]  flush time     | 1.000000 seconds
[2023/06/20 23:11:23] [ info]  grace          | 5 seconds
[2023/06/20 23:11:23] [ info]  daemon         | 0
[2023/06/20 23:11:23] [ info] ___________
[2023/06/20 23:11:23] [ info]  inputs:
[2023/06/20 23:11:23] [ info]      stdin
[2023/06/20 23:11:23] [ info] ___________
[2023/06/20 23:11:23] [ info]  filters:
[2023/06/20 23:11:23] [ info]      aws.0
[2023/06/20 23:11:23] [ info] ___________
[2023/06/20 23:11:23] [ info]  outputs:
[2023/06/20 23:11:23] [ info]      stdout.0
[2023/06/20 23:11:23] [ info] ___________
[2023/06/20 23:11:23] [ info]  collectors:
[2023/06/20 23:11:23] [ info] [fluent bit] version=2.1.5, commit=1f9e707217, pid=2641
[2023/06/20 23:11:23] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2023/06/20 23:11:23] [ info] [storage] ver=1.4.0, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2023/06/20 23:11:23] [ info] [cmetrics] version=0.6.2
[2023/06/20 23:11:23] [ info] [ctraces ] version=0.3.1
[2023/06/20 23:11:23] [ info] [input:stdin:stdin.0] initializing
[2023/06/20 23:11:23] [ info] [input:stdin:stdin.0] storage_strategy='memory' (memory only)
[2023/06/20 23:11:23] [debug] [stdin:stdin.0] created event channels: read=21 write=22
[2023/06/20 23:11:23] [debug] [input:stdin:stdin.0] buf_size=16000
[2023/06/20 23:11:23] [debug] [imds] using IMDSv1
[2023/06/20 23:11:23] [debug] [http_client] not using http_proxy for header
[2023/06/20 23:11:23] [debug] [imds] using IMDSv1
[2023/06/20 23:11:23] [debug] [http_client] not using http_proxy for header
[2023/06/20 23:11:23] [debug] [imds] using IMDSv1
[2023/06/20 23:11:23] [debug] [http_client] not using http_proxy for header
[2023/06/20 23:11:23] [debug] [aws_client] (null): http_do=0, HTTP Status: 500
[2023/06/20 23:11:23] [error] [filter:aws:aws.0] Failed to get instance EC2 Tags
[2023/06/20 23:11:23] [debug] [stdout:stdout.0] created event channels: read=24 write=25
[2023/06/20 23:11:23] [ info] [sp] stream processor started
[2023/06/20 23:11:23] [ info] [output:stdout:stdout.0] worker #0 started
[2023/06/20 23:11:23] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:24] [debug] [task] created task=0x7f6a68024ad0 id=0 OK
[2023/06/20 23:11:24] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302683.893142993, {}], {"log"=>"emfavbsxim", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:24] [debug] [out flush] cb_destroy coro_id=0
[2023/06/20 23:11:24] [debug] [task] destroy task=0x7f6a68024ad0 (task_id=0)
[2023/06/20 23:11:24] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:24] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:25] [debug] [task] created task=0x7f6a68028e60 id=0 OK
[2023/06/20 23:11:25] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302684.374381047, {}], {"log"=>"wowducooxe", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302684.875083562, {}], {"log"=>"yudtfbsgvl", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:25] [debug] [out flush] cb_destroy coro_id=1
[2023/06/20 23:11:25] [debug] [task] destroy task=0x7f6a68028e60 (task_id=0)
[2023/06/20 23:11:25] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:25] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:26] [debug] [task] created task=0x7f6a68029180 id=0 OK
[2023/06/20 23:11:26] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302685.375792267, {}], {"log"=>"lsniojweym", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302685.876517552, {}], {"log"=>"wfwobquqbo", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:26] [debug] [out flush] cb_destroy coro_id=2
[2023/06/20 23:11:26] [debug] [task] destroy task=0x7f6a68029180 (task_id=0)
[2023/06/20 23:11:26] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:26] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:27] [debug] [task] created task=0x7f6a68014620 id=0 OK
[2023/06/20 23:11:27] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302686.377204802, {}], {"log"=>"bqyshdcqyf", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302686.877946820, {}], {"log"=>"jfdkamekqc", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:27] [debug] [out flush] cb_destroy coro_id=3
[2023/06/20 23:11:27] [debug] [task] destroy task=0x7f6a68014620 (task_id=0)
[2023/06/20 23:11:27] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:27] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:28] [debug] [task] created task=0x7f6a68014750 id=0 OK
[2023/06/20 23:11:28] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302687.378669569, {}], {"log"=>"wnsxlhteid", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302687.879385922, {}], {"log"=>"jwpgcxkjtt", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:28] [debug] [out flush] cb_destroy coro_id=4
[2023/06/20 23:11:28] [debug] [task] destroy task=0x7f6a68014750 (task_id=0)
[2023/06/20 23:11:28] [debug] [imds] using IMDSv1
[2023/06/20 23:11:28] [debug] [http_client] not using http_proxy for header
[2023/06/20 23:11:28] [debug] [aws_client] (null): http_do=0, HTTP Status: 500
[2023/06/20 23:11:28] [error] [filter:aws:aws.0] Failed to get instance EC2 Tags
[2023/06/20 23:11:28] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:28] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:29] [debug] [task] created task=0x7f6a68015500 id=0 OK
[2023/06/20 23:11:29] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302688.380092365, {}], {"log"=>"gqshkfxqkf", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302688.880844758, {}], {"log"=>"ezlfgynfls", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:29] [debug] [out flush] cb_destroy coro_id=5
[2023/06/20 23:11:29] [debug] [task] destroy task=0x7f6a68015500 (task_id=0)
[2023/06/20 23:11:29] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:29] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:30] [debug] [task] created task=0x7f6a68015600 id=0 OK
[2023/06/20 23:11:30] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302689.381531701, {}], {"log"=>"yuvjowigrg", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302689.882267085, {}], {"log"=>"fsicnaqjmj", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:30] [debug] [out flush] cb_destroy coro_id=6
[2023/06/20 23:11:30] [debug] [task] destroy task=0x7f6a68015600 (task_id=0)
[2023/06/20 23:11:30] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:30] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:31] [debug] [task] created task=0x7f6a68015770 id=0 OK
[2023/06/20 23:11:31] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302690.382976576, {}], {"log"=>"uuvrsyqjez", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302690.883703719, {}], {"log"=>"skpdkpcwoz", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:31] [debug] [out flush] cb_destroy coro_id=7
[2023/06/20 23:11:31] [debug] [task] destroy task=0x7f6a68015770 (task_id=0)
[2023/06/20 23:11:31] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:31] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:32] [debug] [task] created task=0x7f6a68015770 id=0 OK
[2023/06/20 23:11:32] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302691.384421041, {}], {"log"=>"umrnngyscm", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302691.885152211, {}], {"log"=>"wjmlmvhzen", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:32] [debug] [out flush] cb_destroy coro_id=8
[2023/06/20 23:11:32] [debug] [task] destroy task=0x7f6a68015770 (task_id=0)
[2023/06/20 23:11:32] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:32] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[0] stdin.0: [[1687302692.385868793, {}], {"log"=>"ufmlfusukc", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:33] [debug] [task] created task=0x7f6a68015770 id=0 OK
[1] stdin.0: [[1687302692.886602452, {}], {"log"=>"gjipkoiuej", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:33] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[2023/06/20 23:11:33] [debug] [out flush] cb_destroy coro_id=9
[2023/06/20 23:11:33] [debug] [task] destroy task=0x7f6a68015770 (task_id=0)
[2023/06/20 23:11:33] [debug] [imds] using IMDSv1
[2023/06/20 23:11:33] [debug] [http_client] not using http_proxy for header
[2023/06/20 23:11:33] [debug] [aws_client] (null): http_do=0, HTTP Status: 500
[2023/06/20 23:11:33] [error] [filter:aws:aws.0] Failed to get instance EC2 Tags
[2023/06/20 23:11:33] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:33] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:34] [debug] [task] created task=0x7f6a680158c0 id=0 OK
[2023/06/20 23:11:34] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302693.387297097, {}], {"log"=>"mivzwiavcb", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302693.887836380, {}], {"log"=>"lgaxrstavl", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:34] [debug] [out flush] cb_destroy coro_id=10
[2023/06/20 23:11:34] [debug] [task] destroy task=0x7f6a680158c0 (task_id=0)
[2023/06/20 23:11:34] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:34] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:35] [debug] [task] created task=0x7f6a68014ae0 id=0 OK
[2023/06/20 23:11:35] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302694.388522201, {}], {"log"=>"omghhpnwkh", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302694.889260971, {}], {"log"=>"lvwjoyyiiu", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:35] [debug] [out flush] cb_destroy coro_id=11
[2023/06/20 23:11:35] [debug] [task] destroy task=0x7f6a68014ae0 (task_id=0)
[2023/06/20 23:11:35] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:35] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:36] [debug] [task] created task=0x7f6a68014ae0 id=0 OK
[2023/06/20 23:11:36] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302695.389975260, {}], {"log"=>"jhqldnhpsc", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302695.890727701, {}], {"log"=>"rbccjnnsuy", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:36] [debug] [out flush] cb_destroy coro_id=12
[2023/06/20 23:11:36] [debug] [task] destroy task=0x7f6a68014ae0 (task_id=0)
[2023/06/20 23:11:36] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:36] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:37] [debug] [task] created task=0x7f6a68014ae0 id=0 OK
[2023/06/20 23:11:37] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302696.391431990, {}], {"log"=>"tocfeqeimb", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302696.892173635, {}], {"log"=>"ioidqivahk", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:37] [debug] [out flush] cb_destroy coro_id=13
[2023/06/20 23:11:37] [debug] [task] destroy task=0x7f6a68014ae0 (task_id=0)
[2023/06/20 23:11:37] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:37] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:38] [debug] [task] created task=0x7f6a68014ae0 id=0 OK
[2023/06/20 23:11:38] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302697.392976327, {}], {"log"=>"mfkdikyeqk", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302697.893695426, {}], {"log"=>"bfntmmyyfl", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:38] [debug] [out flush] cb_destroy coro_id=14
[2023/06/20 23:11:38] [debug] [task] destroy task=0x7f6a68014ae0 (task_id=0)
[2023/06/20 23:11:38] [debug] [imds] using IMDSv1
[2023/06/20 23:11:38] [debug] [http_client] not using http_proxy for header
[2023/06/20 23:11:38] [debug] [aws_client] (null): http_do=0, HTTP Status: 500
[2023/06/20 23:11:38] [error] [filter:aws:aws.0] Failed to get instance EC2 Tags
[2023/06/20 23:11:38] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:38] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:39] [debug] [task] created task=0x7f6a680146a0 id=0 OK
[2023/06/20 23:11:39] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302698.394444930, {}], {"log"=>"xvkyupyiaf", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302698.895156811, {}], {"log"=>"tyhmasfucu", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:39] [debug] [out flush] cb_destroy coro_id=15
[2023/06/20 23:11:39] [debug] [task] destroy task=0x7f6a680146a0 (task_id=0)
[2023/06/20 23:11:39] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:39] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:40] [debug] [task] created task=0x7f6a68014e20 id=0 OK
[2023/06/20 23:11:40] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302699.395877353, {}], {"log"=>"smmaqjfvvb", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302699.896595048, {}], {"log"=>"rfapheuxnw", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:40] [debug] [out flush] cb_destroy coro_id=16
[2023/06/20 23:11:40] [debug] [task] destroy task=0x7f6a68014e20 (task_id=0)
[2023/06/20 23:11:40] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:40] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:41] [debug] [task] created task=0x7f6a68014e20 id=0 OK
[2023/06/20 23:11:41] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302700.397177930, {}], {"log"=>"csivjjqtbm", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302700.897929948, {}], {"log"=>"cqbwanmvbo", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:41] [debug] [out flush] cb_destroy coro_id=17
[2023/06/20 23:11:41] [debug] [task] destroy task=0x7f6a68014e20 (task_id=0)
[2023/06/20 23:11:41] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:41] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:42] [debug] [task] created task=0x7f6a68015670 id=0 OK
[0] stdin.0: [[1687302701.398645419, {}], {"log"=>"qrgxcwuvdj", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:42] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[1] stdin.0: [[1687302701.899367671, {}], {"log"=>"jdenrvbrdg", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:42] [debug] [out flush] cb_destroy coro_id=18
[2023/06/20 23:11:42] [debug] [task] destroy task=0x7f6a68015670 (task_id=0)
[2023/06/20 23:11:42] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:42] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:43] [debug] [task] created task=0x7f6a68014db0 id=0 OK
[2023/06/20 23:11:43] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302702.400086133, {}], {"log"=>"xzhrixezzn", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302702.900802870, {}], {"log"=>"tzyohbknky", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:43] [debug] [out flush] cb_destroy coro_id=19
[2023/06/20 23:11:43] [debug] [task] destroy task=0x7f6a68014db0 (task_id=0)
[2023/06/20 23:11:43] [debug] [imds] using IMDSv1
[2023/06/20 23:11:43] [debug] [http_client] not using http_proxy for header
[2023/06/20 23:11:43] [debug] [aws_client] (null): http_do=0, HTTP Status: 500
[2023/06/20 23:11:43] [error] [filter:aws:aws.0] Failed to get instance EC2 Tags
[2023/06/20 23:11:43] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:43] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:44] [debug] [task] created task=0x7f6a68015110 id=0 OK
[2023/06/20 23:11:44] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302703.401526941, {}], {"log"=>"qnszuutxyx", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302703.902202719, {}], {"log"=>"zdhiiabprw", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:44] [debug] [out flush] cb_destroy coro_id=20
[2023/06/20 23:11:44] [debug] [task] destroy task=0x7f6a68015110 (task_id=0)
[2023/06/20 23:11:44] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:44] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:45] [debug] [task] created task=0x7f6a68015110 id=0 OK
[2023/06/20 23:11:45] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302704.402955918, {}], {"log"=>"fkcsnqpovj", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302704.903687318, {}], {"log"=>"lgvmzbxyqh", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:45] [debug] [out flush] cb_destroy coro_id=21
[2023/06/20 23:11:45] [debug] [task] destroy task=0x7f6a68015110 (task_id=0)
[2023/06/20 23:11:45] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:45] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:46] [debug] [task] created task=0x7f6a68015110 id=0 OK
[2023/06/20 23:11:46] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302705.404424512, {}], {"log"=>"qrughiphru", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302705.905138969, {}], {"log"=>"qblwbyarlz", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:46] [debug] [out flush] cb_destroy coro_id=22
[2023/06/20 23:11:46] [debug] [task] destroy task=0x7f6a68015110 (task_id=0)
[2023/06/20 23:11:46] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:46] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:47] [debug] [task] created task=0x7f6a68015110 id=0 OK
[2023/06/20 23:11:47] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302706.405846141, {}], {"log"=>"ibnynivvtl", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302706.906569923, {}], {"log"=>"kxnkkusfuc", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:47] [debug] [out flush] cb_destroy coro_id=23
[2023/06/20 23:11:47] [debug] [task] destroy task=0x7f6a68015110 (task_id=0)
[2023/06/20 23:11:47] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:47] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:48] [debug] [task] created task=0x7f6a68015110 id=0 OK
[2023/06/20 23:11:48] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302707.407249328, {}], {"log"=>"didbtnzbzi", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302707.907962637, {}], {"log"=>"zwjgiobcwl", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:48] [debug] [out flush] cb_destroy coro_id=24
[2023/06/20 23:11:48] [debug] [task] destroy task=0x7f6a68015110 (task_id=0)
[2023/06/20 23:11:48] [debug] [imds] using IMDSv1
[2023/06/20 23:11:48] [debug] [http_client] not using http_proxy for header
[2023/06/20 23:11:48] [debug] [aws_client] (null): http_do=0, HTTP Status: 500
[2023/06/20 23:11:48] [error] [filter:aws:aws.0] Failed to get instance EC2 Tags
[2023/06/20 23:11:48] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:48] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:49] [debug] [task] created task=0x7f6a68015110 id=0 OK
[2023/06/20 23:11:49] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302708.408652963, {}], {"log"=>"wxuaxrdqlv", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302708.909365672, {}], {"log"=>"amjlhcahcd", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:49] [debug] [out flush] cb_destroy coro_id=25
[2023/06/20 23:11:49] [debug] [task] destroy task=0x7f6a68015110 (task_id=0)
[2023/06/20 23:11:49] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:49] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:50] [debug] [task] created task=0x7f6a68015110 id=0 OK
[2023/06/20 23:11:50] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302709.410108494, {}], {"log"=>"vmqkaqruoz", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302709.910848699, {}], {"log"=>"cdwvfgpdfd", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:50] [debug] [out flush] cb_destroy coro_id=26
[2023/06/20 23:11:50] [debug] [task] destroy task=0x7f6a68015110 (task_id=0)
[2023/06/20 23:11:50] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:50] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:51] [debug] [task] created task=0x7f6a68015110 id=0 OK
[2023/06/20 23:11:51] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302710.411479728, {}], {"log"=>"natzjrcjph", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302710.912165690, {}], {"log"=>"gphrsbiilv", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:51] [debug] [out flush] cb_destroy coro_id=27
[2023/06/20 23:11:51] [debug] [task] destroy task=0x7f6a68015110 (task_id=0)
[2023/06/20 23:11:51] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:51] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:52] [debug] [task] created task=0x7f6a68015110 id=0 OK
[2023/06/20 23:11:52] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302711.412908977, {}], {"log"=>"svxguhdmax", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302711.913619606, {}], {"log"=>"onfnllyvow", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:52] [debug] [out flush] cb_destroy coro_id=28
[2023/06/20 23:11:52] [debug] [task] destroy task=0x7f6a68015110 (task_id=0)
[2023/06/20 23:11:52] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:52] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:11:53] [debug] [task] created task=0x7f6a68015670 id=0 OK
[2023/06/20 23:11:53] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302712.414349521, {}], {"log"=>"pvoxzlsfml", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302712.915064821, {}], {"log"=>"qzhnpqdxxn", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:11:53] [debug] [out flush] cb_destroy coro_id=29
[2023/06/20 23:11:53] [debug] [task] destroy task=0x7f6a68015670 (task_id=0)
[2023/06/20 23:11:53] [debug] [imds] using IMDSv1
[2023/06/20 23:11:53] [debug] [http_client] not using http_proxy for header
[2023/06/20 23:11:53] [debug] [imds] using IMDSv1
[2023/06/20 23:11:53] [debug] [http_client] not using http_proxy for header
[2023/06/20 23:11:53] [debug] [imds] using IMDSv1
[2023/06/20 23:11:53] [debug] [http_client] not using http_proxy for header
[2023/06/20 23:11:53] [debug] [filter:aws:aws.0] found tag Name which is included=1
[2023/06/20 23:11:53] [debug] [filter:aws:aws.0] found tag CUSTOMER_ID which is included=1
[2023/06/20 23:11:53] [debug] [input chunk] update output instances with new chunk size diff=175, records=1, input=stdin.0
[2023/06/20 23:11:53] [debug] [input chunk] update output instances with new chunk size diff=175, records=1, input=stdin.0
[2023/06/20 23:11:54] [debug] [task] created task=0x7f6a68021340 id=0 OK
[2023/06/20 23:11:54] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302713.415772286, {}], {"log"=>"dcszivdqsk", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168", "Name"=>"mwarzynski-fluentbit-dev", "CUSTOMER_ID"=>"43476f15-4ea1-46ef-895b-47d8f5745ac9"}]
[1] stdin.0: [[1687302713.916451286, {}], {"log"=>"cvguhtpksg", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168", "Name"=>"mwarzynski-fluentbit-dev", "CUSTOMER_ID"=>"43476f15-4ea1-46ef-895b-47d8f5745ac9"}]
[2023/06/20 23:11:54] [debug] [out flush] cb_destroy coro_id=30
[2023/06/20 23:11:54] [debug] [task] destroy task=0x7f6a68021340 (task_id=0)
[2023/06/20 23:11:54] [debug] [input chunk] update output instances with new chunk size diff=175, records=1, input=stdin.0
[2023/06/20 23:11:54] [debug] [input chunk] update output instances with new chunk size diff=175, records=1, input=stdin.0
[2023/06/20 23:11:55] [debug] [task] created task=0x7f6a68021340 id=0 OK
[2023/06/20 23:11:55] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302714.417153963, {}], {"log"=>"oohllwocqs", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168", "Name"=>"mwarzynski-fluentbit-dev", "CUSTOMER_ID"=>"43476f15-4ea1-46ef-895b-47d8f5745ac9"}]
[1] stdin.0: [[1687302714.917882541, {}], {"log"=>"dfeewqqtvd", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168", "Name"=>"mwarzynski-fluentbit-dev", "CUSTOMER_ID"=>"43476f15-4ea1-46ef-895b-47d8f5745ac9"}]
[2023/06/20 23:11:55] [debug] [out flush] cb_destroy coro_id=31
[2023/06/20 23:11:55] [debug] [task] destroy task=0x7f6a68021340 (task_id=0)
[2023/06/20 23:11:55] [debug] [input chunk] update output instances with new chunk size diff=175, records=1, input=stdin.0
[2023/06/20 23:11:55] [debug] [input chunk] update output instances with new chunk size diff=175, records=1, input=stdin.0
[2023/06/20 23:11:56] [debug] [task] created task=0x7f6a68021340 id=0 OK
[2023/06/20 23:11:56] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302715.418593370, {}], {"log"=>"dzltwikdmr", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168", "Name"=>"mwarzynski-fluentbit-dev", "CUSTOMER_ID"=>"43476f15-4ea1-46ef-895b-47d8f5745ac9"}]
[1] stdin.0: [[1687302715.919313354, {}], {"log"=>"ookptsgpfk", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168", "Name"=>"mwarzynski-fluentbit-dev", "CUSTOMER_ID"=>"43476f15-4ea1-46ef-895b-47d8f5745ac9"}]
[2023/06/20 23:11:56] [debug] [out flush] cb_destroy coro_id=32
[2023/06/20 23:11:56] [debug] [task] destroy task=0x7f6a68021340 (task_id=0)
[2023/06/20 23:11:56] [debug] [input chunk] update output instances with new chunk size diff=175, records=1, input=stdin.0
^C[2023/06/20 23:11:56] [engine] caught signal (SIGINT)
[2023/06/20 23:11:56] [debug] [task] created task=0x7f6a68021340 id=0 OK
[2023/06/20 23:11:56] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[2023/06/20 23:11:56] [ warn] [engine] service will shutdown in max 5 seconds
[0] stdin.0: [[1687302716.420024849, {}], {"log"=>"zmkzggzhan", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168", "Name"=>"mwarzynski-fluentbit-dev", "CUSTOMER_ID"=>"43476f15-4ea1-46ef-895b-47d8f5745ac9"}]
[2023/06/20 23:11:56] [debug] [out flush] cb_destroy coro_id=33
[2023/06/20 23:11:56] [debug] [task] destroy task=0x7f6a68021340 (task_id=0)
[2023/06/20 23:11:56] [ warn] [input:stdin:stdin.0] end of file (stdin closed by remote end)
[2023/06/20 23:11:56] [ warn] [engine] service will shutdown in max 5 seconds
[2023/06/20 23:11:57] [ info] [engine] service has stopped (0 pending tasks)
[2023/06/20 23:11:57] [ info] [output:stdout:stdout.0] thread worker #0 stopping...
[2023/06/20 23:11:57] [ info] [output:stdout:stdout.0] thread worker #0 stopped

Valgrind run

$ python3 logs_producer.py | valgrind ./bin/fluent-bit -c ./fluent-bit.conf
...
[2023/06/20 23:13:24] [debug] [out flush] cb_destroy coro_id=24
[2023/06/20 23:13:24] [debug] [task] destroy task=0x78c3b40 (task_id=0)
[2023/06/20 23:13:24] [debug] [imds] using IMDSv1
[2023/06/20 23:13:24] [debug] [http_client] not using http_proxy for header
[2023/06/20 23:13:24] [debug] [aws_client] (null): http_do=0, HTTP Status: 500
[2023/06/20 23:13:24] [error] [filter:aws:aws.0] Failed to get instance EC2 Tags
[2023/06/20 23:13:24] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:13:24] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[0] stdin.0: [[1687302804.446433952, {}], {"log"=>"sjzkwkkhip", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:13:25] [debug] [task] created task=0x7950910 id=0 OK
[1] stdin.0: [[1687302804.947079389, {}], {"log"=>"xygfrurbve", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:13:25] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[2023/06/20 23:13:25] [debug] [out flush] cb_destroy coro_id=25
[2023/06/20 23:13:25] [debug] [task] destroy task=0x7950910 (task_id=0)
[2023/06/20 23:13:25] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:13:25] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:13:26] [debug] [task] created task=0x79d9c10 id=0 OK
[0] stdin.0: [[1687302805.447817812, {}], {"log"=>"lgnlcjzwnl", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:13:26] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[1] stdin.0: [[1687302805.948130255, {}], {"log"=>"uhwmylemhu", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:13:26] [debug] [out flush] cb_destroy coro_id=26
[2023/06/20 23:13:26] [debug] [task] destroy task=0x79d9c10 (task_id=0)
[2023/06/20 23:13:26] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:13:26] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[0] stdin.0: [[1687302806.449137818, {}], {"log"=>"doewwkdtwx", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:13:27] [debug] [task] created task=0x7a62f10 id=0 OK
[1] stdin.0: [[1687302806.949977808, {}], {"log"=>"rwjwuibenx", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:13:27] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[2023/06/20 23:13:27] [debug] [out flush] cb_destroy coro_id=27
[2023/06/20 23:13:27] [debug] [task] destroy task=0x7a62f10 (task_id=0)
[2023/06/20 23:13:27] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:13:27] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:13:28] [debug] [task] created task=0x7a9ab60 id=0 OK
[0] stdin.0: [[1687302807.450893848, {}], {"log"=>"pcaczrcqes", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302807.951367397, {}], {"log"=>"kwkmoghwbm", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:13:28] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[2023/06/20 23:13:28] [debug] [out flush] cb_destroy coro_id=28
[2023/06/20 23:13:28] [debug] [task] destroy task=0x7a9ab60 (task_id=0)
[2023/06/20 23:13:28] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:13:28] [debug] [input chunk] update output instances with new chunk size diff=95, records=1, input=stdin.0
[2023/06/20 23:13:29] [debug] [task] created task=0x7b75540 id=0 OK
[0] stdin.0: [[1687302808.451598896, {}], {"log"=>"eiiolurnhv", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[1] stdin.0: [[1687302808.952401873, {}], {"log"=>"afwsrzjqar", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168"}]
[2023/06/20 23:13:29] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[2023/06/20 23:13:29] [debug] [out flush] cb_destroy coro_id=29
[2023/06/20 23:13:29] [debug] [task] destroy task=0x7b75540 (task_id=0)
[2023/06/20 23:13:29] [debug] [imds] using IMDSv1
[2023/06/20 23:13:29] [debug] [http_client] not using http_proxy for header
[2023/06/20 23:13:29] [debug] [imds] using IMDSv1
[2023/06/20 23:13:29] [debug] [http_client] not using http_proxy for header
[2023/06/20 23:13:29] [debug] [imds] using IMDSv1
[2023/06/20 23:13:29] [debug] [http_client] not using http_proxy for header
[2023/06/20 23:13:29] [debug] [filter:aws:aws.0] found tag Name which is included=1
[2023/06/20 23:13:29] [debug] [filter:aws:aws.0] found tag CUSTOMER_ID which is included=1
[2023/06/20 23:13:29] [debug] [input chunk] update output instances with new chunk size diff=175, records=1, input=stdin.0
[2023/06/20 23:13:29] [debug] [input chunk] update output instances with new chunk size diff=175, records=1, input=stdin.0
[2023/06/20 23:13:30] [debug] [task] created task=0x7c0bda0 id=0 OK
[0] stdin.0: [[1687302809.453176773, {}], {"log"=>"qlisjlqwoa", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168", "Name"=>"mwarzynski-fluentbit-dev", "CUSTOMER_ID"=>"43476f15-4ea1-46ef-895b-47d8f5745ac9"}]
[1] stdin.0: [[1687302809.953764480, {}], {"log"=>"kmjgdegdez", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168", "Name"=>"mwarzynski-fluentbit-dev", "CUSTOMER_ID"=>"43476f15-4ea1-46ef-895b-47d8f5745ac9"}]
[2023/06/20 23:13:30] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[2023/06/20 23:13:30] [debug] [out flush] cb_destroy coro_id=30
[2023/06/20 23:13:30] [debug] [task] destroy task=0x7c0bda0 (task_id=0)
[2023/06/20 23:13:30] [debug] [input chunk] update output instances with new chunk size diff=175, records=1, input=stdin.0
[2023/06/20 23:13:30] [debug] [input chunk] update output instances with new chunk size diff=175, records=1, input=stdin.0
[2023/06/20 23:13:31] [debug] [task] created task=0x7c94e20 id=0 OK
[0] stdin.0: [[1687302810.454493939, {}], {"log"=>"bspdspfbst", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168", "Name"=>"mwarzynski-fluentbit-dev", "CUSTOMER_ID"=>"43476f15-4ea1-46ef-895b-47d8f5745ac9"}]
[1] stdin.0: [[1687302810.955245984, {}], {"log"=>"kfztcwxlds", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168", "Name"=>"mwarzynski-fluentbit-dev", "CUSTOMER_ID"=>"43476f15-4ea1-46ef-895b-47d8f5745ac9"}]
[2023/06/20 23:13:31] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[2023/06/20 23:13:31] [debug] [out flush] cb_destroy coro_id=31
[2023/06/20 23:13:31] [debug] [task] destroy task=0x7c94e20 (task_id=0)
[2023/06/20 23:13:31] [debug] [input chunk] update output instances with new chunk size diff=175, records=1, input=stdin.0
[2023/06/20 23:13:31] [debug] [input chunk] update output instances with new chunk size diff=175, records=1, input=stdin.0
[2023/06/20 23:13:32] [debug] [task] created task=0x7d1dea0 id=0 OK
[0] stdin.0: [[1687302811.455955388, {}], {"log"=>"xsipauitvy", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168", "Name"=>"mwarzynski-fluentbit-dev", "CUSTOMER_ID"=>"43476f15-4ea1-46ef-895b-47d8f5745ac9"}]
[1] stdin.0: [[1687302811.956567417, {}], {"log"=>"dadyiftnpx", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168", "Name"=>"mwarzynski-fluentbit-dev", "CUSTOMER_ID"=>"43476f15-4ea1-46ef-895b-47d8f5745ac9"}]
[2023/06/20 23:13:32] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[2023/06/20 23:13:32] [debug] [out flush] cb_destroy coro_id=32
[2023/06/20 23:13:32] [debug] [task] destroy task=0x7d1dea0 (task_id=0)
[2023/06/20 23:13:32] [debug] [input chunk] update output instances with new chunk size diff=175, records=1, input=stdin.0
[2023/06/20 23:13:32] [debug] [input chunk] update output instances with new chunk size diff=175, records=1, input=stdin.0
[2023/06/20 23:13:33] [debug] [task] created task=0x7da6f20 id=0 OK
[0] stdin.0: [[1687302812.457281436, {}], {"log"=>"tkrwsizxrf", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168", "Name"=>"mwarzynski-fluentbit-dev", "CUSTOMER_ID"=>"43476f15-4ea1-46ef-895b-47d8f5745ac9"}]
[1] stdin.0: [[1687302812.958238344, {}], {"log"=>"rltoogpkzm", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168", "Name"=>"mwarzynski-fluentbit-dev", "CUSTOMER_ID"=>"43476f15-4ea1-46ef-895b-47d8f5745ac9"}]
[2023/06/20 23:13:33] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[2023/06/20 23:13:33] [debug] [out flush] cb_destroy coro_id=33
[2023/06/20 23:13:33] [debug] [task] destroy task=0x7da6f20 (task_id=0)
[2023/06/20 23:13:33] [debug] [input chunk] update output instances with new chunk size diff=175, records=1, input=stdin.0
[2023/06/20 23:13:33] [debug] [input chunk] update output instances with new chunk size diff=175, records=1, input=stdin.0
^C[2023/06/20 23:13:34] [engine] caught signal (SIGINT)
[2023/06/20 23:13:34] [ warn] [input:stdin:stdin.0] end of file (stdin closed by remote end)
[2023/06/20 23:13:34] [debug] [task] created task=0x7e2fff0 id=0 OK
[2023/06/20 23:13:34] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] stdin.0: [[1687302813.458823645, {}], {"log"=>"snbpnvsoom", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168", "Name"=>"mwarzynski-fluentbit-dev", "CUSTOMER_ID"=>"43476f15-4ea1-46ef-895b-47d8f5745ac9"}]
[1] stdin.0: [[1687302813.959496394, {}], {"log"=>"zapwsnhzak", "az"=>"us-east-1a", "ec2_instance_id"=>"i-0e66fc7f9809d7168", "Name"=>"mwarzynski-fluentbit-dev", "CUSTOMER_ID"=>"43476f15-4ea1-46ef-895b-47d8f5745ac9"}]
[2023/06/20 23:13:34] [ warn] [engine] service will shutdown in max 5 seconds
[2023/06/20 23:13:34] [ warn] [engine] service will shutdown in max 5 seconds
[2023/06/20 23:13:34] [debug] [out flush] cb_destroy coro_id=34
[2023/06/20 23:13:34] [debug] [task] destroy task=0x7e2fff0 (task_id=0)
[2023/06/20 23:13:34] [ info] [engine] service has stopped (0 pending tasks)
[2023/06/20 23:13:34] [ info] [output:stdout:stdout.0] thread worker #0 stopping...
[2023/06/20 23:13:34] [ info] [output:stdout:stdout.0] thread worker #0 stopped
==2648==
==2648== HEAP SUMMARY:
==2648==     in use at exit: 0 bytes in 0 blocks
==2648==   total heap usage: 5,045 allocs, 5,045 frees, 20,241,427 bytes allocated
==2648==
==2648== All heap blocks were freed -- no leaks are possible
==2648==
==2648== For lists of detected and suppressed errors, rerun with: -s
==2648== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Enter [N/A] in the box if an item is not applicable to your change.

Testing

Before we can approve your change, please submit the following in a comment:

  • [x] Example configuration file for the change
  • [x] Debug log output from testing the change
  • [x] Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • [N/A] Run local packaging test showing all targets (including any new ones) build.
  • [N/A] Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • [x] Documentation required for this feature: https://github.com/fluent/fluent-bit-docs/pull/1138

Backporting

  • [N/A] Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0; by submitting this pull request I understand that this code will be released under the terms of that license.

mwarzynski avatar Apr 22 '23 15:04 mwarzynski

@PettitWesley Hey, I wonder if you could find some time to check out this PR. Absolutely no pressure though.

mwarzynski avatar May 22 '23 20:05 mwarzynski

@PettitWesley I pushed the first version which should satisfy the minimal requirements for 'Fixed Interval Retries' and handle 500s for EC2 tag requests.

I initially set the default value for the retry interval to 5s. We can make it longer (e.g. 5m).

I tested my changes locally. In the local run, Fluent Bit received 2 logs per second on its input, while the settings used a 1s flush interval. Here are the logs from the EC2 Metadata Server 'mock':

mwarzynski@6d1934297b3c:/fluent-bit-dev/build-dev$ python3 ec2_server.py
127.0.0.1 - - [18/Jun/2023 02:48:26] "GET /latest/meta-data/instance-id/ HTTP/1.1" 200 -
127.0.0.1 - - [18/Jun/2023 02:48:26] "GET /latest/meta-data/placement/availability-zone/ HTTP/1.1" 200 -
127.0.0.1 - - [18/Jun/2023 02:48:26] "GET /latest/meta-data/tags/instance HTTP/1.1" 500 -
127.0.0.1 - - [18/Jun/2023 02:48:31] "GET /latest/meta-data/tags/instance HTTP/1.1" 500 -
127.0.0.1 - - [18/Jun/2023 02:48:36] "GET /latest/meta-data/tags/instance HTTP/1.1" 500 -
127.0.0.1 - - [18/Jun/2023 02:48:41] "GET /latest/meta-data/tags/instance HTTP/1.1" 500 -
^Cmwarzynski@6d1934297b3c:/fluent-bit-dev/build-dev$ python3 ec2_server.py  # fixed the 500 bug, reload server -- will respond with 200s
127.0.0.1 - - [18/Jun/2023 02:48:46] "GET /latest/meta-data/tags/instance HTTP/1.1" 200 -
127.0.0.1 - - [18/Jun/2023 02:48:46] "GET /latest/meta-data/tags/instance/Name HTTP/1.1" 200 -
127.0.0.1 - - [18/Jun/2023 02:48:46] "GET /latest/meta-data/tags/instance/CUSTOMER_ID HTTP/1.1" 200 -

As you can see, Fluent Bit made requests for the EC2 tags every 5s even though flushes were happening every 1s. If no logs were being flushed, the interval would be even longer (until the next flush). Here are the Fluent Bit logs: https://gist.github.com/mwarzynski/83f9dfc8444fb6f68ba2df68c949ca2e

Please note that I introduced the concept of a metadata_group, so that we can separate concerns and, in the future, define how critical each metadata group is. At the moment, if anything from the 'base' group fails, we don't include any of the successfully fetched information in the log. However, I believe the correct approach is to separate the concerns and either push whatever we managed to fetch, or block on retries for the metadata groups marked as required in the configuration. For this reason, I added another struct definition, flb_aws_filter_metadata_group, so that we can extend the retry strategy and implement more groups in the future.
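As a rough, hedged sketch of such a per-group descriptor (the PR's actual flb_aws_filter_metadata_group may define different fields):

#include <time.h>

/* Illustrative only: one descriptor per metadata group, so each group
 * can track its own fetch state and, later, its own criticality. */
struct metadata_group_sketch {
    const char *name;              /* e.g. "base", "tags" */
    int         required;          /* future: block flushing until fetched */
    int         done;              /* fetched successfully or skipped */
    time_t      last_attempt;      /* feeds the retry-interval check */
    int       (*fetch)(void *ctx); /* group-specific fetch callback */
};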

Let me know what you think about the configuration design for the retry seconds: tags_retry_interval_s. In my opinion, if we later switch to allowing retries for each metadata group separately, we want something more general, e.g. simply retry_interval_s; tags_retry_interval_s is too specific. If we add separate metadata groups for instance ID, AZ, AMI, and so on, creating a separate config option for each metadata group would be overkill.

backoff formula is probably more complicated

Yeah, it is. I would stick with a simpler solution for the first iteration. I can propose exponential backoff in follow-up PRs (if we think it's worth it).
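For reference, a common capped exponential backoff (again, not what this PR implements) doubles the wait per failed attempt up to a cap:

#include <time.h>

/* Not part of this PR: sketch of a capped exponential backoff. */
static time_t backoff_interval_s(int failed_attempts, time_t base_s, time_t cap_s)
{
    time_t wait = base_s;
    int i;

    for (i = 1; i < failed_attempts && wait < cap_s; i++) {
        wait *= 2;   /* base, 2*base, 4*base, ... */
    }
    return (wait < cap_s) ? wait : cap_s;
}

With base_s = 5 and cap_s = 300, the waits would be 5s, 10s, 20s, ..., capped at 300s.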

For some use cases (for transparency, mine included), it might be useful to block pushing logs until the AWS filter successfully fetches all metadata groups. Maybe we could handle this with a special zero value for the interval?

I think we should make the configuration more explicit. If someone wants to require EC2 tags in the logs, we can add another configuration option, required (with an empty value as the default). With required tags set, we wouldn't allow logs to be flushed until the EC2 tags are successfully fetched (or tagging is disabled, i.e. 404). Such a design would also be easily extensible in the future (e.g. required tags,instance_id, or required * to require all enabled groups); see the config sketch below.
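As a configuration sketch (the required option here is hypothetical and not implemented in this PR):

[FILTER]
    Name aws
    Match *
    tags_enabled true
    retry_interval_s 300
    required tags,instance_id

With such a setting, the filter would hold records back until both groups are fetched (or until tagging turns out to be disabled via 404).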

What do you think?

mwarzynski avatar Jun 18 '23 03:06 mwarzynski

@mwarzynski I like the idea of creating a "required" tag group which causes the filter to fail to send logs unless certain metadata can be fetched.

For some use cases (for transparency, mine included), it might be useful to block pushing logs until Filter AWS fetches successfully all information groups.

Would you be okay with losing the logs instead? That sentence should read "unless", not "until", because that's what will happen if the filter won't pass through any logs unless a value can be fetched.

I think that's probably a very niche use case?

PettitWesley avatar Jun 19 '23 22:06 PettitWesley

Would you be okay with losing the logs instead? That sentence should read "unless" not "until". Because that's what will happen if the filter won't pass through any logs unless a value can be fetched.

@PettitWesley In my personal opinion, I would rather submit logs without the necessary information than lose them altogether. Of course, this decision may vary depending on the specific use case. However, I likely would not be comfortable with losing the logs completely. It is preferable to have the logs without certain metadata than to lose them entirely.

Isn't it dependent on the type of input? In other words, inputs can vary in terms of source persistence:

  1. Non-persistent inputs, such as stdout from a command execution.
  2. Persistent inputs, like using tail on a log file (assuming that file+offset.db is stored on a persistent volume).

I would assume that in the scenario where required metadata is not fetched, the persistent input logs (2) would not be lost. At the same time, the filter probably has no reliable way to distinguish source persistence, so given that we are not comfortable with losing the non-persistent logs, it might not be productive to introduce the required option in filters.

I think that's probably a very niche use case?

It most likely is. Furthermore, I would like to emphasize that it would address a specific scenario that is not expected to occur frequently, if at all.

mwarzynski avatar Jun 19 '23 23:06 mwarzynski

@mwarzynski

Persistent inputs, like using tail on a log file (assuming that file+offset.db is stored on a persistent volume).

So the tail input keeps a file offset position. Once it reads the logs, it advances the offset and considers the data read. If a filter then drops it all, the tail input still considers it read.
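For context, the tail input only persists those offsets across restarts when a database file is configured, e.g.:

[INPUT]
    Name tail
    Path /var/log/app.log
    DB   /var/lib/fluent-bit/tail-offsets.db

Even then, records dropped by a filter after being read are gone for good, because the offset has already advanced past them.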

Take a look at the code in flb_filter.c, in flb_filter_do().

PettitWesley avatar Jun 20 '23 00:06 PettitWesley

I also would like to outline a brief plan for future PRs concerning retries. From my perspective, we can divide the work into a few parts:

  1. Implementing retries separately for all metadata groups:
    • For example, if fetching the instance_id fails, the AWS filter should still inject other metadata groups like ami_id.
  2. Introducing a retry duration with exponential back-off:
    • For instance, if an issue was very short-lived and only occurred in one request, it might be beneficial to retry faster a few times. Eventually, after a few quicker retries, the AWS filter can fall back to the interval specified in the configuration (e.g., 5 minutes).
  3. Supporting refresh of the EC2 tags for Nitro instances.

These PRs will serve as follow-ups and can be considered a starting point for discussion. To be clear, I'm not assuming any commitment from your side to implementing these changes. I will likely open the PRs, and we can discuss the potential improvements in these areas once they are open.

mwarzynski avatar Jun 23 '23 00:06 mwarzynski

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar Oct 07 '23 01:10 github-actions[bot]

@mwarzynski is this waiting on another round of review from me?

PettitWesley avatar Oct 08 '23 18:10 PettitWesley

@PettitWesley Yes, I think so.

mwarzynski avatar Oct 09 '23 22:10 mwarzynski

I rebased my changes on the latest master. I also went through each discussion and made sure there is nothing actionable on my side in any of them (it might be good to get an answer about the behavior for retry_interval_s=0, if we want to handle that in this PR).

mwarzynski avatar Oct 21 '23 21:10 mwarzynski

@PettitWesley Hey! Would you please let me know what the next steps are here? Should I put in the work and rebase my changes on the latest master? Or maybe we don't need / want to merge them and can close this PR?

mwarzynski avatar Jan 12 '24 11:01 mwarzynski

@PettitWesley No worries. Thank you for the code review! Great news that it looks okay.

I will try to find some time today and rebase it on the latest master / check why it doesn't compile.

mwarzynski avatar Mar 13 '24 14:03 mwarzynski

I rebased my changes on the latest origin:master. It compiled successfully on my machine.

...
[ 89%] Linking C executable ../bin/fluent-bit
[ 89%] Building C object src/CMakeFiles/fluent-bit-shared.dir/flb_mp.c.o
[ 89%] Building C object src/CMakeFiles/fluent-bit-shared.dir/flb_kv.c.o
[ 89%] Building C object src/CMakeFiles/fluent-bit-shared.dir/flb_api.c.o
[ 89%] Building C object src/CMakeFiles/fluent-bit-shared.dir/flb_csv.c.o
[ 89%] Building C object src/CMakeFiles/fluent-bit-shared.dir/flb_lib.c.o
[ 89%] Building C object src/CMakeFiles/fluent-bit-shared.dir/flb_log.c.o
[ 89%] Building C object src/CMakeFiles/fluent-bit-shared.dir/flb_env.c.o
[ 89%] Building C object src/CMakeFiles/fluent-bit-shared.dir/flb_file.c.o
[ 89%] Building C object src/CMakeFiles/fluent-bit-shared.dir/flb_uri.c.o
[ 89%] Building C object src/CMakeFiles/fluent-bit-shared.dir/flb_hash_table.c.o
[ 90%] Building C object src/CMakeFiles/fluent-bit-shared.dir/flb_help.c.o
[ 90%] Building C object src/CMakeFiles/fluent-bit-shared.dir/flb_pack.c.o
[ 90%] Built target fluent-bit-bin
...
Scanning dependencies of target hello_world
Scanning dependencies of target out_lib
[ 98%] Building C object examples/hello_world/CMakeFiles/hello_world.dir/hello_world.c.o
[ 98%] Building C object examples/out_lib/CMakeFiles/out_lib.dir/out_lib.c.o
[ 98%] Building C object examples/hello_world/CMakeFiles/hello_world.dir/__/__/lib/lwrb/lwrb/src/lwrb/lwrb.c.o
[100%] Building C object examples/out_lib/CMakeFiles/out_lib.dir/__/__/lib/lwrb/lwrb/src/lwrb/lwrb.c.o
[100%] Linking C executable ../../bin/hello_world
[100%] Linking C executable ../../bin/out_lib
[100%] Built target hello_world
[100%] Built target out_lib

mwarzynski avatar Mar 15 '24 19:03 mwarzynski

I used the config that you have in the description and got a single attempt to request tags, which failed since tagging is not enabled; the filter then proceeded to add the other metadata to the logs:

[2024/03/18 22:21:41] [debug] [http_client] server 169.254.169.254:80 will close connection #23
[2024/03/18 22:21:41] [debug] [aws_client] (null): http_do=0, HTTP Status: 404
[2024/03/18 22:21:41] [debug] [imds] metadata request failure response
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
		 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title>404 - Not Found</title>
 </head>
 <body>
  <h1>404 - Not Found</h1>
 </body>
</html>

[2024/03/18 22:21:41] [ warn] [filter:aws:aws.0] EC2 instance metadata tag request returned 404. This likely indicates your instance has no tags or the EC2 tagging metadata API is not enabled
[2024/03/18 22:21:41] [debug] [stdout:stdout.0] created event channels: read=23 write=24
[2024/03/18 22:21:41] [ info] [output:stdout:stdout.0] worker #0 started
[2024/03/18 22:21:41] [ info] [sp] stream processor started
[2024/03/18 22:21:42] [debug] [input chunk] update output instances with new chunk size diff=94, records=1, input=dummy.0
[2024/03/18 22:21:43] [debug] [task] created task=0x7f4db001f5c0 id=0 OK
[2024/03/18 22:21:43] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[2024/03/18 22:21:43] [debug] [input chunk] update output instances with new chunk size diff=94, records=1, input=dummy.0
[0] dummy.0: [[1710800502.687835217, {}], {"message"=>"dummy", "az"=>"us-west-2a", "ec2_instance_id"=>"i-0385871b581452acd"}]

From what I can tell, it never retried the tag request and never hit the failure again. I think I'm not understanding something here. I configured retry_interval_s 5.

PettitWesley avatar Mar 18 '24 22:03 PettitWesley

I think I understand. IMDS returns 404 when tagging is not enabled, which counts as a non-retryable error.
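
Roughly, I read the split like this (a sketch with made-up names, not the actual plugin source):

/* Sketch only: how I read the retryable/non-retryable split. */
#include <stdbool.h>
#include <stdio.h>

static bool tag_fetch_should_retry(int http_status, bool request_failed)
{
    if (request_failed) {
        return true;            /* IMDS unreachable, timeout, etc. */
    }
    if (http_status == 404) {
        return false;           /* tagging disabled or no tags: final */
    }
    if (http_status >= 500) {
        return true;            /* transient server-side failure */
    }
    return http_status != 200;  /* anything else unexpected: retry */
}

int main(void)
{
    printf("404 -> retry=%d\n", tag_fetch_should_retry(404, false)); /* 0 */
    printf("500 -> retry=%d\n", tag_fetch_should_retry(500, false)); /* 1 */
    printf("net -> retry=%d\n", tag_fetch_should_retry(0, true));    /* 1 */
    return 0;
}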

I think this PR and code is fine as is. However, remember the Nitro case you brought up, where a user can enable tags while FLB is running, without restarting the instance.

I think those users are the ones who might want this retry feature with a retry_interval_s, whereas I am not sure how or why users not on Nitro would want the retry interval.

AFAICT, there are two features here:

  1. Refactoring to separate metadata groups so that one can fail while the others succeed, and the plugin will still add what metadata it has and continue. <-- I'd ship this part of it
  2. Adding a retry_interval_s config option. I still do not see the use case for this. When will a user configure it? Could we just have some default value for now? I'm wondering if Nitro users would want this and would want to enable retries by setting the interval, whereas other users might want no retries and thus would want a default no_retries setting or some such option.

What do you think?

My initial version of this PR is a 'proposal'. Minimal effort to handle 500s.

So if we ship this as is, the use case is that it will retry if IMDS returns a 500?

I think that's very useful and thank you for adding it. I am not sure if it justifies a new config option. I suppose configurability is better than nothing, but I am struggling to imagine how users pick their desired value. What impacts the value you want? Why did you choose 5s as the default?

PettitWesley avatar Mar 18 '24 22:03 PettitWesley

@PettitWesley Hey! Thank you for checking it out.

I think I understand. IMDS returns 404 when tagging is not enabled, which counts as a non-retryable error.

Yes, correct. Fluent Bit also produced a warning about this case (taken from your logs):

[2024/03/18 22:21:41] [ warn] [filter:aws:aws.0] EC2 instance metadata tag request returned 404. This likely indicates your instance has no tags or the EC2 tagging metadata API is not enabled

AFAICT, there are two features here:

Yes. From Fluent Bit's behaviour perspective, there are two main goals/deliverables with this PR (see the sketch after this list):

  1. attach as many AWS metadata groups as Fluent Bit is able to fetch (i.e. do not break on the first metadata group fetch error);
  2. retry failed metadata group fetches during consecutive flushes.
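
To make point 1 concrete, here is a minimal sketch in C of the intended behaviour -- the names and structure are made up for illustration and are not the actual filter_aws code:

#include <stdbool.h>
#include <stdio.h>

struct metadata_group {
    const char *name;
    bool fetched;            /* already fetched successfully */
    int (*fetch)(void);      /* returns 0 on success */
};

/* Stand-in fetchers for the sketch. */
static int fetch_instance_id(void) { return 0; }
static int fetch_az(void)          { return 0; }
static int fetch_tags(void)        { return -1; }  /* simulate a failure */

int main(void)
{
    struct metadata_group groups[] = {
        { "ec2_instance_id", false, fetch_instance_id },
        { "az",              false, fetch_az },
        { "tags",            false, fetch_tags },
    };
    int count = sizeof(groups) / sizeof(groups[0]);

    for (int i = 0; i < count; i++) {
        if (groups[i].fetched) {
            continue;                     /* already have this group */
        }
        if (groups[i].fetch() == 0) {
            groups[i].fetched = true;     /* group gets injected into logs */
        }
        else {
            /* Point 1: do NOT abort -- the groups that succeeded are
             * still injected, and this one is retried on a later flush. */
            printf("group '%s' failed, will retry later\n", groups[i].name);
        }
    }
    return 0;
}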

One of the value propositions here, as I perceive it, is handling errors related to fetching AWS metadata groups. In the past I experienced issues with our 'in-house component' caused by EC2 instances not being able to connect to the 'aws metadata server'. This was difficult to debug, because every error concluded with a message that the aws metadata server is 'not reachable'. Ultimately we didn't find the root cause; we have a bug in the backlog and are monitoring whether it happens again. (We didn't find anything helpful for troubleshooting in the AWS documentation, but maybe we missed something.) To be honest, this PR would probably not have helped us avoid that particular bug, as retries would still consistently hit the 'not reachable' error. That doesn't imply the PR won't help in other cases, though: our component may have automatically recovered from similar transient failures thanks to the 'retry strategy' already in place, which is exactly why I wouldn't know about such situations...

In general, I believe there might be various technical issues while performing fetches for aws metadata group. Therefore adding simple retries to mitigate these errors seems like a good solution.

What do you think?

I assume your questions were regarding the following topics. (If there are additional things you meant, feel free to ask.)

  1. Should 'retry' be enabled for all use cases?

I believe retries should be enabled by default for all use cases (not only Nitro EC2 instances, but also regular EC2 instances). It's the same line of thinking: there might be an issue with fetching metadata groups => it would be nice to automatically retry fetching the metadata groups and ensure we do our best to push enriched logs.

Retry attempts are made, at worst, at every flush, so even if retries fail at each attempt, there should be no noticeable performance effect overall. (Well, okay, it's a hypothesis; we might confirm it via a real-world test.) I struggle to see a good argument in favor of disabling the retries.
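
For reference, the gating itself is just a timestamp check evaluated on flush; roughly (a sketch, not the real code):

/* Sketch only: retry at most once per retry_interval_s. */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

static time_t last_attempt = 0;   /* tracked per metadata group in reality */

static bool should_retry_now(int retry_interval_s)
{
    time_t now = time(NULL);
    if (now - last_attempt < retry_interval_s) {
        return false;             /* too soon: skip on this flush */
    }
    last_attempt = now;           /* record the attempt being made now */
    return true;
}

int main(void)
{
    /* Two checks in a row: the second is suppressed by the interval. */
    printf("first:  %d\n", should_retry_now(5));  /* 1 */
    printf("second: %d\n", should_retry_now(5));  /* 0 */
    return 0;
}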

  2. Should 'retry_interval_s' be configurable by the end user?

I'll leave this decision up to you.

Introducing complexity in a popular project isn't the same as doing so in a hobby project. You will have to maintain this config option in the future (and take care of the communication with the users). I guess this means we want to add a new option only if we absolutely need to.

  3. What value for 'retry_interval_s' would make sense for EC2 Nitro users?

Note that the current PR does not support the use case of dynamic changes in the AWS metadata group values. For instance, if Fluent Bit fetches all metadata groups correctly on the first try, it won't attempt to fetch them again -- metadata groups won't be refreshed even if the user changes them for the Nitro instance.

Let's imagine that we actually had such a solution for refreshing the AWS metadata groups (tags) in place for Nitro EC2 instances.

Unfortunately, no hard guarantee can be made about when Fluent Bit (FB) will refresh the log labels (as those refreshes may fail). My line of thinking is: in order to ensure logs always have the expected labels, the user's solution must make the components running inside the EC2 Nitro instance produce logs that carry those critical labels themselves. In other words, the design for adding labels to the logs would be such that Nitro 'dynamic' tags are not 'critical' to the observability of the solution (those labels are only 'nice to have'). This implies that re-fetches of dynamic tags could be done on the order of minutes, so my proposal for this case would be 1m as the default.
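
If we ever added such a refresh, I could imagine a config along these lines -- note that tags_refresh_interval_s is a made-up option for illustration and does not exist in this PR:

[FILTER]
    Name aws
    Match *
    tags_enabled true
    # Hypothetical option, NOT part of this PR: re-fetch tags every 60s
    # so tags enabled at runtime on a Nitro instance eventually appear.
    tags_refresh_interval_s 60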


So if we ship this as is, the use case is that it will retry if IMDS returns a 500?

Yes, it will retry for 500 errors, 'not reachable' errors, or anything else that's unexpected.

Why did you choose 5s as the default?

I picked 5s as a rule of thumb; it looks good based on my intuition. There is no cleverness behind it / it might be stupid. Basically, I didn't want to retry at every flush (=1s), as in the case of persistent errors fetching metadata groups, Fluent Bit would produce a lot of error logs. 5x less frequent is 5s. This logic is obviously flawed, as users can set a different flush period or may not be sensitive to the amount of error logs Fluent Bit produces.

So, what decision do we want to make? Do we remove the config option for retry_interval_s and hardcode it by default to 5s (or 10s/1s)?


Well, this comment is pretty long. If you prefer me to be more concise, just let me know. (After all, those LLMs might be useful for something).

mwarzynski avatar Mar 26 '24 02:03 mwarzynski

@mwarzynski I've merged this into master.

PettitWesley avatar May 03 '24 22:05 PettitWesley

Thank you!

@PettitWesley Do we want to adjust the documentation of the config parameters? https://github.com/fluent/fluent-bit-docs/pull/1138

mwarzynski avatar May 04 '24 00:05 mwarzynski

@mwarzynski we must wait to merge the docs until this is released. Please watch the release notes or tags for a new one to be cut.

PettitWesley avatar May 06 '24 18:05 PettitWesley