fluent-bit icon indicating copy to clipboard operation
fluent-bit copied to clipboard

filter: aws ec2 tags

Open mwarzynski opened this issue 3 years ago • 5 comments

PR adds EC2 instance tags to the AWS Metadata. Feature is opt-in (i.e. disabled by default). If enabled, the aws filter will fetch instance tags directly from the EC2 metadata server at the initialization phase and add tags to logs. Logic is exactly the same as it's currently implemented for other AWS metadata.

Related AWS announcement: https://aws.amazon.com/about-aws/whats-new/2022/01/instance-tags-amazon-ec2-instance-metadata-service/

This feature works if EC2 instance is configured to expose tags in the metadata server. For reference: https://awscli.amazonaws.com/v2/documentation/api/latest/reference/ec2/modify-instance-metadata-options.html and --instance-metadata-tags. It is disabled by default.

How to use?

Enriching logs with EC2 instance tags might be enabled by tags_enabled true option in the configuration for the AWS Filter.

Example configuration:

  [FILTER]
      Name aws
      Match *
      imds_version v1
      tags_enabled true  # <- new config option

For instance, if we would launch fluent-bit with the configuration specified above, then output logs would contain EC2 tags, e.g.:

{
  "log": "something",
  "tag_name1": "tag_value1",
  "tag_name2": "tag_value2",
  "tag_name3": "tag_value3",
  "ItIsJustAString": "soValuesMayNotBeConsistent",
  "Name": "MyNameForEC2Instance"
}

I also would like to outline, that it allows to use EC2 tags in templates in the Output configuration. Let's see how it would look like for the Cloudwatch:

[OUTPUT]
  Name cloudwatch_logs
  Match *
  region us-west-2
  log_group_name myawesomelogs
  log_stream_prefix tagvalue1missing-    # used for fallback if tag is not present,
  log_stream_template $tag_name1  # <- will use 'tag_value1' as the stream name, at least for the EC2 which we've seen logs before
  auto_create_group On

How it works?

If enabled, in the init phase, it fetches the list of tag keys from the /latest/meta-data/tags/instance endpoint, and then for each tag key uses /latest/meta-data/tags/instance/{tag_key} to fetch the tag value.

Tags are stored in the AWS Filter ctx and therefore set from 'in-memory cache' for each log.

Drawbacks / Limitations

Tags are too large

Current implementation adds all EC2 tag keys and values, which might significantly influence the size of pushed logs. Sometimes the use case might only require a few tags (or just one). It naturally leads us to ask a question about a possible improvement: introduction of the whitelist for tag keys to attach. Similarly, another use case might require using a blacklist.

As you can see, there are a few possible improvements. However I would like to propose doing so in another PR. If you agree, the only thing I would like to bring more attention to is: how do we make the configuration options open for changes in the future (in a backwards compatible way).

My way of thinking would be that tags_enabled true will work fine, even if we would consider possible future improvements.

  1. tags_enabled true enables the EC2 tag instances feature in general,
  2. if no additional tag related options are present, it uses all EC2 tags (as it does in this PR),
  3. --- points 4., 5. and 6. are related to the potential improvements in the future:
  4. we could add tag_key_whitelist key1,key2 (which would work only if tags_enabled true is set),
  5. we could add tag_key_blacklist key1,key2 (which would work only if tags_enabled true is set),
  6. providing both tag_key_whitelist and tag_key_blacklist options is invalid,

Tags refresh for instances built on Nitro system

Since the current implementation will only fetch the EC2 tags once, I guess it also means fluent-bit won't refresh the list from the metadata server.

First, let's take a look if AWS updates the tag values in metadata server:

If you add or remove an instance tag, the instance metadata is updated while the instance is running for instances built on the Nitro System, without needing to stop and then start the instance. For all other instances, to update the tags in the instance metadata, you must stop and then start the instance.

Source: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Tags.html#work-with-tags-in-IMDS

In order to get updated list of tags in the metadata server for 'usual' EC2 instances, we need to stop & start the instance. Obviously after the restart of the instance, we would also restart fluent-bit, so effectively it would have up-to-date list of tags. I think it's okay not to bring more complexity into the code and implement refreshes. It only would be required for the instances built on the Nitro system (tbh, I've never used one). Of course, it's an area for improvement, so at some point someone might want to do so. However, for the first iteration I propose to stick with the current implementation.

Testing

EC2 instance (with tags)

Here is my EC2 instance which I used for tests.

image

image

Fluent-bit configuration

[SERVICE] has default values.

[INPUT]
    name              tail
    path              /var/log/messages
    tag               messages
    Path_Key          filepath
    Read_from_Head    true

[FILTER]
    Name aws
    Match *
    imds_version v1
    az false
    ec2_instance_id true
    ec2_instance_type false
    private_ip false
    ami_id false
    account_id false
    hostname false
    vpc_id false
    tags_enabled true

[OUTPUT]
    Name cloudwatch_logs
    Match *
    region eu-west-2
    log_group_name fluentbit-test
    log_stream_prefix unknowninstance-
    log_stream_template $CUSTOMER_ID
    auto_create_group On

Cases

1. EC2 instance 'access to tags in instance metadata' enabled

image

ubuntu@ip-172-31-8-222:/etc/fluent-bit$ /usr/local/bin/fluent-bit -c /etc/fluent-bit/fluent-bit.conf
Fluent Bit v2.0.0
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/09/20 17:23:21] [ info] [fluent bit] version=2.0.0, commit=82c2b2f4bc, pid=1196
[2022/09/20 17:23:21] [ info] [storage] version=1.3.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2022/09/20 17:23:21] [ info] [cmetrics] version=0.4.0
[2022/09/20 17:23:21] [ info] [sp] stream processor started
[2022/09/20 17:23:21] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] worker #0 started
[2022/09/20 17:23:21] [ info] [input:tail:tail.1] inotify_fs_add(): inode=1855 watch_fd=1 name=/var/log/messages
[2022/09/20 17:23:21] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Creating log stream 5e5cfb96-66c9-497a-8e6d-1fea87783ba6 in log group fluentbit-test
[2022/09/20 17:23:21] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Created log stream 5e5cfb96-66c9-497a-8e6d-1fea87783ba6
^C[2022/09/20 17:24:24] [engine] caught signal (SIGINT)
[2022/09/20 17:24:24] [ warn] [engine] service will shutdown in max 5 seconds
[2022/09/20 17:24:24] [ info] [input] pausing tail.0
[2022/09/20 17:24:24] [ info] [input] pausing tail.1
[2022/09/20 17:24:24] [ info] [engine] service has stopped (0 pending tasks)
[2022/09/20 17:24:24] [ info] [input] pausing tail.0
[2022/09/20 17:24:24] [ info] [input] pausing tail.1
[2022/09/20 17:24:24] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=1855 watch_fd=1
[2022/09/20 17:24:24] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] thread worker #0 stopping...
[2022/09/20 17:24:24] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] thread worker #0 stopped

Logs were pushed to the CloudWatch:

image

2. EC2 instance 'access to tags in instance metadata' disabled

image

ubuntu@ip-172-31-8-222:~$ /usr/local/bin/fluent-bit -c /etc/fluent-bit/fluent-bit.conf
Fluent Bit v2.0.0
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/09/20 17:28:57] [ info] [fluent bit] version=2.0.0, commit=82c2b2f4bc, pid=1195
[2022/09/20 17:28:57] [ info] [storage] version=1.3.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2022/09/20 17:28:57] [ info] [cmetrics] version=0.4.0
[2022/09/20 17:28:57] [ warn] [filter_aws] tags not available in the EC2 instance metadata
[2022/09/20 17:28:57] [ info] [sp] stream processor started
[2022/09/20 17:28:57] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] worker #0 started
[2022/09/20 17:28:57] [ info] [input:tail:tail.1] inotify_fs_add(): inode=1855 watch_fd=1 name=/var/log/messages
[2022/09/20 17:28:57] [ warn] [record accessor] translation failed, root key=CUSTOMER_ID
[2022/09/20 17:28:57] [ warn] [record accessor] translation failed, root key=CUSTOMER_ID
... // A LOT OF WARNINGS DUE TO INVALID VARIABLE IN THE TEMPLATE //
^C[2022/09/20 17:30:14] [engine] caught signal (SIGINT)
[2022/09/20 17:30:14] [ warn] [engine] service will shutdown in max 5 seconds
[2022/09/20 17:30:14] [ info] [input] pausing tail.0
[2022/09/20 17:30:14] [ info] [input] pausing tail.1
[2022/09/20 17:30:14] [ info] [engine] service has stopped (0 pending tasks)
[2022/09/20 17:30:14] [ info] [input] pausing tail.0
[2022/09/20 17:30:14] [ info] [input] pausing tail.1
[2022/09/20 17:30:14] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=1855 watch_fd=1
[2022/09/20 17:30:14] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] thread worker #0 stopping...
[2022/09/20 17:30:14] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] thread worker #0 stopped

Logs pushed to the CloudWatch:

image

(Note that log streams were created based on the prefix from configuration.)

image

As expected, since instance metadata server has no tags, there are also no EC2 tags in the CloudWatch.

3. EC2 instance 'access to tags in instance metadata' enabled, 0 tags on the instance

It tests an edge case where there are no tags to be found in the metadata server.

image

image

ubuntu@ip-172-31-8-222:/etc/fluent-bit$ /usr/local/bin/fluent-bit -c /etc/fluent-bit/fluent-bit.conf
Fluent Bit v2.0.0
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/09/20 17:35:11] [ info] [fluent bit] version=2.0.0, commit=82c2b2f4bc, pid=1188
[2022/09/20 17:35:11] [ info] [storage] version=1.3.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2022/09/20 17:35:11] [ info] [cmetrics] version=0.4.0
[2022/09/20 17:35:11] [ warn] [filter_aws] tags not available in the EC2 instance metadata
[2022/09/20 17:35:11] [ info] [sp] stream processor started
[2022/09/20 17:35:11] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] worker #0 started
[2022/09/20 17:35:11] [ info] [input:tail:tail.1] inotify_fs_add(): inode=1855 watch_fd=1 name=/var/log/messages
[2022/09/20 17:35:11] [ warn] [record accessor] translation failed, root key=CUSTOMER_ID
[2022/09/20 17:35:11] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Creating log stream unknowninstance-messages in log group fluentbit-test
[2022/09/20 17:35:12] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Created log stream unknowninstance-messages
[2022/09/20 17:35:11] [ warn] [record accessor] translation failed, root key=CUSTOMER_ID

^C[2022/09/20 17:36:44] [engine] caught signal (SIGINT)
[2022/09/20 17:36:44] [ warn] [engine] service will shutdown in max 5 seconds
[2022/09/20 17:36:44] [ info] [input] pausing tail.0
[2022/09/20 17:36:44] [ info] [input] pausing tail.1
[2022/09/20 17:36:44] [ info] [engine] service has stopped (0 pending tasks)
[2022/09/20 17:36:44] [ info] [input] pausing tail.0
[2022/09/20 17:36:44] [ info] [input] pausing tail.1
[2022/09/20 17:36:44] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=1855 watch_fd=1
[2022/09/20 17:36:44] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] thread worker #0 stopping...
[2022/09/20 17:36:44] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] thread worker #0 stopped

image

Debug log output from testing the change + Valgrind output

ubuntu@ip-172-31-8-222:~/fluent-bit/build$ cat /etc/fluent-bit/fluent-bit.conf
[SERVICE]
    flush        1
    daemon       Off
    log_level    debug
    parsers_file parsers.conf
    plugins_file plugins.conf
    http_server  Off
    http_listen  0.0.0.0
    http_port    2020
    storage.metrics on

[INPUT]
    name              tail
    path              /home/ubuntu/fluentbit-test-log
    tag               messages
    Path_Key          filepath
    Read_from_Head    true

[FILTER]
    Name aws
    Match *
    imds_version v1
    az false
    ec2_instance_id true
    ec2_instance_type false
    private_ip false
    ami_id false
    account_id false
    hostname false
    vpc_id false
    tags_enabled true

[OUTPUT]
    name  stdout
    match *

ubuntu@ip-172-31-8-222:~/fluent-bit/build$ cat /home/ubuntu/fluentbit-test-log
hello world
my name is mateusz

ubuntu@ip-172-31-8-222:~/fluent-bit/build$ valgrind --leak-check=full ./bin/fluent-bit -c /etc/fluent-bit/fluent-bit.conf
==49519== Memcheck, a memory error detector
==49519== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==49519== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==49519== Command: ./bin/fluent-bit -c /etc/fluent-bit/fluent-bit.conf
==49519==
Fluent Bit v2.0.0
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/09/20 19:35:14] [ info] Configuration:
[2022/09/20 19:35:14] [ info]  flush time     | 1.000000 seconds
[2022/09/20 19:35:14] [ info]  grace          | 5 seconds
[2022/09/20 19:35:14] [ info]  daemon         | 0
[2022/09/20 19:35:14] [ info] ___________
[2022/09/20 19:35:14] [ info]  inputs:
[2022/09/20 19:35:14] [ info]      tail
[2022/09/20 19:35:14] [ info] ___________
[2022/09/20 19:35:14] [ info]  filters:
[2022/09/20 19:35:14] [ info]      aws.0
[2022/09/20 19:35:14] [ info] ___________
[2022/09/20 19:35:14] [ info]  outputs:
[2022/09/20 19:35:14] [ info]      stdout.0
[2022/09/20 19:35:14] [ info] ___________
[2022/09/20 19:35:14] [ info]  collectors:
[2022/09/20 19:35:14] [ info] [fluent bit] version=2.0.0, commit=6e70078d00, pid=49519
[2022/09/20 19:35:14] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2022/09/20 19:35:14] [ info] [storage] version=1.3.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2022/09/20 19:35:14] [ info] [cmetrics] version=0.4.0
[2022/09/20 19:35:14] [debug] [tail:tail.0] created event channels: read=21 write=22
[2022/09/20 19:35:14] [debug] [input:tail:tail.0] flb_tail_fs_inotify_init() initializing inotify tail input
[2022/09/20 19:35:14] [debug] [input:tail:tail.0] inotify watch fd=27
[2022/09/20 19:35:14] [debug] [input:tail:tail.0] scanning path /home/ubuntu/fluentbit-test-log
[2022/09/20 19:35:14] [debug] [input:tail:tail.0] inode=308488 with offset=0 appended as /home/ubuntu/fluentbit-test-log
[2022/09/20 19:35:14] [debug] [input:tail:tail.0] scan_glob add(): /home/ubuntu/fluentbit-test-log, inode 308488
[2022/09/20 19:35:14] [debug] [input:tail:tail.0] 1 new files found on path '/home/ubuntu/fluentbit-test-log'
[2022/09/20 19:35:14] [debug] [http_client] not using http_proxy for header
[2022/09/20 19:35:14] [debug] [filter:aws:aws.0] Using IMDSv1
[2022/09/20 19:35:14] [debug] [http_client] server 169.254.169.254:80 will close connection #29
[2022/09/20 19:35:14] [debug] [filter:aws:aws.0] imds metadata request http_do=0; /latest/meta-data/instance-id/, status_code: 200, response: i-0e66fc7f9809d7168
[2022/09/20 19:35:14] [debug] [http_client] not using http_proxy for header
[2022/09/20 19:35:14] [debug] [filter:aws:aws.0] Using IMDSv1
[2022/09/20 19:35:14] [debug] [http_client] server 169.254.169.254:80 will close connection #29
[2022/09/20 19:35:14] [debug] [filter:aws:aws.0] imds metadata request http_do=0; /latest/meta-data/tags/instance, status_code: 200, response: CUSTOMER_ID
Name
[2022/09/20 19:35:14] [debug] [filter_aws] tag found: CUSTOMER_ID (len=11)
[2022/09/20 19:35:14] [debug] [filter_aws] tag found: Name (len=4)
[2022/09/20 19:35:14] [debug] [http_client] not using http_proxy for header
[2022/09/20 19:35:14] [debug] [filter:aws:aws.0] Using IMDSv1
[2022/09/20 19:35:14] [debug] [http_client] server 169.254.169.254:80 will close connection #29
[2022/09/20 19:35:14] [debug] [filter:aws:aws.0] imds metadata request http_do=0; /latest/meta-data/tags/instance/CUSTOMER_ID, status_code: 200, response: 373ae935-121e-4a4d-9d93-9c96e7917452
[2022/09/20 19:35:14] [debug] [http_client] not using http_proxy for header
[2022/09/20 19:35:14] [debug] [filter:aws:aws.0] Using IMDSv1
[2022/09/20 19:35:14] [debug] [http_client] server 169.254.169.254:80 will close connection #29
[2022/09/20 19:35:14] [debug] [filter:aws:aws.0] imds metadata request http_do=0; /latest/meta-data/tags/instance/Name, status_code: 200, response: mwarzynski-fluentbit-dev
[2022/09/20 19:35:14] [debug] [http_client] not using http_proxy for header
[2022/09/20 19:35:14] [debug] [stdout:stdout.0] created event channels: read=29 write=30
[2022/09/20 19:35:14] [ info] [output:stdout:stdout.0] worker #0 started
[2022/09/20 19:35:14] [debug] [router] match rule tail.0:stdout.0
[2022/09/20 19:35:14] [ info] [sp] stream processor started
[2022/09/20 19:35:14] [debug] [input chunk] update output instances with new chunk size diff=491
[2022/09/20 19:35:14] [debug] [input:tail:tail.0] [static files] processed 31b
[2022/09/20 19:35:14] [debug] [input:tail:tail.0] inode=308488 file=/home/ubuntu/fluentbit-test-log promote to TAIL_EVENT
[2022/09/20 19:35:14] [ info] [input:tail:tail.0] inotify_fs_add(): inode=308488 watch_fd=1 name=/home/ubuntu/fluentbit-test-log
[2022/09/20 19:35:14] [debug] [input:tail:tail.0] [static files] processed 0b, done
[2022/09/20 19:35:15] [debug] [task] created task=0x54eb120 id=0 OK
[2022/09/20 19:35:15] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
==49519== Warning: client switching stacks?  SP change: 0x6ff2878 --> 0x54f15d0
==49519==          to suppress, use: --max-stackframe=28316328 or greater
==49519== Warning: client switching stacks?  SP change: 0x54f1528 --> 0x6ff2878
==49519==          to suppress, use: --max-stackframe=28316496 or greater
==49519== Warning: client switching stacks?  SP change: 0x6ff2878 --> 0x54f1528
==49519==          to suppress, use: --max-stackframe=28316496 or greater
==49519==          further instances of this message will not be shown.
[0] messages: [1663702514.813338620, {"filepath"=>"/home/ubuntu/fluentbit-test-log", "log"=>"hello world", "ec2_instance_id"=>"i-0e66fc7f9809d7168", "CUSTOMER_ID"=>"373ae935-121e-4a4d-9d93-9c96e7917452", "Name"=>"mwarzynski-fluentbit-dev"}]
[1] messages: [1663702514.819949906, {"filepath"=>"/home/ubuntu/fluentbit-test-log", "log"=>"my name is mateusz", "ec2_instance_id"=>"i-0e66fc7f9809d7168", "CUSTOMER_ID"=>"373ae935-121e-4a4d-9d93-9c96e7917452", "Name"=>"mwarzynski-fluentbit-dev"}]
[2022/09/20 19:35:15] [debug] [out flush] cb_destroy coro_id=0
[2022/09/20 19:35:15] [debug] [task] destroy task=0x54eb120 (task_id=0)
^C[2022/09/20 19:35:52] [engine] caught signal (SIGINT)
[2022/09/20 19:35:52] [ warn] [engine] service will shutdown in max 5 seconds
[2022/09/20 19:35:52] [ info] [input] pausing tail.0
[2022/09/20 19:35:52] [ info] [engine] service has stopped (0 pending tasks)
[2022/09/20 19:35:52] [ info] [input] pausing tail.0
[2022/09/20 19:35:52] [debug] [input:tail:tail.0] inode=308488 removing file name /home/ubuntu/fluentbit-test-log
[2022/09/20 19:35:52] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=308488 watch_fd=1
[2022/09/20 19:35:52] [ info] [output:stdout:stdout.0] thread worker #0 stopping...
[2022/09/20 19:35:52] [ info] [output:stdout:stdout.0] thread worker #0 stopped
==49519==
==49519== HEAP SUMMARY:
==49519==     in use at exit: 0 bytes in 0 blocks
==49519==   total heap usage: 3,698 allocs, 3,698 frees, 2,090,942 bytes allocated
==49519==
==49519== All heap blocks were freed -- no leaks are possible
==49519==
==49519== For lists of detected and suppressed errors, rerun with: -s
==49519== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Enter [N/A] in the box, if an item is not applicable to your change.

Testing Before we can approve your change; please submit the following in a comment:

  • [x] Example configuration file for the change
  • [x] Debug log output from testing the change
  • [x] Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • [N/A] Attached local packaging test output showing all targets (including any new ones) build.

Documentation

  • [x] Documentation required for this feature (PR created in the corresponding repository: https://github.com/fluent/fluent-bit-docs/pull/897)

Backporting

  • [N/A] Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

mwarzynski avatar Sep 17 '22 17:09 mwarzynski

@mwarzynski Thank you for contributing this!

I have read the description of the PR and I agree with your thinking/plan at a high level. I think simply fetching all tags on startup without refresh is a good start.

One note on the config options:

we could add tag_key_whitelist key1,key2 (which would work only if tags true is set), we could add tag_key_blacklist key1,key2 (which would work only if tags true is set), providing both tag_key_whitelist and tag_key_blacklist options is invalid,

I'd prefer include_tags and exclude_tags

PettitWesley avatar Sep 19 '22 17:09 PettitWesley

I updated the PR to follow the style guide and attached debug logs (with Valgrind). I also created a separate PR with the documentation for this feature (https://github.com/fluent/fluent-bit-docs/pull/897).

In my opinion, at this point this PR should be ready for review. @PettitWesley it would be awesome to get your feedback on the actual implementation!

I force pushed my changes, but since this PR is 'ready for review', going forward every new change will be added as a new commit.

Lastly, there is one thing that I would like to add: unit tests. However there were no unit tests for this module, so the effort to write ones is slightly higher (can't copy paste code used for already existing metadata calls :sweat_smile: ). I would be grateful for any suggestions how I could approach adding the unit test coverage.

mwarzynski avatar Sep 20 '22 20:09 mwarzynski

@mwarzynski I performed some initial review, you may ping me once all of these comments are addressed for the next round of review.

Thanks!

PettitWesley avatar Sep 21 '22 23:09 PettitWesley

@PettitWesley Thank your for the comprehensive code review. I am ready for the second round. :rocket:

mwarzynski avatar Sep 22 '22 14:09 mwarzynski

I've made an attempt to write tests for this feature: https://github.com/mwarzynski/fluent-bit/pull/1

  1. It uses Python3 to mock the HTTP server (not sure if it's okay to add this dependency / and how to start it up in a sane way),
  2. I have no clue how to properly modify CMakeFile to modify the host/port for the tests,
  3. Uses sleeps to synchronize (flakiness),
  4. It's a draft, so code quality is meh,
  5. but IT WORKS

mwarzynski avatar Sep 22 '22 19:09 mwarzynski

@mwarzynski Sorry I should have pinged this before, if possible, it would be ideal if we can re-use the stuff built by @matthewfala for mocking IMDS: https://github.com/fluent/fluent-bit/blob/master/tests/internal/aws_credentials_ec2.c#L92

https://github.com/fluent/fluent-bit/blob/master/tests/internal/aws_client_mock.c

It uses Python3 to mock the HTTP server (not sure if it's okay to add this dependency / and how to start it up in a sane way),

If possible I think we'd prefer to use the above approach since it means we can easily run the tests in the existing CI/CD and locally. Adding python and mocking a server could complicate that. I'm open to considering it but can you take a look at the above first please?

PettitWesley avatar Sep 26 '22 17:09 PettitWesley

Please feel free to reach out to me if you have any questions on how to use the AWS networking mock api. Please just add @matthewfala to your messages to make sure I see your comments.

matthewfala avatar Sep 26 '22 18:09 matthewfala

@matthewfala Would you be willing to provide a guidance how I should approach using the mentioned mock api for the aws filter tests? (Can I really use the mock api or rather you meant it's an example?) By guidance, I mean a few high level steps which you would've taken in order to properly implement these tests.

Based on what I understand, this "mock api" is rather an example how mocking http server could be implemented rather something ready to be used in filter_aws tests. At least I think this way, because the mock is made specifically for 'aws_client' whereas filter_aws makes the http requests from scratch (e.g. uses flb_upstream_create). In other words, there is no dependency injection, therefore mocks won't work without changing the code of filter_aws. It implies that first thing we need to do is to add another abstraction layer for the http client, which should be used inside filter_aws (such that it is mockable).

Adding an abstraction layer for the http client 'module' in the filter aws also means that we are about to create another 'library' for the EC2 metadata server client -- possibly called flb_aws_ec2_metadata or expand the flb_aws_imdbs (include/fluent-bit/aws/flb_aws_imds.h). Basically we need to have a separate component which will be responsible only for making the http requests.

Third part is how to actually define if mock should be used inside the filter_aws. In my original approach with Python, I simply overrode the header definitions to point to another address. I guess we could do something similar here too? Pseudocode:

cb_aws_init: // or other func

if (TESTS_RUNNING) {
  client = flb_aws_ec2_metadata_client_with_mock(...)
} else {
  client = flb_aws_ec2_metadata(...)
}

client->list_tags(&response)

I don't understand yet how the actual http mock works, but this should be pretty much transparent. Ideally, we probably should even be able to reuse the bare http mock between multiple test components.

To sum up, based on what I understand, the way to add tests with http client mocked in C:

  1. filter_aws: move out the http requests related code to module, just call the constructor flb_aws_ec2_metadata
  2. mock: add another constructor to create a mocked flb_aws_ec2_metadata (e.g. flb_aws_ec2_metadata_with_mock),
  3. filter_aws: add an if to use mocked client if some kind of header suggests we are running unit tests, (there is probably a better way to do this, but I don't know how we can modify the cb_aws_init signature to add such option)
  4. tests/filter_aws: try to reuse the setup_test code to reimplement the tests with new mock strategy

What do you think?


@PettitWesley Based on this quick overview of what needs to be done in order to add tests, I think this PR shouldn't require tests. Current implementation of filter_aws doesn't have any unit tests, so it probably makes sense to not hold this PR up to higher standards. I will try to create another PR which adds tests, but given how many changes need to be made (and how complex they are relatively to my skill in C), I can't promise on any timelines.

mwarzynski avatar Sep 28 '22 03:09 mwarzynski

Also, it would be nice if someone could confirm (through manual testing) that this functionality actually works on their EC2s. Although I've tested it a dozen times, there may be edge cases I haven't thought of.

mwarzynski avatar Sep 28 '22 04:09 mwarzynski

  1. filter_aws: move out the http requests related code to module, just call the constructor flb_aws_ec2_metadata

This one is actually easier than I thought, because there is a "duplicated" implementation for the IMDS client at include/fluent-bit/aws/flb_aws_imds.h.

mwarzynski avatar Sep 28 '22 07:09 mwarzynski

@matthewfala I have a work in progress branch for adding the unit tests with the http server mock in C: https://github.com/mwarzynski/fluent-bit/pull/2. Feel free to take a look and share your opinions about the general approach. Note that code-wise it's super dirty, because I try to make it work first (and only then iterate towards making it ready to push for the review).

mwarzynski avatar Sep 28 '22 08:09 mwarzynski

If we would enable all options for the AWS Filter, here is http server which allows to run & check the filter locally (not inside the EC2):

#!/usr/bin/env python3

from http.server import BaseHTTPRequestHandler, HTTPServer


class HTTPHandler(BaseHTTPRequestHandler):
    def _send_response(self, status_code: int = 200, body: str = None, content_type: str = "text/plain"):
        self.send_response(status_code)
        if body:
            self.send_header("Content-Type", content_type)
            self.send_header("Content-Length", len(body))
        self.end_headers()
        if body:
            self.wfile.write(body.encode("utf-8"))

    def do_GET(self):
        if self.path == "/":
            self._send_response()

        elif self.path == "/latest/meta-data/instance-id/":
            self._send_response(body="i-0e66fc7f9809d7168")

        elif self.path == "/latest/meta-data/instance-type/":
            self._send_response(body="t2.micro")

        elif self.path == "/latest/meta-data/local-ipv4/":
            self._send_response(body="10.158.112.84")

        elif self.path == "/latest/meta-data/mac/":
            self._send_response(body="00:00:5e:00:53:af")

        elif self.path == "/latest/meta-data/network/interfaces/macs/00:00:5e:00:53:af/vpc-id/":
            self._send_response(body="eth0")

        elif self.path == "/latest/meta-data/network/interfaces/macs/eth0/vpc-id/":
            self._send_response(body="vpc-12313213")

        elif self.path == "/latest/meta-data/ami-id/":
            self._send_response(body="ami-5fb8c835")

        elif self.path == "/latest/dynamic/instance-identity/document/":
            self._send_response(
                body="""{
    "devpayProductCodes" : null,
    "marketplaceProductCodes" : [ "1abc2defghijklm3nopqrs4tu" ],
    "availabilityZone" : "us-east-1a",
    "privateIp" : "10.158.112.84",
    "version" : "2017-09-30",
    "instanceId" : "i-1234567890abcdef0",
    "billingProducts" : null,
    "instanceType" : "t2.micro",
    "accountId" : "123456789012",
    "imageId" : "ami-5fb8c835",
    "pendingTime" : "2016-11-19T16:32:11Z",
    "architecture" : "x86_64",
    "kernelId" : null,
    "ramdiskId" : null,
    "region" : "us-east-1"
}""",
                content_type="application/json",
            )

        elif self.path == "/latest/meta-data/hostname/":
            self._send_response(body="ip-10-158-112-84.us-west-2.compute.internal")

        elif self.path == "/latest/meta-data/placement/availability-zone/":
            self._send_response(body="us-east-1a")

        elif self.path == "/latest/meta-data/tags/instance":
            # If there are 0 tags, AWS returns a 404.
            # self._send_response(status_code=404)
            self._send_response(body="Name\nCUSTOMER_ID")

        elif self.path == "/latest/meta-data/tags/instance/Name":
            self._send_response(body="mwarzynski-fluentbit-dev")

        elif self.path == "/latest/meta-data/tags/instance/CUSTOMER_ID":
            self._send_response(body="70ec5c04-3a6e-11ed-a261-0242ac120002")

        else:
            self._send_response(status_code=500)


def run():
    httpd = HTTPServer(("", 8000), HTTPHandler)
    try:
        httpd.serve_forever()
    except KeyboardInterrupt:
        pass
    httpd.server_close()


if __name__ == "__main__":
    run()

mwarzynski avatar Sep 28 '22 09:09 mwarzynski

Do you know what I need to do in order to satisfy the documentation requirement? Why is the docs-required still there?

mwarzynski avatar Sep 28 '22 09:09 mwarzynski

@mwarzynski Thanks for the doc PR, that's all that is needed on that front: https://github.com/fluent/fluent-bit-docs/pull/897

I will review that and this soon. I was on vacation last week.

PettitWesley avatar Oct 03 '22 20:10 PettitWesley

@mwarzynski Before merge, this should be two commits probably. One for all aws filter code changes, and one for the testing changes.

PettitWesley avatar Oct 03 '22 21:10 PettitWesley

Before merge, this should be two commits probably. One for all aws filter code changes, and one for the testing changes.

@PettitWesley I will rebase on latest master and squash all commits when the end state is reached (no more review comments / all conversation threads are resolved).

As for the tests, feel free to check my tests implementation and provide your opinion, but I'm afraid the PR of the tests contains too many changes to squeeze in here. (PR: https://github.com/mwarzynski/fluent-bit/pull/2)

mwarzynski avatar Oct 05 '22 23:10 mwarzynski

I have two Valgrind complaints in the current version.

==35884== Thread 2 flb-pipeline:
==35884== Conditional jump or move depends on uninitialised value(s)
==35884==    at 0x483BC98: strlen (vg_replace_strmem.c:459)
==35884==    by 0x4E9AF75: __vfprintf_internal (vfprintf-internal.c:1688)
==35884==    by 0x4EAC9C5: __vsnprintf_internal (vsnprintf.c:114)
==35884==    by 0x18C7E8: flb_log_construct (flb_log.c:410)
==35884==    by 0x18CA53: flb_log_print (flb_log.c:469)
==35884==    by 0x5D9186: get_ec2_tag_keys (aws.c:538)
==35884==    by 0x5D94D6: get_ec2_tags (aws.c:622)
==35884==    by 0x5D98C3: get_ec2_metadata (aws.c:731)
==35884==    by 0x5D805A: cb_aws_init (aws.c:187)
==35884==    by 0x1BBC72: flb_filter_init_all (flb_filter.c:511)
==35884==    by 0x1E4898: flb_engine_start (flb_engine.c:714)
==35884==    by 0x18B6E9: flb_lib_worker (flb_lib.c:629)
==35884== Conditional jump or move depends on uninitialised value(s)
==35884==    at 0x483BC98: strlen (vg_replace_strmem.c:459)
==35884==    by 0x4E9AF75: __vfprintf_internal (vfprintf-internal.c:1688)
==35884==    by 0x4EAC9C5: __vsnprintf_internal (vsnprintf.c:114)
==35884==    by 0x1A1D48: flb_sds_printf (flb_sds.c:429)
==35884==    by 0x5D9334: get_ec2_tag_values (aws.c:586)
==35884==    by 0x5D94F0: get_ec2_tags (aws.c:626)
==35884==    by 0x5D98C3: get_ec2_metadata (aws.c:731)
==35884==    by 0x5D805A: cb_aws_init (aws.c:187)
==35884==    by 0x1BBC72: flb_filter_init_all (flb_filter.c:511)
==35884==    by 0x1E4898: flb_engine_start (flb_engine.c:714)
==35884==    by 0x18B6E9: flb_lib_worker (flb_lib.c:629)
==35884==    by 0x4853EA6: start_thread (pthread_create.c:477)

@PettitWesley Do you have any suggestions how I could approach fixing these?

mwarzynski avatar Oct 05 '22 23:10 mwarzynski

@mwarzynski

As for the tests, feel free to check my tests implementation and provide your opinion, but I'm afraid the PR of the tests contains too many changes to squeeze in here. (PR: https://github.com/mwarzynski/fluent-bit/pull/2)

I understand it will make the PR big, but its best to have the new feature and tests all in the same PR. That way the CI will auto run them and we can validate it works. And check the tests with the implementation. If you check some of my old PRs for new plugins, they are huge.

PettitWesley avatar Oct 06 '22 00:10 PettitWesley

I have two Valgrind complaints in the current version.

Sorry I am not sure what caused these @mwarzynski... if Valgrind reports no memory leaks then that's the main thing I care about.

Both seem to involve the use of ctx->tag_keys... I'm wondering if Valgrind incorrectly thinks the value there isn't set... that's my best guess.

PettitWesley avatar Oct 06 '22 00:10 PettitWesley

I understand it will make the PR big, but its best to have the new feature and tests all in the same PR.

I agree, ideally I would really like to attach tests in this PR too.

However, I am not concerned about changes in terms of LOC. I am worried about changes which I had to make in aws.c, because they change how Filter AWS makes http requests and handles tokens. In particular, to mock the requests in tests, we must use fluent-bit/aws/flb_aws_imds.h in aws.c. Since there are no integration tests, it's difficult for me to confirm everything works correctly afterwards. It's more than I am comfortable with right now. Apart from the code changes, I assume it needs discussion how to test it out properly (and manual tests for multiple scenarios).

That way the CI will auto run them and we can validate it works. And check the tests with the implementation.

I am on the same side and I would really like to have the tests in place. Furthermore, I am quite certain that we will add tests for this feature sooner or later. On the other hand, I also think we should note that:

  1. CI may validate if it works, but correctness might be checked manually in the short term, (under the assumption that Filter AWS part of the code won't change at all until unit tests are added),
  2. EC2 tags functionality effectively requires communication with external server which is defined outside of fluent-bit; QA should include checking out the functionality manually on the EC2 (at least once, before the merge); i.e. it requires manual work if unit tests are aligned with EC2 Metadata server; thus you probably need to test this code at least once manually anyway, and therefore check the implementation regardless if there are unit tests or not;
  3. If there are no plans to change the AWS Filter in the short term, it won't matter if we add unit tests in the same PR or the very next one; (what's important is that unit tests must be added at some point)

To sum up, if asked, I would be fine with including unit tests in this PR. However, I don't feel comfortable right now with including them here due to changing the untested parts of the code and possible unnoticed regression. I am going to need some help from you in order to ensure it's implemented properly (which will take some time). For this reason, given no apparent downsides (under the condition that unit tests will be added as the very next thing) I would just limit the PR to the current scope.

@PettitWesley What do you think?

mwarzynski avatar Oct 06 '22 09:10 mwarzynski

@mwarzynski with regards to testing, my main concern is that prior to any release we must have BOTH the implementation fully completed and tested in the real world AND unit tests that can pass CI.

And as soon as we merge something into master, that is basically saying it's ready for release. I don't control the releases in this project, the other maintainers do that, so the communication to say- "hey we merged this but we can't have a release until we also merge another PR which isn't complete yet"- will not be well received. It would be very inconvenient for the other maintainers.

So what I can give you is two options:

  1. Keep the current PR against master. We can keep working on it slowly, reviewing and finalizing the feature first and then testing it in the real world. Once that's done we can then work on the unit tests. And then when it's all ready we can merge.
  2. I believe I have the power to create branches in this repo, so I could create a special purpose branch aws-ec2-tagging that you can first merge the feature into and then in a separate second PR merge the tests. And then we merge that branch into master.

I am worried about changes which I had to make in aws.c, because they change how Filter AWS makes http requests and handles tokens.

I am confused by this part- I took a brief look over your linked test PR on your fork and I was not able to understand why so many changes were needed in the aws.c filter code. Can you please help my understand that?

If you are having difficulties figuring out how to write unit tests, @matthewfala can help. Though please be aware he is very busy right now and might not get to this until next week.

PettitWesley avatar Oct 06 '22 20:10 PettitWesley

we must have BOTH the implementation fully completed and tested in the real world AND unit tests that can pass CI

Out of curiosity, so why the Filter AWS has 0 unit tests right now?

Keep the current PR against master.

Okay, let's do it. I hope that complexity of this PR won't be too large to handle.

I am worried about changes which I had to make in aws.c, because they change how Filter AWS makes http requests and handles tokens.

I am confused by this part- I took a brief look over your linked test PR on your fork and I was not able to understand why so many changes were needed in the aws.c filter code. Can you please help my understand that?

Unit tests require mocking http requests, right? In the current version of the code, making http requests is hardcoded (literally we create upstream and do the request inside Filter AWS), so it's not testable. Based on my understanding, we want to rely on mocking API implemented by @matthewfala. To get this working we need to use "his" implementation of the aws client (and imds client) in Filter AWS. Otherwise I don't see how we could inject the request mocks in tests. On the higher level, we need some kind of abstraction for the http client, so that we can mock their execution. So, again, either way, we must change the Filter AWS code to a substantial degree (it simply needs another abstraction layer). In particular, I had to delete all hardcoded logic related to making http requests in Filter AWS and instead use methods from fluent-bit/aws/flb_aws_imds.h. That's why there are so many changes.

If you are having difficulties figuring out how to write unit tests,

Unit tests are almost finished (code-wise). There is one thing left that I need to figure out. I don't know how to properly free the memory used for aws client mocks generator. Valgrind complains during the unit test execution. Apart from that, it works fine.

Since we are going to change the untested code, I will need your help with testing my changes manually, so that we can ensure it works correctly overall. I am quite sure that I don't have sufficient knowledge to properly test it out on my own.

Anyway, I will add the unit tests to this PR and ask for another review.

mwarzynski avatar Oct 07 '22 00:10 mwarzynski

I pushed the "new" version which uses flb_aws_imds http client and contains unit tests. It's still work in progress, because unit tests are not finished yet. Filter AWS functionality should work fine though (tested with Valgrind and "no leaks are possible"). In the end, I've decided to push my changes, so that maybe you could help me figure out what I'm doing wrong with the unit tests setup.

In particular, I haven't figured out how to properly free the memory used by aws client mock generator. Do you have any tips?

mwarzynski avatar Oct 10 '22 13:10 mwarzynski

I've pushed a fix for memory leaks in the unit tests for filter_aws. (I am not a fan of adding coupling in the aws.c code to the filter_aws tests, but I didn't find any other way to fix memory leaks.)

mwarzynski@3afd5d58db28:/fluent-bit-dev/build$ valgrind --track-origins=yes --leak-check=full ./bin/flb-rt-filter_aws

...

SUCCESS: All unit tests have passed.
==34840==
==34840== HEAP SUMMARY:
==34840==     in use at exit: 216 bytes in 2 blocks
==34840==   total heap usage: 2,328 allocs, 2,326 frees, 1,239,509 bytes allocated
==34840==
==34840== Thread 1:
==34840== 216 bytes in 2 blocks are definitely lost in loss record 1 of 1
==34840==    at 0x483877F: malloc (vg_replace_malloc.c:307)
==34840==    by 0x51EDD9: flb_malloc (flb_mem.h:79)
==34840==    by 0x522BA8: out_lib_flush (out_lib.c:177)
==34840==    by 0x166234: output_pre_cb_flush (flb_output.h:517)
==34840==    by 0x950E66: co_init (amd64.c:117)
==34840==
==34840== LEAK SUMMARY:
==34840==    definitely lost: 216 bytes in 2 blocks
==34840==    indirectly lost: 0 bytes in 0 blocks
==34840==      possibly lost: 0 bytes in 0 blocks
==34840==    still reachable: 0 bytes in 0 blocks
==34840==         suppressed: 0 bytes in 0 blocks
==34840==
==34840== For lists of detected and suppressed errors, rerun with: -s
==34840== ERROR SUMMARY: 28 errors from 23 contexts (suppressed: 0 from 0)

There still seem to be some memory leaks, but they are not coming from the code I've changed. However if you know how I could fix these, please let me know.


I didn't change anything which might have an impact on the behavior of the application itself, but let's check if it still works. Logs from running the fluent-bit binary:

mwarzynski@3afd5d58db28:/fluent-bit-dev/build$ valgrind ./bin/fluent-bit -c fluent-bit.conf
==34836== Memcheck, a memory error detector
==34836== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==34836== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==34836== Command: ./bin/fluent-bit -c fluent-bit.conf
==34836==
Fluent Bit v2.0.0
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/10/16 13:24:24] [ info] Configuration:
[2022/10/16 13:24:24] [ info]  flush time     | 1.000000 seconds
[2022/10/16 13:24:24] [ info]  grace          | 5 seconds
[2022/10/16 13:24:24] [ info]  daemon         | 0
[2022/10/16 13:24:24] [ info] ___________
[2022/10/16 13:24:24] [ info]  inputs:
[2022/10/16 13:24:24] [ info]      tail
[2022/10/16 13:24:24] [ info] ___________
[2022/10/16 13:24:24] [ info]  filters:
[2022/10/16 13:24:24] [ info]      aws.0
[2022/10/16 13:24:24] [ info] ___________
[2022/10/16 13:24:24] [ info]  outputs:
[2022/10/16 13:24:24] [ info]      stdout.0
[2022/10/16 13:24:24] [ info] ___________
[2022/10/16 13:24:24] [ info]  collectors:
[2022/10/16 13:24:24] [ info] [fluent bit] version=2.0.0, commit=905cfbad0b, pid=34836
[2022/10/16 13:24:24] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2022/10/16 13:24:24] [ info] [storage] version=1.3.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2022/10/16 13:24:24] [ info] [cmetrics] version=0.4.0
[2022/10/16 13:24:25] [debug] [tail:tail.0] created event channels: read=21 write=22
[2022/10/16 13:24:25] [debug] [input:tail:tail.0] flb_tail_fs_inotify_init() initializing inotify tail input
[2022/10/16 13:24:25] [debug] [input:tail:tail.0] inotify watch fd=27
[2022/10/16 13:24:25] [debug] [input:tail:tail.0] scanning path /fluent-bit-dev/build/input.log
==34836== Thread 2 flb-pipeline:
==34836== Conditional jump or move depends on uninitialised value(s)
==34836==    at 0x483BC98: strlen (vg_replace_strmem.c:459)
==34836==    by 0x4E9AF75: __vfprintf_internal (vfprintf-internal.c:1688)
==34836==    by 0x4EAC9C5: __vsnprintf_internal (vsnprintf.c:114)
==34836==    by 0x18C7E8: flb_log_construct (flb_log.c:410)
==34836==    by 0x18CA53: flb_log_print (flb_log.c:469)
==34836==    by 0x5D8A02: get_ec2_tag_keys (aws.c:392)
==34836==    by 0x5D8D56: get_ec2_tags (aws.c:476)
==34836==    by 0x5D912A: get_ec2_metadata (aws.c:577)
==34836==    by 0x5D8208: cb_aws_init (aws.c:207)
==34836==    by 0x1BBC72: flb_filter_init_all (flb_filter.c:511)
==34836==    by 0x1E4898: flb_engine_start (flb_engine.c:714)
==34836==    by 0x18B6E9: flb_lib_worker (flb_lib.c:629)
==34836==
[2022/10/16 13:24:25] [debug] [input:tail:tail.0] inode=18355229 with offset=0 appended as /fluent-bit-dev/build/input.log
[2022/10/16 13:24:25] [debug] [input:tail:tail.0] scan_glob add(): /fluent-bit-dev/build/input.log, inode 18355229
[2022/10/16 13:24:25] [debug] [input:tail:tail.0] 1 new files found on path '/fluent-bit-dev/build/input.log'
==34836== Conditional jump or move depends on uninitialised value(s)
==34836==    at 0x483BC98: strlen (vg_replace_strmem.c:459)
==34836==    by 0x4E9AF75: __vfprintf_internal (vfprintf-internal.c:1688)
==34836==    by 0x4EAC9C5: __vsnprintf_internal (vsnprintf.c:114)
==34836==    by 0x1A1D48: flb_sds_printf (flb_sds.c:429)
==34836==    by 0x5D8BB0: get_ec2_tag_values (aws.c:440)
==34836==    by 0x5D8D70: get_ec2_tags (aws.c:480)
==34836==    by 0x5D912A: get_ec2_metadata (aws.c:577)
==34836==    by 0x5D8208: cb_aws_init (aws.c:207)
==34836==    by 0x1BBC72: flb_filter_init_all (flb_filter.c:511)
==34836==    by 0x1E4898: flb_engine_start (flb_engine.c:714)
==34836==    by 0x18B6E9: flb_lib_worker (flb_lib.c:629)
==34836==    by 0x4853EA6: start_thread (pthread_create.c:477)
==34836==
[2022/10/16 13:24:25] [debug] [imds] using IMDSv1
[2022/10/16 13:24:25] [debug] [http_client] not using http_proxy for header
[2022/10/16 13:24:25] [debug] [filter:aws:aws.0] tag found: Name (len=4)
[2022/10/16 13:24:25] [debug] [filter:aws:aws.0] tag found: CUSTOMER_ID (len=11)
[2022/10/16 13:24:25] [debug] [imds] using IMDSv1
[2022/10/16 13:24:25] [debug] [http_client] not using http_proxy for header
[2022/10/16 13:24:25] [debug] [imds] using IMDSv1
[2022/10/16 13:24:25] [debug] [http_client] not using http_proxy for header
[2022/10/16 13:24:25] [debug] [stdout:stdout.0] created event channels: read=29 write=30
[2022/10/16 13:24:25] [debug] [router] match rule tail.0:stdout.0
[2022/10/16 13:24:25] [ info] [sp] stream processor started
[2022/10/16 13:24:25] [debug] [input chunk] update output instances with new chunk size diff=142
[2022/10/16 13:24:25] [ info] [output:stdout:stdout.0] worker #0 started
[2022/10/16 13:24:25] [debug] [input:tail:tail.0] [static files] processed 5b
[2022/10/16 13:24:25] [debug] [input:tail:tail.0] inode=18355229 file=/fluent-bit-dev/build/input.log promote to TAIL_EVENT
[2022/10/16 13:24:25] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18355229 watch_fd=1 name=/fluent-bit-dev/build/input.log
[2022/10/16 13:24:25] [debug] [input:tail:tail.0] [static files] processed 0b, done
[2022/10/16 13:24:26] [debug] [task] created task=0x533f990 id=0 OK
[2022/10/16 13:24:26] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] messages: [1665926665.527303267, {"filepath"=>"/fluent-bit-dev/build/input.log", "log"=>"test", "Name"=>"mwarzynski-fluentbit-dev", "CUSTOMER_ID"=>"70ec5c04-3a6e-11ed-a261-0242ac120002"}]
[2022/10/16 13:24:26] [debug] [out flush] cb_destroy coro_id=0
[2022/10/16 13:24:26] [debug] [task] destroy task=0x533f990 (task_id=0)
^C[2022/10/16 13:24:27] [engine] caught signal (SIGINT)
[2022/10/16 13:24:27] [ warn] [engine] service will shutdown in max 5 seconds
[2022/10/16 13:24:27] [ info] [input] pausing tail.0
[2022/10/16 13:24:28] [ info] [engine] service has stopped (0 pending tasks)
[2022/10/16 13:24:28] [ info] [input] pausing tail.0
[2022/10/16 13:24:28] [debug] [input:tail:tail.0] inode=18355229 removing file name /fluent-bit-dev/build/input.log
[2022/10/16 13:24:28] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=18355229 watch_fd=1
[2022/10/16 13:24:28] [ info] [output:stdout:stdout.0] thread worker #0 stopping...
[2022/10/16 13:24:28] [ info] [output:stdout:stdout.0] thread worker #0 stopped
==34836==
==34836== HEAP SUMMARY:
==34836==     in use at exit: 0 bytes in 0 blocks
==34836==   total heap usage: 1,600 allocs, 1,600 frees, 758,233 bytes allocated
==34836==
==34836== All heap blocks were freed -- no leaks are possible
==34836==
==34836== Use --track-origins=yes to see where uninitialised values come from
==34836== For lists of detected and suppressed errors, rerun with: -s
==34836== ERROR SUMMARY: 6 errors from 2 contexts (suppressed: 0 from 0)

"no leaks are possible" :rocket:

mwarzynski avatar Oct 16 '22 13:10 mwarzynski

@PettitWesley Would you please check out my PR once again?

mwarzynski avatar Oct 16 '22 13:10 mwarzynski

@PettitWesley @matthewfala Would you please help me with fixing the memory leaks in unit tests? What am I doing wrong?


EDIT: Okay, I think I've figured it out.

void set_output(char *val)
{
    pthread_mutex_lock(&result_mutex);
    if (output) {
        free(output); // fix: we need to free the output bytes returned from the fluentbit.
    }
    output = val;
    pthread_mutex_unlock(&result_mutex);
}
SUCCESS: All unit tests have passed.
==16624==
==16624== HEAP SUMMARY:
==16624==     in use at exit: 0 bytes in 0 blocks
==16624==   total heap usage: 5,796 allocs, 5,796 frees, 3,137,849 bytes allocated
==16624==
==16624== All heap blocks were freed -- no leaks are possible
==16624==

EDIT2: rebased on the latest master.

mwarzynski avatar Oct 28 '22 03:10 mwarzynski

@mwarzynski Can you please squash this to a single or a few commits as we are getting closer to merge?

PettitWesley avatar Nov 01 '22 23:11 PettitWesley

@mwarzynski Also I just tried testing it myself and it doesn't work:

FB output:

$ ./bin/fluent-bit -c ~/configs/aws.conf
Fluent Bit v2.0.2
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/11/01 23:43:19] [ info] [fluent bit] version=2.0.2, commit=0978166b3e, pid=10673
[2022/11/01 23:43:19] [ info] [storage] ver=1.3.0, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2022/11/01 23:43:19] [ info] [cmetrics] version=0.5.4
[2022/11/01 23:43:19] [ info] [ctraces ] version=0.2.5
[2022/11/01 23:43:19] [ info] [input:dummy:dummy.0] initializing
[2022/11/01 23:43:19] [ info] [input:dummy:dummy.0] storage_strategy='memory' (memory only)
[2022/11/01 23:43:19] [error] [filter:aws:aws.0] no value for tag keyt
[2022/11/01 23:43:19] [error] [filter:aws:aws.0] Could not retrieve ec2 metadata from IMDS on initialization
[2022/11/01 23:43:19] [ info] [sp] stream processor started
[2022/11/01 23:43:19] [ info] [output:stdout:stdout.0] worker #0 started
[0] dummy: [1667346199.293417816, {"message"=>"dummy", "az"=>"us-west-2b", "ec2_instance_id"=>"i-777777777777"}]
[0] dummy: [1667346200.293395570, {"message"=>"dummy", "az"=>"us-west-2b", "ec2_instance_id"=>"i-777777777777"}]

IMDS output:

$ curl http://169.254.169.254/latest/meta-data/tags/instance
Name
key

$ curl http://169.254.169.254/latest/meta-data/tags/instance/Name && echo && curl http://169.254.169.254/latest/meta-data/tags/instance/key && echo
DEV_DSK_EC2_4
value

State of the code when I checked it out:

$ git status && git log --oneline
On branch filter-aws-load-tags
Your branch is up to date with 'mateusz/filter-aws-load-tags'.

nothing to commit, working tree clean
0978166b3 (HEAD -> filter-aws-load-tags, mateusz/filter-aws-load-tags) test: fix memory leaks in filter aws tests
f40b0f0bc test: filter aws for tags values error status codes
5dfe78fab filter: fix error handling if tags can't be fetched
f467f2adf filter: rename client_aws to aws_ec2_filter_client
83bdcca57 filter: use `tmp` as return var for `flb_sds_printf`
ec6dbca6b filter: use `int` instead of `size_t` in `for` loop
601129b57 filter: aws client upstream do `flb_stream_disable_async_mode`
79b571099 tests: fix memory leaks in `filter_aws`
ab1a6a5ed tests: make `filter_aws.c` tests deterministic
324a5f8d5 tests: add newline char at the end of filter_aws.c
f7fdeec3f filter: fix freeing memory if `flb_sds_printf` failed

PettitWesley avatar Nov 01 '22 23:11 PettitWesley

Can you please squash this to a single or a few commits as we are getting closer to merge?

Yes, of course. I will do it today.

I just tried testing it myself and it doesn't work

I tried to set up the same settings on AWS and can't reproduce your issue.

$ curl http://169.254.169.254/latest/meta-data/tags/instance && echo
Name
key
$ curl http://169.254.169.254/latest/meta-data/tags/instance/Name && echo && curl http://169.254.169.254/latest/meta-data/tags/instance/key && echo
mwarzynski-fluent-bit-dev
value
[0] messages: [1667378469.488651128, {"filepath"=>"/home/centos/fluentbit-test-log", "log"=>"test", "ec2_instance_id"=>"i-0e77add13f0915d1f", "Name"=>"mwarzynski-fluent-bit-dev", "key"=>"value"}]

However, error message in your log output suggests that something went wrong while parsing tag keys.

[filter:aws:aws.0] no value for tag keyt

It tried to reach for keyt whereas it was expected as just key. I will try to reproduce it myself with my mocked Python EC2 metadata server.

mwarzynski avatar Nov 02 '22 08:11 mwarzynski

@PettitWesley I think I fixed the issue (one-line added, see the comment). I also squashed all the commits to just 1.

mwarzynski avatar Nov 02 '22 09:11 mwarzynski