Description of the issue

Addresses https://github.com/aws/amazon-cloudwatch-agent/issues/172

Frequently a specific EC2 deployment will have a set of Dimensions that should be associated with all published metrics
Support this in the metrics configuration

Description of changes

Add a global_dimensions property to the metric configuration to support this use-case

  {
    "agent": {
      "region": "us-east-1"
    },
    "metrics": {
      "append_dimensions": {
        "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
        "ImageId": "${aws:ImageId}",
        "InstanceId": "${aws:InstanceId}",
        "InstanceType": "${aws:InstanceType}"
      },
      "global_dimensions": {
        "Environment": "test",
        "DeploymentStack": "green"
      },
      "metrics_collected": {
        "mem": {
          "measurement": ["mem_used_percent"]
        },
        "disk": {
          "measurement": ["used_percent"],
          "resources": ["*"]
        }
      }
    }
  }

I opted not to use the existing append_dimensions property because:
- The code required is quite different for the two use-cases
- This could be a breaking change if people have non aws: values in the append_dimensions field
With the current implementation, any dimension values defined in global_dimensions override dimensions set for individual metrics. This can easily be altered.

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

I have added / adjusted tests for:

JSON to TOML translation of the configuration
Extended tests for BuildDimensions to include the cases where global dimensions are provided

Requirements

Before commit the code, please do the following steps.

Run make fmt and make fmt-sh
Run make linter

Are these instructions up-to-date? make linter fails completely and make fmt and make fmt-sh make loads of changes to files I've not touched.

Jan 27 '23 15:01 paulbrimicombe

Have you built the agent locally and ran this end-to-end? I think that's one thing we'll need to confirm before accepting the change

Jan 27 '23 18:01 SaxyPandaBear

Codecov Report

Merging #673 (8b1fc4f) into main (081cc0b) will increase coverage by 0.03%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #673      +/-   ##
==========================================
+ Coverage   56.87%   56.90%   +0.03%     
==========================================
  Files         364      376      +12     
  Lines       17049    17935     +886     
==========================================
+ Hits         9696    10206     +510     
- Misses       6802     7135     +333     
- Partials      551      594      +43

Impacted Files	Coverage Δ
translator/config/schema.go	`81.81% <ø> (ø)`
translator/totomlconfig/toTomlConfig.go	`72.72% <ø> (ø)`
plugins/outputs/cloudwatch/cloudwatch.go	`76.34% <100.00%> (+1.72%)`	:arrow_up:
...late/metrics/global_dimensions/globalDimensions.go	`100.00% <100.00%> (ø)`
...puts/logfile/tail/file_deleting_checker_windows.go	`36.36% <0.00%> (ø)`
...windows_event_log/wineventlog/wineventlogrecord.go	`62.50% <0.00%> (ø)`
plugins/inputs/win_perf_counters/pdh.go	`65.47% <0.00%> (ø)`
plugins/inputs/logfile/tmpfile_windows.go	`40.00% <0.00%> (ø)`
...nputs/windows_event_log/wineventlog/wineventlog.go	`36.40% <0.00%> (ø)`
...s/inputs/windows_event_log/wineventlog/sys_call.go	`47.36% <0.00%> (ø)`
... and 7 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Jan 31 '23 14:01 codecov-commenter

LGTM but looks like failed unit tests after the change. Maybe I didn't look hard enough at the unit test code. Once it's passing I'm good to approve

Feb 10 '23 16:02 SaxyPandaBear

Another interesting find I have found is that: we could leverage global_tags from telegraf. However, it would be override by tags within each plugin. Have we considered this option before? Will try to look the PR EOD.

Feb 10 '23 16:02 khanhntd

This could be a breaking change if people have non aws: values in the append_dimensions field

I took this at face value, but after reviewing with product today, I think I don't understand the concern here. Don't we fail config translation today if you supply a non aws: prefixed value in the outer append_dimensions configuration?

If there's no real concern about breaking existing customers, I think it makes sense to consolidate the logic into the existing append_dimensions field, though I do hate pushing for that so late after the fact. Realistically, the existing JSON structure is already confusing enough as it is, so I think we have to be diligent with knowing what warrants adding more bloat to it.

Separate thing to validate with the change e2e regardless of the above - it was brought to my attention that there are some metrics configurations that do not support an inner append_dimensions field. One thing that we'll need to verify is that the global dimensions are added regardless of that fact. For example, our statsd plugin does not allow for statsd-specific append dimensions, but it's likely that if you are trying to append global dimensions, you'll want that to be present

Feb 10 '23 22:02 SaxyPandaBear

LGTM but looks like failed unit tests after the change. Maybe I didn't look hard enough at the unit test code. Once it's passing I'm good to approve

Ah yes. Sorry I didn't spot these before (some of the TestSEH1Distribution tests always fail on my Mac so it's sometimes hard to spot genuine failures). I've pushed a fix for the broken tests now.

Feb 13 '23 12:02 paulbrimicombe

I took this at face value, but after reviewing with product today, I think I don't understand the concern here. Don't we fail config translation today if you supply a non aws: prefixed value in the outer append_dimensions configuration?

If there's no real concern about breaking existing customers, I think it makes sense to consolidate the logic into the existing append_dimensions field, though I do hate pushing for that so late after the fact. Realistically, the existing JSON structure is already confusing enough as it is, so I think we have to be diligent with knowing what warrants adding more bloat to it.

I totally agree about limiting the complexity of the configuration but I've tried supplying non aws: prefixed values and they seem to be silently dropped. Certainly no tests fail if you e.g. add some in here: https://github.com/aws/amazon-cloudwatch-agent/blob/main/translator/config/sampleSchema/validLinuxMetrics.json#L106 and I can't find any code in the translator/translate/metrics/append_dimensions file tree that would reject a value that doesn't match one of the currently-supported values.

Feb 13 '23 12:02 paulbrimicombe

Separate thing to validate with the change e2e regardless of the above - it was brought to my attention that there are some metrics configurations that do not support an inner append_dimensions field. One thing that we'll need to verify is that the global dimensions are added regardless of that fact. For example, our statsd plugin does not allow for statsd-specific append dimensions, but it's likely that if you are trying to append global dimensions, you'll want that to be present

I can't see any reason why the global dimensions wouldn't be added provided they still use the CloudWatch output plugin but it would be good to check end-to-end if possible.

Feb 13 '23 12:02 paulbrimicombe

and I can't find any code in the translator/translate/metrics/append_dimensions file tree that would reject a value that doesn't match one of the currently-supported values.

Ok. I'll take a look today to try to trace through where they're getting silently dropped.

Feb 13 '23 14:02 SaxyPandaBear

--- FAIL: TestBackoffRetries (66.47s)
    cloudwatch_test.go:807: 
        	Error Trace:	cloudwatch_test.go:807
        	Error:      	"1s" is not greater than "1.022734529s"
        	Test:       	TestBackoffRetries
FAIL

looks like a weird failure. I'm assuming this is flakiness and not related to your change. Will rerun the mac tests

Feb 13 '23 14:02 SaxyPandaBear

@paulbrimicombe so in the append_dimensions translation, we register a specific set of rules to apply. The key problem that is causing unrecognized values to be dropped is that there isn't a matching rule. It looks kind of messy and not fit for arbitrary strings, unfortunately. Because there is no child rule that extracts the arbitrary key-value pairs, they never make it into the translated config.

Feb 13 '23 15:02 SaxyPandaBear

The key problem that is causing unrecognized values to be dropped is that there isn't a matching rule. It looks kind of messy and not fit for arbitrary strings, unfortunately. Because there is no child rule that extracts the arbitrary key-value pairs, they never make it into the translated config.

@SaxyPandaBear I think this is all deliberate in that the original developers wanted to drop unsupported keys and the easiest way to do that was to iterate through the supported keys. It wouldn't be particularly difficult to extract any additional keys but my worry is that people who do have additional keys in their config already will suddenly have new dimensions appended to their metrics which would be a breaking change.

Feb 14 '23 09:02 paulbrimicombe

my worry is that people who do have additional keys in their config already will suddenly have new dimensions appended to their metrics which would be a breaking change

I'll start a conversation internally about it. My gut reaction is that if someone has configured an arbitrary key-value pair in their append dimensions and hasn't complained that they don't show up as dimensions on CW, they don't care either way. I'm a little less worried about this being a breaking change because it's meant for static strings, at least from a cost perspective. I can get confirmation to be sure, but I don't think that customers who do this should receive a spike in CW billing. If you start with [instanceID, autoScalingGroupName], there's already cardinality there so you're paying for unique metrics per instance ID and ASG name. Adding "foo=bar" to the dimension set is still going to produce the same number of metrics, just with another dimensions. At least that's what my gut tells me.

Back to the main point, I don't even really see an issue operationally. I just tested something out myself. I set up a config with just instance ID as the global append dimension:

{
        "agent": {
                "metrics_collection_interval": 60,
                "run_as_user": "root"
        },
        "metrics": {
                "append_dimensions": {
                        "InstanceId": "${aws:InstanceId}"
                },
                "metrics_collected": {
                        "mem": {
                                "measurement": [
                                        "mem_used_percent"
                                ],
                                "metrics_collection_interval": 60
                        }
                }
        }
}

Then I set up a dashboard widget for the mem used metric, and also set an alarm.

Afterwards, I added InstanceType as a dimension:

{
        "agent": {
                "metrics_collection_interval": 60,
                "run_as_user": "root"
        },
        "metrics": {
                "append_dimensions": {
                        "InstanceId": "${aws:InstanceId}",
                        "InstnaceType": "${aws:InstanceType}"
                },
                "metrics_collected": {
                        "mem": {
                                "measurement": [
                                        "mem_used_percent"
                                ],
                                "metrics_collection_interval": 60
                        }
                }
        }
}

And monitored the alarm / dashboard widget. The metrics still flowed to the alarm and widget. 14:49 UTC is when I restarted the agent with the updated config, so that's why I highlighted that timestamp in the metric Screenshot 2023-02-14 at 10 11 13 AM

Feb 14 '23 15:02 SaxyPandaBear

my worry is that people who do have additional keys in their config already will suddenly have new dimensions appended to their metrics which would be a breaking change

My gut reaction is that if someone has configured an arbitrary key-value pair in their append dimensions and hasn't complained that they don't show up as dimensions on CW, they don't care either way.

This was what the original bug report #172 was requesting / complaining about so some people obviously do care about this (including my organisation...).

I'm a little less worried about this being a breaking change because it's meant for static strings, at least from a cost perspective. I can get confirmation to be sure, but I don't think that customers who do this should receive a spike in CW billing. If you start with [instanceID, autoScalingGroupName], there's already cardinality there so you're paying for unique metrics per instance ID and ASG name. Adding "foo=bar" to the dimension set is still going to produce the same number of metrics, just with another dimensions. At least that's what my gut tells me.

I'm very confident that you're correct that this won't introduce additional costs for clients. But that's not my main worry (see commentary below).

Back to the main point, I don't even really see an issue operationally. I just tested something out myself. I set up a config with just instance ID as the global append dimension:

snip...

Then I set up a dashboard widget for the mem used metric, and also set an alarm.

Afterwards, I added InstanceType as a dimension:

snip...

And monitored the alarm / dashboard widget. The metrics still flowed to the alarm and widget.

All you have done here is confirmed the current behaviour — that unsupported append_dimensions keys are silently dropped when the JSON configuration is converted to TOML. If we suddenly start treating arbitrary static key / value pairs in append_dimensions as dimensions to be added to all metrics, client's alarms / metric queries / search expressions will be broken because every metric will have a different set of dimensions to those included previously.

In your second example, your mem_used_percent metric would have another dimension of InstnaceType: ${aws:InstanceType} added to it, meaning that both the alarm configuration and the dashboard widget would be broken (dimensions have to match exactly for CloudWatch alarms — if you used a very specific type of search expression, your dashboard widget might be OK, but most likely it wouldn't be).

This is why I think we should consider using append_dimensions as the configuration key for these global dimensions as a breaking change (some client's monitoring will be broken when they upgrade to the latest version of the CloudWatch Agent).

If any of this doesn't make sense / seems wrong @SaxyPandaBear , please let me know!

Feb 15 '23 08:02 paulbrimicombe

🤦 I had a typo in the config so it didn't append the other dimension. That's totally my bad. Okay that clears it up. I'll talk with others internally. Don't think there's a good way to figure out who would be impacted by this if we were to introduce the breaking change

Feb 15 '23 14:02 SaxyPandaBear

This PR was marked stale due to lack of activity.

Feb 23 '23 00:02 github-actions[bot]

My gut reaction is that if someone has configured an arbitrary key-value pair in their append dimensions and hasn't complained that they don't show up as dimensions on CW, they don't care either way.

I'm agree with Andrew on this with using append_dimensions as a replace for global_dimensions. However, I'm not align with the implementations

Global dimensions should be overriden by plugin dimensions , not other way around though
The implementation shares the same mechanism with global_tags; however, only the dimension's order is different so I'm still not sure the reason why we are not leverage it? (e.g using global_tags with non:AWS)

If others agrees with it then I'm fine with it, but this would be not a good behavior when introducing to other exporters in the future.

Feb 24 '23 15:02 khanhntd

My gut reaction is that if someone has configured an arbitrary key-value pair in their append dimensions and hasn't complained that they don't show up as dimensions on CW, they don't care either way.

I'm agree with Andrew on this with using append_dimensions as a replace for global_dimensions

@khanhntd to me, as a client of this application, this would definitely be a breaking change. My production dashboards / alarms / monitoring etc. would be broken by a routine update to the CloudWatch Agent for the same input configuration.

Feb 27 '23 08:02 paulbrimicombe

The implementation shares the same mechanism with global_tags; however, only the dimension's order is different so I'm still not sure the reason why we are not leverage it? (e.g using global_tags with non:AWS)

This is interesting. I'd completely forgotten about the Telegraf global_tags option (I'd even used it a long time ago in a different project 🤦 ).

There seems to be some half-built(?) support for use of the built-in Telegraf global_tags (e.g. translator/translate/globaltags/globaltags.go) but it's untested, undocumented and doesn't seem to actually work with the current version of the agent (e.g. if I add a global_tags entry to the top level of the JSON config file, it never ends up in the TOML file — the service starts and creates the TOML file but something is filtering it out before it gets there).

If we do use global_tags instead of this PR, there are some things we need to decide / work out:

Where global_tags should appear in the JSON. Currently it looks like it's being picked up from the root of the JSON -- should it really be in metrics instead?
According to the Telegraf docs, "Global tags are overriden [sic] by tags set by plugins." — is this the behaviour we want?
Do we want to have deterministic behaviour wrt. global vs local tags when there are too many tags / dimensions? I don't know whether we have access to the information we'd need in the output plugin (will we have to do anything to pass through the information about which tags are global?)
How would we test that this works?

Feb 27 '23 13:02 paulbrimicombe

Where global_tags should appear in the JSON. Currently it looks like it's being picked up from the root of the JSON -- should it really be in metrics instead?

Yes we can put a translator for it. (e.g if you have global dimension in JSON, put it in global tags). Therefore, you can put the global_dimensions as part of metrics and global_tags are not being used based on my research (internal and external)

According to the Telegraf docs, "Global tags are overriden [sic] by tags set by plugins." — is this the behaviour we want?

IMO, yes. Plugin dimension should be prioritized over global plugin. Since if the customers are not setting at plugin level, then there is no reason why the customers would expect it to begin with.

Do we want to have deterministic behaviour wrt. global vs local tags when there are too many tags / dimensions? I don't know whether we have access to the information we'd need in the output plugin (will we have to do anything to pass through the information about which tags are global?)

I have the same thought of limiting the global tags to put some constraint on it. However, there are two ways to avoid that:

Using aggregating dimensions which will generate more metrics but shimming down the dimensions. However, increasing the cost for customers
Telegraf also has a feature of tagdrop or tagfilter to filter out all the unwanted tags and can be used to filter global tags if global tags + plugin tags are over thirdty.

The difference is defining the default behavior for agent and the intention on which perspective (customer or us).

How would we test that this works?

The most and easiest way that I could think off would be writing the toml config for agent (e.g

[global_tags]
  user = "${USER}"

[[inputs.mem]]

)

Feb 27 '23 16:02 khanhntd

The reason why I would bring up global_tags would be:

Firstly, it won't be specific to any output plugins
Secondly, support dimensions as part of traces and logs in the future
Thirdly, reduce all the functional code that we want to rely on CloudWatch

Let's me know what are your thoughts about this too !

Feb 27 '23 16:02 khanhntd

This PR was marked stale due to lack of activity.

Mar 07 '23 00:03 github-actions[bot]

IMO, yes. Plugin dimension should be prioritized over global plugin. Since if the customers are not setting at plugin level, then there is no reason why the customers would expect it to begin with.

Sorry but I'm not sure what you mean by "if the customers are not setting at plugin level, then there is no reason why the customers would expect it to begin with". Bear in mind that customers may well not be in control of the output of plugins in use (e.g. they are from third-party tools etc.). The original requests for global dimensions was to allow clients to add arbitrary dimensions to all metrics, which also wouldn't have been set at the plugin level.

Telegraf also has a feature of tagdrop or tagfilter to filter out all the unwanted tags and can be used to filter global tags if global tags + plugin tags are over thirdty.

How can we know at runtime if global tags + plugin tags are over thirty? As I understand it, the Telegraf config for things like tagdrop and tagfilter are static and so can't change depending on the number of tags for a specific metric etc.

Firstly, it won't be specific to any output plugins

This I feel may be a drawback in that it might have unintended consequences for other output plugins. The feature request is for CloudWatch metrics — are we sure that other output plugins would want the same global tags applied (now and in future)?

Mar 07 '23 13:03 paulbrimicombe

Pragmatically I think we need to try to actually come to some decision on this because it's turning out to be taking a long time.

Do we think that using global_dimensions is the right thing to use? That would mean:

The additional dimensions would always be applied to all output plugins
We have less fine-grained control over what happens when there are too many dimensions (e.g. deterministically choosing whether we prioritise plugin or global dimensions etc.)
I worry a bit about testing (esp. for the conditions when the dimensions overflow)

Or, do we think using something like the code in the PR is the right thing to use? This would mean:

Additional dimensions are only added to CloudWatch metrics
We have fine-grained control over what happens when the number of dimensions overflows
We can unit test lots of the additional code
There is more code / more to maintain

## Notes on testing global_tags

From the code, it looks like global_tags is supported.
I can add unit tests to (e.g. by adding it to the JSON and TOML files here https://github.com/aws/amazon-cloudwatch-agent/blob/main/translator/totomlconfig/toTomlConfig_test.go#L166-L168) that prove that the translation works correctly (there are some changes required to the tests to make sure they actually test this properly)
On a real copy of the agent, however, any global_tags I add to the JSON don't get migrated to the TOML on startup for some reason. They seem to get stripped out by something. Any idea why that would be?

Mar 07 '23 14:03 paulbrimicombe

Hi @paulbrimicombe , Will come back to the PR within today. Sorry for taking a long time to respond though

Mar 15 '23 13:03 khanhntd

This PR was marked stale due to lack of activity.

Mar 23 '23 00:03 github-actions[bot]

This PR was marked stale due to lack of activity.

Apr 19 '23 00:04 github-actions[bot]

Hello @khanhntd 5 months have passed, can you please share with us what is the opinion of the CloudWatch team on this issue and the solution proposed in this PR?

I think the limitation this PR is trying to address would already help a lot for the folks monitoring Ec2 tasks, since the dimensions you are forcing your users to use, might not make sense for some (most?) use cases.

But even more necessary for ECS or Serverless applications.... for which the current interface of the CloudWatch Agent does not allow ANY global dimensions to be set! See: https://github.com/aws/amazon-cloudwatch-agent/issues/864

Moreover, in theory, one could add a CloudWatch agent to his on-promise stack if he wants to use CloudWatch as a monitoring solution. So the CloudWatch Agent should allow some kind of generic configuration, agnostic of the stack used, and I think this PR goes exactly in this direction.

In general, when you look at the CloudWatch Agent interface you can really feel it was designed around EC2 which make sense historically, but AWS should now try to support and facilitate way more than that because of the portfolio of products it offers 🙏

Oct 10 '23 12:10 etiennechabert

This PR was marked stale due to lack of activity.

Nov 10 '23 00:11 github-actions[bot]

Forced to "up" so the PR does not stale... cc @khanhntd

Nov 10 '23 00:11 etiennechabert

amazon-cloudwatch-agent
amazon-cloudwatch-agent copied to clipboard

Support global metric dimensions

Description of the issue

Description of changes

License

Tests

Requirements

Codecov Report

amazon-cloudwatch-agent amazon-cloudwatch-agent copied to clipboard

Support global metric dimensions

Description of the issue

Description of changes

License

Tests

Requirements

Codecov Report

amazon-cloudwatch-agent
amazon-cloudwatch-agent copied to clipboard