amazon-cloudwatch-agent
collectd GAUGE NaN converted into a 0
Describe the bug
When the collectd network output plugin reports "nan" for a GAUGE value, CWA adds a 0-value metric data point.
Steps to reproduce
I'm using the collectd tail input plugin to scrape for keywords. If the keyword does not appear in a 60s interval, the network output plugin reports a nan. If necessary, I can devise a degenerate test case for you.
What did you expect to see?
Drop it; do not report a metric value for that interval.
What did you see instead?
The CloudWatch metric graph clearly shows a zero where collectd reported a nan (not a zero). This zero is fictitious.
What version did you use?
/opt/aws/amazon-cloudwatch-agent$ cat bin/CWAGENT_VERSION
1.247354.0b251981
What config did you use?
/opt/aws/amazon-cloudwatch-agent$ cat bin/config.json
{
  "agent": {
    "run_as_user": XXX,
    "region": YYY,
    "debug": true
  },
  "metrics": {
    "metrics_collected": {
      "collectd": {
        "collectd_security_level": "none"
      },
      "disk": {
        "measurement": [
          "used_percent"
        ],
        "resources": [
          "/"
        ],
        "drop_device": true
      }
    }
  }
}
Environment
/opt/aws/amazon-cloudwatch-agent$ cat /etc/issue
Ubuntu 22.04 LTS \n \l
Additional context
I know that NaN is documented as not supported in CloudWatch: "special values (for example, NaN, +Infinity, -Infinity) are not supported."
https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutMetricData.html
However, NaN is Not a Number, which is most certainly != 0. I see that other plugins (prometheus_scraper for example) properly drop NaN.
~/Projects/amazon-cloudwatch-agent$ find . -type f -exec grep -H NaN {} \;
./plugins/inputs/prometheus_scraper/calculator.go: log.Printf("D! Drop metric with NaN or Inf value: %v", pm)
Why does collectd data not get the same treatment?
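For illustration, here is a minimal Go sketch of the kind of guard calculator.go applies; the function name and metric name below are placeholders of mine, not the agent's actual code:

package main

import (
    "fmt"
    "log"
    "math"
)

// dropNonFinite mirrors, in spirit only, the guard prometheus_scraper applies
// in calculator.go: log and skip NaN/Inf samples instead of forwarding them.
// This is an illustrative sketch, not the agent's actual collectd code path.
func dropNonFinite(name string, value float64) bool {
    if math.IsNaN(value) || math.IsInf(value, 0) {
        log.Printf("D! Drop metric with NaN or Inf value: %s=%v", name, value)
        return true
    }
    return false
}

func main() {
    // A collectd GAUGE that reported "nan" for the interval should be dropped.
    fmt.Println(dropNonFinite("tail_keyword_count", math.NaN())) // true: dropped
    fmt.Println(dropNonFinite("tail_keyword_count", 3))          // false: kept
}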
See for yourself in 60 seconds.
Different issue - amazon-cloudwatch-agent-ctl does not work out of the box in docker. It needs
sed -ir 's/\\-\\.mount/tmp.mount/' /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl
FYI, the top of the Dockerfile attached above includes its own instructions:
# syntax = docker/dockerfile:1.4
#
# docker build:
# DOCKER_BUILDKIT=1 docker build . -f collectd_cwa_Dockerfile.txt -t collectd
#
# docker run:
# docker run -e REGION=[region] -e SECRET_ACCESS_KEY=[secret access key] -e ACCESS_KEY_ID=[access key id] -d --rm --name collectd collectd
#
# get the hostname, to find in CloudWatch -> Metrics -> All Metrics -> Browse -> Search, as "CWAgent > host, instance, type, type_instance"
# docker exec collectd hostname
#
# get collectd csv data (after running for > 60s):
# docker exec collectd find /var/lib/collectd -type f -exec cat {} \;
#
# get the output of the collectd network plugin (ctrl-c to quit):
# docker exec collectd tcpdump -i lo -n udp port 25826 -X
#
# get the CWA debug log (ctrl-c to quit):
# docker exec collectd tail -f /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log
This should be enough to get the zeros uploading to CloudWatch using whatever credentials you prefer. Of course, the IAM permissions have to allow the relevant metric actions.
I've attempted to simplify the reproduction of this issue as much as possible. Please let me know if there's anything else I can provide. If this can't or won't be fixed in the agent, I'll have to re-engineer the metrics I collect, because it is critical for me to distinguish between NaN (something did not happen) and zero (something happened and experienced zero errors).
The main complication around NaN values is that the backend doesn't support them. https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutMetricData.html
Although the Value parameter accepts numbers of type Double, CloudWatch rejects values that are either too small or too large. Values must be in the range of -2^360 to 2^360. In addition, special values (for example, NaN, +Infinity, -Infinity) are not supported.
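For reference, a minimal Go sketch of a client-side check that mirrors those documented constraints; isPublishable is a hypothetical helper, not agent or SDK code:

package main

import (
    "fmt"
    "math"
)

// isPublishable is a hypothetical helper (not agent or SDK code) that mirrors
// the documented PutMetricData constraints: the value must be finite and fall
// within roughly -2^360 to 2^360.
func isPublishable(v float64) bool {
    limit := math.Pow(2, 360)
    return !math.IsNaN(v) && !math.IsInf(v, 0) && v >= -limit && v <= limit
}

func main() {
    fmt.Println(isPublishable(math.NaN()))       // false: NaN is rejected
    fmt.Println(isPublishable(math.Inf(1)))      // false: +Infinity is rejected
    fmt.Println(isPublishable(0))                // true: zero is a valid value
    fmt.Println(isPublishable(math.Pow(2, 400))) // false: out of range
}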
I understand NaN values are not supported in CloudWatch, so they should be dropped by the agent with no value uploaded. That way, alarms will treat this as missing data. https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html#alarms-and-missing-data
Sometimes, not every expected data point for a metric gets reported to CloudWatch. For example, this can happen when a connection is lost, a server goes down, or when a metric reports data only intermittently by design.
In my case "a metric reports data only intermittently by design." Currently, the NaN values are converted to zero and uploaded. NaN != 0
by definition. NaN = x
is always False
. https://en.wikipedia.org/wiki/NaN#Comparison_with_NaN
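A quick Go demonstration of that comparison behavior:

package main

import (
    "fmt"
    "math"
)

func main() {
    nan := math.NaN()
    fmt.Println(nan == 0)        // false: NaN is not zero
    fmt.Println(nan == nan)      // false: NaN never compares equal, even to itself
    fmt.Println(math.IsNaN(nan)) // true: the only reliable test for NaN
}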
I used to use the collectd-cloudwatch plugin for collectd at https://github.com/awslabs/collectd-cloudwatch, which worked correctly by dropping NaN. But that project has gone stale and was never updated for Python 3. So I switched to the CWA, and now my alarms are evergreen because of this NaN -> 0 "translation feature".
I took another look and it seems that for float64 this is already handled:
https://github.com/aws/amazon-cloudwatch-agent/blob/53040cdc24bfe683175fcec6d560b9e34136154b/internal/models/awscsm_pipeline.go#L90-L97
This issue was marked stale due to lack of activity.
Well, sounds like WONTFIX. Caveat emptor: basic math fail, unusable. Maintainer won't even tag this as a BUG. Looking first at migrating my fleet to netdata + kinesis, for anyone else who comes across this.
Hi @edlins, thanks for contacting us. Looking at the problem now!
This issue was marked stale due to lack of activity.
We've fixed this as part of https://github.com/aws/amazon-cloudwatch-agent/pull/847, which has been released as of v1.300028.0. The agent will now drop unsupported values like NaN and +/- Inf.