telegraf
Possible bug: With win_perf_counters some timestamps are late, causes graphs to show missing data
Issue: Possible bug / unexpected behaviour
Telegraf version: 1.14.2 (also 1.13.something)
Inputs: win_perf_counters, also exec on some servers
Output: InfluxDB (production), file (testing)
Scenario
Telegraf is set to poll a number of win_perf_counters and send them to InfluxDB every 10 seconds. Grafana draws graphs based on this data; some of those graphs use GROUP BY time($interval) with $interval = 10s.
Symptom
First seen as a data point missing every 70 seconds from all graphs for a host, during some time periods. This doesn't always happen, and it can occur on machines where everything had been working fine before. Sometimes it is triggered by restarting the Telegraf service (a restart gives it a chance of happening). It has also started suddenly on machines where Telegraf had not been restarted. This affects all data from win_perf_counters.
Behaviour
Some data is shown as missing on graphs because its timestamp is later than it should be. The late point falls across the boundary of the GROUP BY time() interval, so one interval returns a "missing" result and the next result is made up (averaged) from two data points.
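To illustrate the effect described above, here is a minimal Python sketch of how fixed 10-second windows behave when one point is stamped a second late. The sample values and the helper name bucket_means are made up for illustration; it only mimics the grouping idea, not InfluxDB's actual implementation.

```python
from collections import defaultdict


def bucket_means(points, width=10):
    """Group (timestamp_seconds, value) pairs into fixed windows of
    `width` seconds and return the mean per window (None if empty)."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Window start is the timestamp rounded down to a multiple of width.
        buckets[ts - ts % width].append(value)
    if not buckets:
        return {}
    start, end = min(buckets), max(buckets)
    result = {}
    for b in range(start, end + width, width):
        values = buckets.get(b)
        result[b] = sum(values) / len(values) if values else None
    return result


# Hypothetical samples due at 9, 19, 29 and 39 seconds;
# the third one is stamped 30 instead of 29.
points = [(9, 100.0), (19, 110.0), (30, 120.0), (39, 130.0)]
result = bucket_means(points)
print(result)
```

In this made-up data the 20s window comes back empty and the 30s window is the average of two points, which matches what the graphs show.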
Expected behaviour
Graphs do not show data as missing, because data is written with timestamps which are 10 seconds apart (not 10 - 11 seconds apart).
Investigations done:
This doesn't just affect machines using the exec input, so that is excluded as a cause. It can apparently occur without the Telegraf service being restarted; however, restarting the service may cause it to occur on machines which were not previously showing the issue (it was not evident on their graphs).
Checked the timestamps on data written to InfluxDB and to file, and observed that data is sometimes written with a timestamp one second later than it should be. This seems particularly prevalent when the data is due on the 09 / 19 / 29 / 39 / 49 / 59th second of the minute; it may get written as x0, or occasionally x1, instead.
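For anyone wanting to repeat the timestamp check on the file output, a rough Python sketch follows. It assumes the file output is in InfluxDB line protocol with a nanosecond timestamp as the last space-separated token; the sample line and the epoch values are invented, and the helper names are hypothetical.

```python
def timestamp_seconds(line):
    """Extract the trailing nanosecond timestamp from a line-protocol
    line and convert it to whole seconds."""
    return int(line.rsplit(" ", 1)[1]) // 1_000_000_000


def find_irregular_gaps(timestamps, expected=10):
    """Return (previous, current, gap) for each pair of consecutive
    timestamps whose gap differs from the expected interval."""
    ordered = sorted(timestamps)
    return [(a, b, b - a)
            for a, b in zip(ordered, ordered[1:])
            if b - a != expected]


# Invented example line and timestamps, for illustration only.
line = "win_mem,host=MyHostName Available_Bytes=1024 1589200030000000000"
parsed = timestamp_seconds(line)

sample = [1589200009, 1589200019, 1589200030, 1589200039, 1589200049]
irregular = find_irregular_gaps(sample)
for prev, cur, gap in irregular:
    print(prev, cur, gap)
```

A late point shows up as an 11-second gap followed by a 9-second gap.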
win_perf_counters should be reading the timestamps from the perf counters (default setting), so it's not clear how these can be out by a second. Possibly this is due to rounding to 1s accuracy.
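The rounding idea is only a guess, but the sketch below shows how it could produce 10 to 11 second gaps. It assumes (this is not confirmed) that the real collection times drift by a few hundred milliseconds and are then rounded to the nearest second; the millisecond values are made up.

```python
def round_to_seconds(ms):
    """Round a millisecond timestamp to the nearest whole second
    (integer arithmetic, halves round up)."""
    return (ms + 500) // 1000


# Hypothetical collection times in milliseconds, drifting slightly
# around the x9.5 second mark.
times_ms = [9400, 19400, 29600, 39600, 49400]

rounded = [round_to_seconds(t) for t in times_ms]
true_gaps_ms = [b - a for a, b in zip(times_ms, times_ms[1:])]
rounded_gaps = [b - a for a, b in zip(rounded, rounded[1:])]

print(rounded)
print(true_gaps_ms)
print(rounded_gaps)
```

With these numbers the real gaps stay within 200 ms of 10 seconds, yet the rounded timestamps end up 11 and 9 seconds apart, and points due at x9 land on x0.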
InfluxDB queries which show this issue
Grafana
SELECT mean("Available_Bytes") as "Available Bytes", mean("Pool_Nonpaged_Bytes") as "Pool Nonpaged Bytes", mean("Pool_Paged_Bytes") as "Pool Paged Bytes" FROM "win_mem" WHERE host =~ /$hostname$/ AND $timeFilter GROUP BY time($interval), host ORDER BY asc
Simplest form
SELECT mean("Available_Bytes") AS "Available Bytes" FROM "win_mem" WHERE host = 'MyHostName' AND time > now() - 1h AND time < now() - 10s GROUP BY time(10s)
Relevant settings:
Here's what I think is relevant. Can post full files if needed, but don't want to clutter the post.
[agent]
interval = "300s"
round_interval = false
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "0s"
precision = ""
[[inputs.win_perf_counters]]
interval = "10s"
UseWildcardsExpansion=true