telegraf Possible bug: With win_perf_counters some timestamps are late, causes graphs to show missing data

Open AlexHeylin opened this issue 4 years ago • 8 comments

Issue: Possible bug / unexpected behaviour

Telegraf version: 1.14.2 (also 1.13.something)
Inputs: win_perf_counters, also exec on some servers
Output: InfluxDB (production), file (testing)

Scenario

Telegraf is set to poll a number of win_perf_counters and send them to InfluxDB every 10 seconds. Grafana draws graphs from this data; some of those graphs use GROUP BY time($interval) with $interval = 10s.

Symptom

First seen as a data point missing from all graphs for a host every 70 seconds during some time periods. This doesn't always happen, and it can occur on machines where this had been working fine before. Sometimes it is triggered by a restart of the Telegraf service: after a restart there is a chance this can happen. It has also suddenly started on machines without Telegraf being restarted. This affects all data from win_perf_counters.

[Screenshots: Telegraf-InfluxDB-Grafana-01, Telegraf-InfluxDB-Grafana-02]

Behaviour

Some data is shown as missing on graphs because its timestamp is later than it should be. The late point falls over the boundary of the GROUP BY interval, so that interval returns a "missing" result and the next result is made up (averaged) from two data points.
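
To illustrate the effect, here is a minimal Python sketch of 10-second bucketing. The timestamps are made up for illustration (whole seconds, with one point due at :29 written at :30 instead), and `bucket_counts` is a hypothetical helper that only mimics the grouping of GROUP BY time(10s), not InfluxDB itself.

```python
from collections import defaultdict

BUCKET_S = 10  # same width as GROUP BY time(10s)


def bucket_counts(timestamps, bucket_s=BUCKET_S):
    """Count points per bucket, including empty buckets between the first and last."""
    counts = defaultdict(int)
    for ts in timestamps:
        # A point belongs to the bucket starting at the previous multiple of bucket_s.
        counts[ts - ts % bucket_s] += 1
    first, last = min(counts), max(counts)
    return {start: counts.get(start, 0)
            for start in range(first, last + bucket_s, bucket_s)}


# Hypothetical timestamps in seconds: points due at :09, :19, :29, :39, :49.
on_time = [9, 19, 29, 39, 49]
one_late = [9, 19, 30, 39, 49]  # the :29 point was written one second late

print(bucket_counts(on_time))   # one point in every bucket
print(bucket_counts(one_late))  # bucket 20 is empty, bucket 30 holds two points
```

With the late point, the bucket starting at 20 has no data (the gap on the graph) and the bucket starting at 30 averages two points.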

Expected behaviour

Graphs do not show data as missing, because data is written with timestamps which are 10 seconds apart (not 10 - 11 seconds apart).

Investigations done:

This doesn't just affect machines using the exec input, so that is excluded as the cause. It can apparently occur without the Telegraf service being restarted; however, restarting the service may cause it to occur on machines which were not previously showing the issue (it was not evident on graphs).

Checked timestamps on data written to InfluxDB and to file, and observed that sometimes data is written with a timestamp one second later than it should be. This seems particularly prevalent if the data is due on the 09 / 19 / 29 / 39 / 49 / 59th second of the minute. It may get written as x0 or occasionally x1 instead.
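
A check like this can be scripted. The sketch below is only an illustration, under these assumptions: the file output contains InfluxDB line protocol with nanosecond timestamps as the last field, every line carries at least one tag, and the measurement of interest is win_mem. `find_irregular_gaps` and the sample lines are hypothetical and not taken from the real data.

```python
NS_PER_S = 1_000_000_000


def find_irregular_gaps(lines, measurement="win_mem", expected_s=10):
    """Return (prev_second_of_minute, second_of_minute, gap_seconds) for every
    pair of consecutive distinct timestamps whose gap is not expected_s."""
    stamps = set()
    for line in lines:
        line = line.strip()
        # Assumes tagged line protocol, i.e. lines start with "measurement,".
        if not line.startswith(measurement + ","):
            continue
        # Assumes the timestamp is the last space-separated field, in nanoseconds.
        stamps.add(int(line.rsplit(" ", 1)[1]))
    ordered = sorted(stamps)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        gap_s = (cur - prev) / NS_PER_S
        if gap_s != expected_s:
            gaps.append((prev // NS_PER_S % 60, cur // NS_PER_S % 60, gap_s))
    return gaps


# Made-up sample: the point due at second 29 carries a timestamp at second 30.
sample_lines = [
    "win_mem,host=MyHostName,objectname=Memory Available_Bytes=1 1589241609000000000",
    "win_mem,host=MyHostName,objectname=Memory Available_Bytes=2 1589241619000000000",
    "win_mem,host=MyHostName,objectname=Memory Available_Bytes=3 1589241630000000000",
    "win_mem,host=MyHostName,objectname=Memory Available_Bytes=4 1589241639000000000",
]

for prev_s, cur_s, gap in find_irregular_gaps(sample_lines):
    print(f"second {prev_s:02d} -> second {cur_s:02d}: gap {gap}s")
```

For the sample lines this reports an 11 s gap followed by a 9 s gap, which is the pattern described above.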

win_perf_counters should be reading the timestamps from the perf counters (default setting), so it's not clear how these can be out by a second. Possibly this is due to rounding to 1s accuracy.

InfluxDB queries which show this issue

Grafana

SELECT mean("Available_Bytes") as "Available Bytes", mean("Pool_Nonpaged_Bytes") as "Pool Nonpaged Bytes", mean("Pool_Paged_Bytes") as "Pool Paged Bytes" FROM "win_mem" WHERE host =~ /$hostname$/ AND $timeFilter GROUP BY time($interval), host ORDER BY asc

Simplest form

SELECT mean("Available_Bytes") as "Available Bytes" FROM "win_mem" WHERE host = 'MyHostName' AND time > now() - 1h AND time < now() - 10s GROUP BY time(10s)

Relevant settings:

Here's what I think is relevant. Can post full files if needed, but don't want to clutter the post.

[agent]
  interval = "300s"
  round_interval = false
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""

[[inputs.win_perf_counters]]
  interval = "10s"
  UseWildcardsExpansion = true

May 12 '20 12:05 AlexHeylin
