Updated aggregation range works incorrectly when computer sleeps
Hello,
I have a telegraf config with an input and a basicstats aggregator. When the computer resumes from sleep, the aggregator timekeeping is left behind. The consequence is that aggregated stats are dropped as they don't fit in the current aggregation window. The problem is that aggregation window will remain in the past forever.
After setting debug = true in agent config, I find these messages in the logs (note time differences):
Jun 07 09:57:39 my-host telegraf[6871]: 2023-06-07T07:57:39Z D! [aggregators.basicstats] Metric is outside aggregation window; discarding. 2023-06-07 09:57:39.992533402 +0200 CEST: m: 2023-06-06 22:30:47.845521171 +0200 CEST m=+16170.033531705 e: 2023-06-06 22:31:17.845521171 +0200 CEST m=+16200.033531705 g: 0s
Jun 07 09:57:40 my-host telegraf[6871]: 2023-06-07T07:57:40Z D! [aggregators.basicstats] Updated aggregation range [2023-06-06 22:31:17.845521171 +0200 CEST m=+16200.033531705, 2023-06-06 22:31:47.845521171 +0200 CEST m=+16230.033531705]
Looking in the code, I suspect the issue comes from this line: https://github.com/influxdata/telegraf/blob/master/models/running_aggregator.go#L174
func (r *RunningAggregator) Push(acc telegraf.Accumulator) {
r.Lock()
defer r.Unlock()
since := r.periodEnd
until := r.periodEnd.Add(r.Config.Period) <---- Highlighted line
r.UpdateWindow(since, until)
...
... as the aggregation window is incremented each time with the period and I see nowhere something to accomodate for time drifts (such as we have during machine sleeps).
I think we should change the code to set until to something like now() + c.Config.Period, so that we accomodate for time drifting in one direction or another.
Thanks for your help on this, Catalin.
I think we should change the code to set until to something like now() + c.Config.Period, so that we accomodate for time drifting in one direction or another.
Unfortunately, changing that could create periods in time where we skip data during normal operations. For example, there could be a small period where time.now() is off by a few seconds from where the period end was.
I think what we could do is detect that time.now is more than a config.period away from the until (meaning we are already past the entire period) and update the until to be time.now() + r.Config.Period as you suggest.
@srebhan thoughts?
Hmmm I think we might want to do both. :-) My idea would be to create "bins" from the last configuration window until "now" and sort the metrics in there as during sleep we might have metrics in the pipe that are stuck right before the aggregator... Furthermore, it might be that we read metrics from the past that are lost otherwise, e.g. a tail from a network drive that was filled during sleep or a service endpoint with historic data... This way we will have a burst in the number of bins, but we "do the right thing" (tm) without loosing data...
Is there any progress on this issue ?
I have the case, that the telegraf runs on an edge device. This device has not the correct date/time after power on. It takes some time to synchronize the clock by NTP. But that leads to the condition, that telegraf starts with an old date/time and because of this, merging metrics also does not work correctly once the date/time is updated.
@JSchy65 no I don't think so. Happy to review a pull-request though.
Probably fixed this issue with the following commit: https://github.com/Schachi/telegraf-fork/commit/5bdf91ea062fc54a318e2154bff90010550f44da
Unfortunately, for juristical reasons, for my company it is not possible to sign the "Corporate Contributor License Agreement". Therefore I am not allowed to create a pull request.
All what I can do is to make the suggestion to use the above fix.
@Schachi for legal reasons we cannot just copy the code above if you did not sign the CLA. Maybe your company permits that you submit the PR as a private person?