fix(aggregators): Handle time drift when calculating aggregation windows
Summary
MergeAggregator discarded metrics if the system clock was adjusted in meantime. Therefore metrics got lost. To prevent this, calculation of the merge window is now changed to take such time shifts into account.
Checklist
- [x] No AI generated code was used in this PR
Related issues
resolves #13403
@Schachi could you please rebase your PR on latest master to fix the CI tests?!
Thank you very much for your contribution @Schachi and sorry for the late reply! It took me some time to read into the issue so please let me try to understand what this is about. What I understood is the following...
You start Telegraf with the merge aggregator which in turn computes its aggregation window. Then, after some time, your machine goes to sleep (or otherwise adjusts the computer clock). Say the aggregation window is 12:13:00 to 12:13:30 when the computer hibernates at say 12:13:20. This aggregation window will be persisted in memory and if the machine wakes up at 20:45:31 it is still more than one hour behind. Telegraf will then never be able to catch up as the aggregation window will always advance with the fixed period. Is my understanding correct?
Yes, your understandig is correct. The example with the machine sleeping is from someone else. In our case, we have telegraf running on edge devices. If they are powered up, in very much cases, the system time is some hours behind the real time and dependend on the dynamic processes during startup, it will take some time before the clock is adjusted. If telegraf calculates the merge window after the clock is adjusted, all is fine. The merge aggregator works as expected and we get the merged messages. But if the telegraf calculated its merge window before the clock is adjusted, the merge windows is in the past "for ever". And because it is updated only relative to that point in past, it does not aggregate the current messages and therefore all this messages are lost in our case. The main idea of the fix is, that the merge window is always calculated based the time.now() to prevent a merge window that is in the past.
Gesendet: Montag, 24. Februar 2025 um 14:52 Von: "Sven Rebhan" @.> An: influxdata/telegraf @.> CC: Joachim @.>,Mention @.> Betreff: Re: [influxdata/telegraf] fix(aggregators.merge): Handle time drift when calculating aggregation windows (PR #16375)
Thank you very much for your contribution @Schachi and sorry for the late reply! It took me some time to read into the issue so please let me try to understand what this is about. What I understood is the following... You start Telegraf with the merge aggregator which in turn computes its aggregation window. Then, after some time, your machine goes to sleep (or otherwise adjusts the computer clock). Say the aggregation window is 12:13:00 to 12:13:30 when the computer hibernates at say 12:13:20. This aggregation window will be persisted in memory and if the machine wakes up at 20:45:31 it is still more than one hour behind. Telegraf will then never be able to catch up as the aggregation window will always advance with the fixed period. Is my understanding correct? —Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.> srebhan left a comment (influxdata/telegraf#16375) Thank you very much for your contribution @Schachi and sorry for the late reply! It took me some time to read into the issue so please let me try to understand what this is about. What I understood is the following... You start Telegraf with the merge aggregator which in turn computes its aggregation window. Then, after some time, your machine goes to sleep (or otherwise adjusts the computer clock). Say the aggregation window is 12:13:00 to 12:13:30 when the computer hibernates at say 12:13:20. This aggregation window will be persisted in memory and if the machine wakes up at 20:45:31 it is still more than one hour behind. Telegraf will then never be able to catch up as the aggregation window will always advance with the fixed period. Is my understanding correct? —Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.>
Good. So first of all I agree that we should handle those cases. My only concern is that with your current code, the aggregation window will grow in length if the clock is adjusted and the start-time will not be a multiple of the previous start time. So what we need to do is to compute the aggregation window as before. Then we detect if "now" (aka wall clock) is outside of the new aggregation window and adapt the new aggregation window such that "now" is in the window but since is a multiple of r.periodEnd + N*r.Config.Period with N being a whole number integer. Does that make sense?
@Schachi any chance you can work on this?
Download PR build artifacts for linux_amd64.tar.gz, darwin_arm64.tar.gz, and windows_amd64.zip. Downloads for additional architectures and packages are available below.
:relaxed: This pull request doesn't significantly change the Telegraf binary size (less than 1%)
:package: Click here to get additional PR build artifacts
Artifact URLs
Hi Sven,
sorry for the late response. I wasn't aware that you are waiting for some results from me.
I changed the running_aggregator.go as you suggested Mar 3. 2025 above and tested if the change works like expected. I started telegraf, changed time back and forward and checked if the aggregation window behaved as expected and the merged messages are still arriving:
- all aggregation windows are like expected
- the merge windows period (here: 10s) is preserved
- alle merged events received / no lose of events anymore
IMHO the issue is fixed
For details see the attached test report: test-report-cc2954b.txt
While reviewing, I was initially not clear about why we are using Truncate(-1) in this part of the code:
nowWall := time.Now().Truncate(-1)
I found the following explanation in the time package documentation, which clarifies it for me. Adding it here for future reference:
On some systems, the monotonic clock will stop if the computer goes to sleep. On such a system,
t.Sub(u)may not accurately reflect the actual time that passed betweentandu. The same applies to other functions and methods that subtract times, such as Since, Until, Time.Before, Time.After, Time.Add, Time.Equal, and Time.Compare. In some cases, you may need to strip the monotonic clock to get accurate results.
Additionally, Truncate(d) rounds t down to a multiple of d (since the zero time).
If
d <= 0,Truncatereturnststripped of any monotonic clock reading but otherwise unchanged.
This explains why Truncate(-1) is used in the code.