basicstats aggregators plugin's `diff` and `non_negative_diff` provide misleading results
To my understanding, the basicstats aggregators plugin's diff and non_negative_diff provide at least misleading results. This might also apply to the rate stat which is computed based on diff.
The behaviour I expect from a diff is to provide the difference considering all measurements. Unfortunately the current implementation only provides diffs from the first to the last measurement of a timeframe that is shifted without overlapping (the last measurement of the previous window is not included). With A, B, C, D, E and F being measurements
A B C D E F
|<- period 1 ->|<- period 2 ->|<- period 3 ->|<- period 4 ->|<- period 5 ->|
we currently only get a diff_value for the changes from A to B, C to D and E to F, but we are missing the changes from B to C and D to E! Taking only the current period into account is fine / required for stats like sum, max and count, but I think it does not make sense for deriving stats like diff and rate.
When we implemented the diff stat, I did not realize that the basicstats plugin exclusively moves the sliding window for measurements to be considered.
If there are usecases for the current behaviour, I am very interested in learning about those. Otherwise this behaviour should be regarded as a bug.
To my current understanding I assume this plugin should not implement deriving stats like diff and rate as this is not how the concept of the plugin works. Maybe there should be an extra plugin providing those stats or the existing derivative aggregator plugin could be extended accordingly.
Relevant telegraf.conf:
[agent]
interval = "3s"
flush_interval = "3s"
omit_hostname = true
[[outputs.file]]
files = ["stdout"]
data_format = "influx"
[[aggregators.basicstats]]
period = "6s"
drop_original = false
stats = ["count","min","max","diff"]
[[inputs.exec]]
commands = [
"bash -c 'echo $(($RANDOM % 10))'"
]
data_format = "value"
System info:
telegraf version 1.19.3
Steps to reproduce:
start telegraf with the above config (telegraf --config telegraf.conf)
Actual behavior:
2021-09-03T09:25:02Z I! Starting Telegraf 1.19.3
2021-09-03T09:25:02Z I! Loaded inputs: exec
2021-09-03T09:25:02Z I! Loaded aggregators: basicstats
2021-09-03T09:25:02Z I! Loaded processors:
2021-09-03T09:25:02Z I! Loaded outputs: file
2021-09-03T09:25:02Z I! Tags enabled:
2021-09-03T09:25:02Z I! [agent] Config: Interval:3s, Quiet:false, Hostname:"", Flush Interval:3s
exec value=0i 1630661103000000000
exec value_count=1,value_min=0,value_max=0 1630661106000000000
exec value=7i 1630661106000000000
exec value=0i 1630661109000000000
exec value_count=2,value_min=0,value_max=7,value_diff=-7 1630661112000000000
exec value=3i 1630661112000000000
exec value=8i 1630661115000000000
exec value_diff=5,value_count=2,value_min=3,value_max=8 1630661118000000000
exec value=8i 1630661118000000000
exec value=5i 1630661121000000000
exec value_diff=-3,value_count=2,value_min=5,value_max=8 1630661124000000000
exec value=4i 1630661124000000000
exec value=9i 1630661127000000000
exec value_max=9,value_diff=5,value_count=2,value_min=4 1630661130000000000
exec value=5i 1630661130000000000
exec value=2i 1630661133000000000
exec value_count=2,value_min=2,value_max=5,value_diff=-3 1630661136000000000
exec value=8i 1630661136000000000
exec value=5i 1630661139000000000
^C2021-09-03T09:25:43Z I! [agent] Hang on, flushing any cached metrics before shutdown
exec value_count=2,value_min=5,value_max=8,value_diff=-3 1630661142000000000
exec value=3i 1630661142000000000
exec value_count=1,value_min=3,value_max=3 1630661144000000000
2021-09-03T09:25:43Z I! [agent] Stopping running outputs
Expected behavior:
Besides the provided value_diff entries, e.g. from 1630661106000000000 to 1630661109000000000 resulting in value_diff=-7 and from 1630661112000000000 to 1630661115000000000 resulting in value_diff=5 I would have expected additional value_diff entries, e.g. for the window from 1630661109000000000 to 1630661112000000000.
next steps: look into writing a unit test to demonstrate this behavior and then look into what would be required to fix it
@m0 looks like you landed these features originally and have discovered the odd behavior as well. Do you have any thoughts on what needs to change to correct the behavior?
@powersj I think there are two issues that need to be considered:
- remove the broken implementation from the basicstats plugin as to my understanding it seems to be beyond the plugin's scope
- find a better place to implement the diff / non_negative_diff feature (maybe it even needs a separate plugin)
PS: Sorry for the late response, the notification about your post got lost on my side. :disappointed: