flux
flux copied to clipboard
Change aggregateWindow _start and _stop to match window's start and stop
When using the aggregateWindow
function, it will set all the outputted records _start
and _stop
columns to the value set in the range
function. This behavior differs from that of window
. window
sets the _start
and _stop
to the start and stop for the window for that particular record. This difference is confusing and doesn't seem like it should be the default.
I propose changing the _start
and _stop
values for the outputted columns to the start and stop of an individual row's window. An option can be added to aggregateWindow
to allow the user to pass in the columns they want the final window
function call to output its _start
and _stop
to.
@kennyjwilli aggregateWindow()
is a higher-level function that uses window(every:inf)
as the last step to "group" everything into a single table. This resets _start
and _stop
values to be the bounds of the covered range. Here's the function definition:
aggregateWindow = (every, fn, column="_value", timeSrc="_stop", timeDst="_time", tables=<-) =>
tables
|> window(every:every)
|> fn(column:column)
|> duplicate(column:timeSrc, as:timeDst)
|> window(every:inf, timeColumn:timeDst)
aggregateWindow()
is primarily designed to downsample data. The behavior of using the bounds of the entire range covered by the query to set _start
and _stop
matches how data comes in with the from()
function.
If you want to preserve the _start
and _stop
values of each window, you can define a custom aggregateWindow()
alternative that uses group()
instead of window(every:inf)
.
customAggregateWindow = (every, fn, column="_value", timeSrc="_stop", timeDst="_time", tables=<-) =>
tables
|> window(every:every)
|> fn(column:column)
|> duplicate(column:timeSrc, as:timeDst)
|> group()
This will group all records into a single table and preserve all column values.
Thanks @sanderson. The behavior of aggregateWindow
is, IMO, not intuitive and unnecessary. Are there actual use cases to have _start
and _stop
be the same for every single row in the resulting table?
I was proposing to change the default behavior of aggregateWindow
to place each row window's start and stop in the _start
and _stop
columns. Or, given that would be a breaking change, provide a parameter to pass to aggregateWindow
that allows this functionality.
In the use case of downsampling data that aggregateWindow()
is designed for, the final values of _start
and _stop
don't really matter. What matters is that each row has a timestamp representing the window it's from. This is why _start
or _stop
gets duplicated as _time
before window(every:inf)
.
This behavior is tailored to downsampling data. Do you have a use case where the values of _start
and _stop
do matter in the final output?
Yes. It is useful in our application to know the size of the window that a particular row represents. If _start
and _stop
"don't really matter", I suggest changing them to have more useful information.
We can change our queries to do the window and aggregation manually but it's more convenient to use aggregateWindow
, especially when pasting a query into Chronograf.
I believe there is already a proposal to hide/remove the _start
and _stop
columns for input data and add them only when they're needed (when using window()
). If looking for window size/duration, that's defined by aggregateWindow
's every
parameter. The difference between the _start
and _stop
values after windowing will always equal the duration set with every
.
If you're looking to preserve the times of the first and last rows in each window, _start
and _stop
won't give you that. They represent the window bounds. You'd have to write a custom aggregate function to hold on to first and last timestamps.
sumWithTimestamps = (tables=<-) =>
tables
|> reduce(
identity: {sum: 0.0, index: 0, firstTime: 2019-01-01, lastTime:2019-01-01},
fn: (r, accumulator) => ({
r with
sum: r._value + accumulator.sum,
index: accumulator.sum + 1,
firstTime: if accumulator.index == 0 then r._time else accumulator.firstTime,
lastTime: if accumulator.index != 0 then r._time else accumulator.lastTime,
})
)
This behavior is tailored to downsampling data. Do you have a use case where the values of _start and _stop do matter in the final output?
That's actually required to later be able to join windowed data between different series.
I think the wanted behavior described here is implemented by just calling window()
by itself. aggregateWindow
is specifically for the use case that involves graphing the data which requires everything to be in the same table. If the start and stop are useful, then calling window() |> count()
will get you what you want. If you want to group them while keeping start and stop the same, you can call group()
using the specific columns you want to group by and exclude _start
and _stop
from it.
@sanderson , sorry for necro posting, but I was debugging issue with aggregateWindow
behaviour and found your comment:
What matters is that each row has a timestamp representing the window it's from
And following implementation example:
|> duplicate(column:timeSrc, as:timeDst)
I agree with statement that each row need to have a timestamp represening the window it is from. Another way to look at it is that aggregating already aggregated data should make no change to the data at all, since each window has one data point on the window boundary.
Having said that implementation doesn't achieve this design goal. Time window is exclusive of stop, it is [start, stop)
. By setting _time
to _stop
of every window we move data point to the next window, not keeping it in the window it belong.
you can use timeSrc
parameter
https://docs.influxdata.com/flux/v0/stdlib/universe/aggregatewindow/#timesrc
|> aggregateWindow(every: 1d, fn: mean, timeSrc: "_start")