flux icon indicating copy to clipboard operation
flux copied to clipboard

Change aggregateWindow _start and _stop to match window's start and stop

Open kennyjwilli opened this issue 4 years ago • 9 comments

When using the aggregateWindow function, it will set all the outputted records _start and _stop columns to the value set in the range function. This behavior differs from that of window. window sets the _startand _stop to the start and stop for the window for that particular record. This difference is confusing and doesn't seem like it should be the default.

I propose changing the _start and _stop values for the outputted columns to the start and stop of an individual row's window. An option can be added to aggregateWindow to allow the user to pass in the columns they want the final window function call to output its _start and _stop to.

kennyjwilli avatar Aug 19 '19 22:08 kennyjwilli

@kennyjwilli aggregateWindow() is a higher-level function that uses window(every:inf) as the last step to "group" everything into a single table. This resets _start and _stop values to be the bounds of the covered range. Here's the function definition:

aggregateWindow = (every, fn, column="_value", timeSrc="_stop", timeDst="_time", tables=<-) =>
  tables
    |> window(every:every)
    |> fn(column:column)
    |> duplicate(column:timeSrc, as:timeDst)
    |> window(every:inf, timeColumn:timeDst)

aggregateWindow() is primarily designed to downsample data. The behavior of using the bounds of the entire range covered by the query to set _start and _stop matches how data comes in with the from() function.

If you want to preserve the _start and _stop values of each window, you can define a custom aggregateWindow() alternative that uses group() instead of window(every:inf).

customAggregateWindow = (every, fn, column="_value", timeSrc="_stop", timeDst="_time", tables=<-) =>
  tables
    |> window(every:every)
    |> fn(column:column)
    |> duplicate(column:timeSrc, as:timeDst)
    |> group()

This will group all records into a single table and preserve all column values.

sanderson avatar Aug 19 '19 23:08 sanderson

Thanks @sanderson. The behavior of aggregateWindow is, IMO, not intuitive and unnecessary. Are there actual use cases to have _start and _stop be the same for every single row in the resulting table?

I was proposing to change the default behavior of aggregateWindow to place each row window's start and stop in the _start and _stop columns. Or, given that would be a breaking change, provide a parameter to pass to aggregateWindow that allows this functionality.

kennyjwilli avatar Aug 19 '19 23:08 kennyjwilli

In the use case of downsampling data that aggregateWindow() is designed for, the final values of _start and _stop don't really matter. What matters is that each row has a timestamp representing the window it's from. This is why _start or _stop gets duplicated as _time before window(every:inf).

This behavior is tailored to downsampling data. Do you have a use case where the values of _start and _stop do matter in the final output?

sanderson avatar Aug 19 '19 23:08 sanderson

Yes. It is useful in our application to know the size of the window that a particular row represents. If _start and _stop "don't really matter", I suggest changing them to have more useful information.

We can change our queries to do the window and aggregation manually but it's more convenient to use aggregateWindow, especially when pasting a query into Chronograf.

kennyjwilli avatar Aug 19 '19 23:08 kennyjwilli

I believe there is already a proposal to hide/remove the _start and _stop columns for input data and add them only when they're needed (when using window()). If looking for window size/duration, that's defined by aggregateWindow's every parameter. The difference between the _start and _stop values after windowing will always equal the duration set with every.

If you're looking to preserve the times of the first and last rows in each window, _start and _stop won't give you that. They represent the window bounds. You'd have to write a custom aggregate function to hold on to first and last timestamps.

sumWithTimestamps = (tables=<-) =>
  tables
    |> reduce(
      identity: {sum: 0.0, index: 0, firstTime: 2019-01-01, lastTime:2019-01-01},
      fn: (r, accumulator) => ({
        r with
        sum: r._value + accumulator.sum,
        index: accumulator.sum + 1,
        firstTime: if accumulator.index == 0 then r._time else accumulator.firstTime,
        lastTime: if accumulator.index != 0 then r._time else accumulator.lastTime,
      })
    )

sanderson avatar Aug 20 '19 15:08 sanderson

This behavior is tailored to downsampling data. Do you have a use case where the values of _start and _stop do matter in the final output?

That's actually required to later be able to join windowed data between different series.

andig avatar Aug 25 '19 16:08 andig

I think the wanted behavior described here is implemented by just calling window() by itself. aggregateWindow is specifically for the use case that involves graphing the data which requires everything to be in the same table. If the start and stop are useful, then calling window() |> count() will get you what you want. If you want to group them while keeping start and stop the same, you can call group() using the specific columns you want to group by and exclude _start and _stop from it.

jsternberg avatar Feb 01 '20 23:02 jsternberg

@sanderson , sorry for necro posting, but I was debugging issue with aggregateWindow behaviour and found your comment:

What matters is that each row has a timestamp representing the window it's from

And following implementation example:

|> duplicate(column:timeSrc, as:timeDst)

I agree with statement that each row need to have a timestamp represening the window it is from. Another way to look at it is that aggregating already aggregated data should make no change to the data at all, since each window has one data point on the window boundary.

Having said that implementation doesn't achieve this design goal. Time window is exclusive of stop, it is [start, stop). By setting _time to _stop of every window we move data point to the next window, not keeping it in the window it belong.

redbaron avatar Feb 02 '23 14:02 redbaron

you can use timeSrc parameter
https://docs.influxdata.com/flux/v0/stdlib/universe/aggregatewindow/#timesrc |> aggregateWindow(every: 1d, fn: mean, timeSrc: "_start")

umer-ilyas avatar Jan 22 '24 05:01 umer-ilyas