New `rolling` transform

Open a-rodin opened this issue 5 years ago • 8 comments

It would be nice to have a native transform which can perform rolling window calculations on event fields.

Use cases

One simple and common potential use case is combining it with a fixed source (https://github.com/timberio/vector/issues/1317) to generate a sequence of events with an increasing counter.

A more complex use case is doing simple smoothing or counting in Vector before sending the data to sinks, for example, computing moving averages with subsequent sampling.

Configuration

It would take a field and then calculate one of the following functions on the values of this field:

  • count - count the number of events in which the source field was present;
  • sum - sum the values of the source field;
  • average - calculate the average of the values of the source field;
  • rms - calculate the root mean square deviation (sqrt(average(x^2) - average(x)^2)).
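As a hypothetical sketch (not Vector code), the four functions could look like this over a window of values, treating `None` as "field absent" and using the RMS formula from the list above:

```python
import math

def rolling_count(values):
    # count: number of events in which the source field was present
    return sum(1 for v in values if v is not None)

def rolling_sum(values):
    # sum: total of the present values
    return sum(v for v in values if v is not None)

def rolling_average(values):
    # average: mean of the present values
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None

def rolling_rms(values):
    # rms: sqrt(average(x^2) - average(x)^2), per the formula above
    present = [v for v in values if v is not None]
    if not present:
        return None
    mean = sum(present) / len(present)
    mean_sq = sum(v * v for v in present) / len(present)
    return math.sqrt(mean_sq - mean * mean)
```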

The rolling window could take one of the following types:

  • expanding - the window contains all timeseries points from the beginning;
  • exponential - the window weights timeseries points with exponential decay;
  • rectangular - the window uses either a fixed number of points or a fixed time interval.
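To make the three window types concrete, here is a hypothetical sketch of each computing an average (the `alpha` and `window_points` parameters are illustrative, not proposed option names):

```python
from collections import deque

def expanding_avg(values):
    # expanding: uses all points from the beginning
    return sum(values) / len(values)

def exponential_avg(values, alpha=0.5):
    # exponential: newer points weigh more, older ones decay by (1 - alpha)
    state = None
    for v in values:
        state = v if state is None else alpha * v + (1 - alpha) * state
    return state

def rectangular_avg(values, window_points=3):
    # rectangular: only the last `window_points` events count
    window = deque(values, maxlen=window_points)
    return sum(window) / len(window)
```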

The size of the window could be specified using one of the following configuration options:

  • window_points - specifies the size of the window as a number of events;
  • window_duration - specifies the size of the window as a time interval.
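A duration-bounded window implies evicting events older than the interval as new ones arrive. A hypothetical sketch of that eviction logic (names are illustrative):

```python
from collections import deque

class DurationWindow:
    # Rectangular window bounded by time: keep only events whose timestamp
    # is within `window_duration` seconds of the newest event.
    def __init__(self, window_duration):
        self.window_duration = window_duration
        self.events = deque()  # (timestamp, value) pairs, oldest first

    def push(self, timestamp, value):
        self.events.append((timestamp, value))
        # evict events that fell out of the window
        while timestamp - self.events[0][0] > self.window_duration:
            self.events.popleft()

    def average(self):
        return sum(v for _, v in self.events) / len(self.events)
```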

It should be possible to specify an additional parameter window_field, which would:

  • for window_points, count the number of times the field window_field occurred instead of the total number of messages;
  • for window_duration, use the field window_field as the timestamp from which the duration is determined.

Additionally, there should be a Boolean parameter persistent, which would specify whether the state of the window function should be stored on disk.

Example (common)

[transforms.rolling]
type = "rolling"
function = "count"
target_field = "count"

Example (advanced)

[transforms.rolling]
type = "rolling"
function = "average" # other values are `count`, `sum`, `rms`
window_type = "expanding" # other values are `exponential` and `rectangular`
window_duration = 3
persistent = true # false by default

field = "message" # source field, optional for `count` function
target_field = "message_count" # optional if `field` is specified
drop_field = false # optional

References

Inspired by the rolling window API in Pandas.

a-rodin avatar Dec 08 '19 16:12 a-rodin

Nice, I really like this idea, would have lots of uses.

Jeffail avatar Dec 08 '19 17:12 Jeffail

Agree, this would be very useful!

I think the primary challenge is that the Transform API currently only lets you produce an output event in response to an input event. This makes it hard to do things like produce outputs on a regular interval, since you never know how long it will be before your next input event.

When I last thought about this, the best idea I had was to introduce a new type of transform internally with an API more similar to that of sources (i.e. it gets passed an output channel). This would require a bit of work in topology building, but would give lots of flexibility.

lukesteensen avatar Dec 09 '19 20:12 lukesteensen

Blocked by #1598.

binarylogic avatar Feb 13 '20 00:02 binarylogic

@binarylogic Hi Ben! I see that issue #1598 has been closed; is there any plan to implement this transform? I posted issue #9231, which could be solved by this proposal. Many thanks!

xdatcloud avatar Sep 22 '21 08:09 xdatcloud

Would love to see this as well.

Telegraf had some support for this, but it has a simple sequential pipeline (you can't define how calculations flow or combine), so it would be awesome if Vector could support such aggregations.

Looks like a duplicate of https://github.com/vectordotdev/vector/issues/3668

lambdaq avatar May 13 '22 03:05 lambdaq

👍 I think #3668 is slightly different in that it is about aggregating metric events. This issue is about generating metrics from logs.

jszwedko avatar May 16 '22 22:05 jszwedko

@jszwedko Yeah, but we really do need them both. Aggregation should work on logs as well as metrics. They are JSON after all.

lambdaq avatar May 17 '22 03:05 lambdaq

Additional point of comparison: https://docs.fluentbit.io/manual/stream-processing/getting-started/fluent-bit-sql

jszwedko avatar Jan 06 '23 20:01 jszwedko