
feat: support building lists from events

Open kerinin opened this issue 1 year ago • 2 comments

Building lists from a series of values is a natural and useful operation. For example, if I have a stream of Foo events, I might want to combine them into a list of all the events to date:

Foo | concat()

Alternately, I might want to build a list of the 10 most recent:

Foo | concat(window=sliding(10, is_valid()))
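To make the intended semantics concrete, here is a rough Python sketch of the two queries above. It is only an illustration and not Kaskada code: the `collect` helper and its `max_len` parameter are hypothetical names, and it assumes every event is valid, so the sliding window reduces to "the most recent 10".

```python
from collections import deque
from typing import Iterable, Iterator, List, Optional


def collect(events: Iterable, max_len: Optional[int] = None) -> Iterator[List]:
    """Yield, after each event, the list of events accumulated so far.

    max_len=None keeps everything (roughly `Foo | concat()`).
    max_len=10 keeps only the 10 most recent events (roughly the
    sliding-window form, assuming every event is valid).
    """
    buffer = deque(maxlen=max_len)  # a bounded deque drops the oldest item
    for event in events:
        buffer.append(event)
        yield list(buffer)  # copy, so each emitted list is independent


all_so_far = list(collect(["a", "b", "c"]))
last_two = list(collect(["a", "b", "c"], max_len=2))
```

Here `all_so_far` grows with every event, while `last_two` never holds more than two items.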

Building a list from a time series allows users to operate over a portion of history as a single value. This has multiple potential applications:

  • Timeseries ML models
  • Graph visualizations of historical values
  • Aggregation across entities (although this may be better served with specialized syntax)

The function name used here (concat) may not be the best choice, but using the existing window and aggregation mechanisms seems like a good choice.

(note - depends on #378)

kerinin avatar May 24 '23 04:05 kerinin

I don't know that I like concat as an aggregation. Specifically, the default behavior (concatenate all the items) is decidedly degenerate. I think we may be better off with something like last_n(), which would take a constant n and produce a list of the last n values (it would basically correspond to list(lag(n), lag(n - 1), ..., lag(1), lag(0))). This would make it clearer how much memory is used.

It would still allow cases like feature | when(hourly()) | last_n(10) to get the value at the last 10 hours, or feature | when(is_valid(...)) | last_n(10) to get the last 10 values matching some predicate.
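As a rough illustration of the proposed behavior (a Python sketch, not Kaskada code), the generator below treats `when(...)` as a filter that only emits on matching rows, which is an assumption about how the two would compose. The `last_n` signature and the `predicate` parameter are hypothetical.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List


def last_n(
    values: Iterable,
    n: int,
    predicate: Callable[[object], bool] = lambda v: True,
) -> Iterator[List]:
    """Yield the last n values that satisfied `predicate`.

    Emits one list per matching value; rows that fail the predicate
    produce no output. Memory use is bounded by n.
    """
    window = deque(maxlen=n)  # never holds more than n items
    for value in values:
        if predicate(value):
            window.append(value)
            yield list(window)


evens = list(last_n([1, 2, 3, 4, 5, 6], n=2, predicate=lambda v: v % 2 == 0))
```

With n=2 and an "is even" predicate, `evens` is `[[2], [2, 4], [4, 6]]`; the buffer size is fixed by n regardless of how long the input is.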

bjchambers avatar May 24 '23 16:05 bjchambers

My concern with last_n is the loss of generality: it's not possible to express an operation like "a list of all the URLs viewed since the last purchase", or (to take the degenerate case) "the list of all badges earned by a user".
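To sketch the first case in Python (an illustration only, with a hypothetical event shape of `(kind, value)` tuples and a hypothetical function name): the list resets at each purchase, so its length depends on the data rather than on a constant n.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Event = Tuple[str, Optional[str]]  # hypothetical shape: (kind, url or None)


def urls_since_last_purchase(events: Iterable[Event]) -> Iterator[List[str]]:
    """Yield, after each event, the URLs viewed since the last purchase."""
    urls: List[str] = []
    for kind, value in events:
        if kind == "purchase":
            urls = []  # a purchase closes the current window
        elif kind == "page_view" and value is not None:
            urls.append(value)
        yield list(urls)  # copy, so earlier outputs are not mutated


history = list(
    urls_since_last_purchase(
        [
            ("page_view", "/a"),
            ("page_view", "/b"),
            ("purchase", None),
            ("page_view", "/c"),
        ]
    )
)
```

No fixed n captures this: a user who never purchases accumulates every URL they have viewed.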

It would be nice if we could guarantee linear costs for any query over any dataset, but I think the usability sacrifices we'd have to make wouldn't be worth it. It's already possible to produce degenerate behavior with something like shift_until(false), and the duration-based windows we've considered elsewhere have similar risks (e.g., preceding(years(1000000))).

Another concern with last_n is that it introduces a new way of describing windows. For example, would we support last_n(5, window=sliding(10, ...))? What would this mean?

kerinin avatar May 24 '23 18:05 kerinin