Improve and expand Expr.rolling()

Open orlp opened this issue 2 years ago • 27 comments

@ritchie46 added Expr.rolling with the interface of DataFrame.rolling. I think we can improve on this and instead follow an earlier proposal I made in a comment. To replicate it here (with some simplifications/corrections):

I propose we split up Expr.rolling into two separate rolling functions, .rolling() and .rolling_interval(). The former simply iterates over windows that have a fixed size in length (e.g. always 5 elements), whereas the latter iterates over windows that have a fixed interval size in value (e.g. an interval of 1 day, or an interval of floats) based on (another) column.


WindowPosition: TypeAlias = Literal["backward", "forward", "center"]

Expr.rolling(
    length: int,
    position: int | WindowPosition = "backward",
    *,
    min_samples: int = 0,
) → Self

This creates a sliding rolling window expression from an aggregate expression (one that maps a DataFrame to a Scalar, such as pl.col("a").sum()). This rolling window has a length, which is the number of elements in the window. Logically we position a window on each row in the DataFrame, and evaluate the aggregate expression on this window. The result thus has a length equal to the original DataFrame.

The window is positioned on each element according to position. I think an example is clearest here, with length = 5, with a window positioned on e:

             v
[a, b, c, d, e, f, g, h, i, k]

[a, b, c, d, e]                # backward
      [c, d, e, f, g]          # center
            [e, f, g, h, i]    # forward
               [f, g, h, i, k] # position = 1
         [d, e, f, g, h]       # position = -1

That is, forward starts the window at the given element, backward ends the window at the given element, center centers the window on the element, and an integer offset from the element itself can give you full control to position the window how you want otherwise (also allowing windows not including the element itself, for e.g. future aggregation).
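To make the positioning rules concrete, here is a minimal pure-Python sketch (not Polars code). The helper name window_bounds is hypothetical, and the floor division used for "center" is an assumption that only matters for even lengths; windows running past the edges are not handled here.

```python
from typing import Union


def window_bounds(i: int, length: int, position: Union[int, str] = "backward") -> tuple[int, int]:
    """Return the half-open index range [start, stop) of the window placed on row i."""
    if position == "backward":
        start = i - length + 1   # window ends at row i
    elif position == "forward":
        start = i                # window starts at row i
    elif position == "center":
        start = i - length // 2  # assumption: floor division for even lengths
    elif isinstance(position, int):
        start = i + position     # integer offset of the window start from row i
    else:
        raise ValueError(f"unknown position: {position!r}")
    return start, start + length


rows = list("abcdefghik")
i = rows.index("e")
for pos in ("backward", "center", "forward", 1, -1):
    start, stop = window_bounds(i, 5, pos)
    print(pos, rows[start:stop])
```

Each printed window should match the corresponding row of the diagram above.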

For windows near the edge of the DataFrame nulls are passed instead:

Backwards window of length 5:
                      v
                  [a, b, c, d]
[null, null, null, a, b]

In case you don't want more than a given number of nulls, you can use the min_samples parameter. min_samples = 0 allows any number of nulls in the window, whereas min_samples = length allows no nulls in the window. In case the sample threshold is not met, the aggregation function is not evaluated and null is returned instead.
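The edge padding and the min_samples threshold can be sketched in plain Python as follows. This is only an illustration of the described semantics, not the Polars implementation; rolling_backward and null_skipping_sum are hypothetical names, and None stands in for null.

```python
def rolling_backward(values, length, agg, min_samples=0):
    """Evaluate agg on a backward window of `length` rows placed on every row."""
    out = []
    for i in range(len(values)):
        # Pad with None where the window sticks out past the start of the data.
        window = [values[j] if j >= 0 else None for j in range(i - length + 1, i + 1)]
        non_null = sum(v is not None for v in window)
        # Below the threshold the aggregation is not evaluated at all.
        out.append(agg(window) if non_null >= min_samples else None)
    return out


def null_skipping_sum(window):
    return sum(v for v in window if v is not None)


data = [1, 2, 3, 4]
print(rolling_backward(data, 3, null_skipping_sum))                 # [1, 3, 6, 9]
print(rolling_backward(data, 3, null_skipping_sum, min_samples=3))  # [None, None, 6, 9]
```

The result always has the same length as the input, and the aggregation still receives the padded window (nulls included) whenever the threshold is met.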


The second function, rolling_interval, is very similar to the above, but with a crucial difference: it allows specifying a second column, by, which associates a value with each row (e.g. timestamps). It then positions the given window on each value and finds the other values that fall inside this window. Each matching row is included in the aggregation function and the result is returned.

Expr.rolling_interval(
    by: Expr,
    length: int | float | timedelta | str,
    position: int | float | timedelta | str | WindowPosition = "backward",
    *,
    min_samples: int = 0,
    closed: ClosedInterval = "both",
) → Self

The length of a window is now not indices, but rather an actual amount, such as 100.0, or 1 day. The position is still as before, e.g. a forward window starts at the given time, a backwards window ends at the given time, or a specific time / value delta position places the start of the window at time + position. If you wish to look one day in the past and three days in the future you can use a length of 4 days with a position of -1 days.

This wouldn't be limited to timeseries either; you could, for example, do the following to compute the mean price of all similarly sized houses (those whose area is within +/- 5 square meters):

pl.col("price").mean().rolling_interval("area", 10, "center").alias("similar_price")

Now, as an interval is ambiguous about what to do at the endpoints, this function also has a closed parameter which indicates whether the left endpoint, the right endpoint, neither, or both are closed.

A crucial difference between this function and rolling is that this one might have a (wildly) variable number of elements in a window. Nevertheless min_samples is still there to set a minimum number of elements, otherwise null will be returned for that window.
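A naive pure-Python sketch of these value-based windows is shown below, using the house example. It is quadratic and purely illustrative: the function name and argument order are hypothetical, nulls are ignored, and min_samples is assumed to count the rows that fall inside the window.

```python
import operator
from statistics import mean


def rolling_interval(by, values, length, position="backward", closed="both",
                     agg=sum, min_samples=0):
    """For every value x in `by`, aggregate the rows whose `by` value lies in the window."""
    lower_ok = operator.le if closed in ("left", "both") else operator.lt
    upper_ok = operator.le if closed in ("right", "both") else operator.lt
    out = []
    for x in by:
        if position == "backward":
            lo = x - length        # window ends at x
        elif position == "forward":
            lo = x                 # window starts at x
        elif position == "center":
            lo = x - length / 2    # window is centered on x
        else:
            lo = x + position      # value delta: window starts at x + position
        hi = lo + length
        window = [v for b, v in zip(by, values) if lower_ok(lo, b) and upper_ok(b, hi)]
        out.append(agg(window) if len(window) >= min_samples else None)
    return out


area = [50, 54, 60, 100]
price = [100.0, 120.0, 150.0, 300.0]
similar_price = rolling_interval(area, price, 10, "center", agg=mean)
print(similar_price)  # [110.0, 110.0, 150.0, 300.0]
```

Note how the number of rows per window varies: the first two houses share a window, while the last two only see themselves.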

orlp avatar Oct 26 '23 13:10 orlp

This would also (trivially) resolve https://github.com/pola-rs/polars/issues/12014.

orlp avatar Oct 26 '23 13:10 orlp

This has my blessing. I think it is good to split the functionality of fixed and dynamic window sizes. And if we can make this work for all expressions, with specialization for the obvious ones (sum, std, min, etc.), we have a very elegant, composable and fast API.

ritchie46 avatar Oct 26 '23 15:10 ritchie46

yeeeees very cool, this is exactly what I meant in my issue you linked! =)

A few things I would like to discuss:

Possibly merge rolling and rolling_interval?

For a smaller API surface it would be possible to merge rolling and rolling_interval into one function like pandas does. They have the same arguments except by which can be None by default (internally could still be different functions).

This would allow to do something like this:

df.rolling(3)...                      # fixed size window by rows; by=None by default
df.rolling(by='date', length='2d')... # 2 day dynamic size window using the date column
df.rolling('2d', by='date')...        # if we have `length` as first argument we could also do this

Add step argument to rolling

Just like pandas, it would be nice to have a step argument to rolling to create windows with a fixed size and fixed steps. This would be very useful for integer windows.

Examples:

# every 5 steps the sum of the last 10 rows
df.rolling(10, step=5).sum()

# something like a smoothing function (moving mean) for noisy signals that also reduces data size by a factor of 10
df.rolling(100, step=10).mean()

min_samples different from pandas min_periods?

I am not sure if the proposed min_samples should mimic pandas min_periods, but it seems different? Is this intended?

The pandas implementation works like this for min_periods=2 and window=5:

  • if there are at least 2 values in the 5 element window, drop all NaNs and calculate the result (usually interesting for bounds)
  • otherwise return NaN
  • default: min_periods=window -> return NaN if any NaN present

bring group_by_dynamic to the party :D

If we start this big implementation/refactoring I would suggest also renaming group_by_dynamic to something with rolling... to group all rolling functionality together.


Please let me know your thoughts on this. I am also happy to help with the implementation (code, documentation, examples, ...) where I can.

Julian-J-S avatar Oct 26 '23 16:10 Julian-J-S

Possibly merge rolling and rolling_interval?

I think this is a bad idea as you will have arguments that are only valid in the interval case and vice versa. They behave differently, why not make that explicit? A smaller API surface should be achieved by composability, not by merging functionality in a single method.

ritchie46 avatar Oct 26 '23 16:10 ritchie46

@JulianCologne

Possibly merge rolling and rolling_interval?

No. It's very much intended to separate these two functions as they take different parameters, behave differently, have different implementations, etc.

Add step argument to rolling

I would definitely be open to exploring that in the future, but I don't think it's necessary to land this feature right now, and would add complexity.

min_samples different from pandas min_periods?

I believe they're basically the same, except that pandas uses NaNs and we use nulls. What makes you believe that they're different?

Originally we did copy the min_periods name but ultimately @stinodego and I discussed and agreed that this name is a bit weird/vague, so we preferred min_samples.

bring group_by_dynamic to the party :D

It is our goal to, eventually, also do a pass like this to group_by_dynamic. However let's keep this limited in scope as to not overwhelm ourselves by trying to do everything at once.

orlp avatar Oct 26 '23 16:10 orlp

Great Feature! It can also resolve #12051

xyk2000 avatar Oct 26 '23 17:10 xyk2000

Possibly merge rolling and rolling_interval?

I think this is a bad idea as you will have arguments that are only valid in the interval case and vice versa. They behave differently, why not make that explicit? A smaller API surface should be achieved by composability, not by merging functionality in a single method.

Possibly merge rolling and rolling_interval?

No. It's very much intended to separate these two functions as they take different parameters, behave differently, have different implementations, etc.

haha okay, I have no problem with that ;) Was only an idea because pandas has them together which I find quite nice.

df.rolling(3).mean() # fixed length of 3
df.rolling('2d').max() # fixed duration of 2 days

The only difference would be by which could be None by default so it's not that different and the internal implementation would ofc be different. But 2 functions is completely fine, let's keep it like that.

I believe they're basically the same, except that pandas uses NaNs and we use nulls. What makes you believe that they're different?

Somehow I thought they sounded different when I first read it :D So if the current window has nonNullCount >= min_samples then the complete window, including possible null values, is passed to the aggregation function, correct?

Julian-J-S avatar Oct 26 '23 17:10 Julian-J-S

So if the current window has nonNullCount >= min_samples then the complete window, including possible null values, is passed to the aggregation function, correct?

Correct.

orlp avatar Oct 26 '23 17:10 orlp

I have recently been thinking a lot about all the rolling implementations and have been comparing them to different other implementations. I really love many of the awesome ideas in this issue and would like to suggest a few adjustments and would like to hear your thoughts on them.

Let me try to explain my reasoning using a simple 7 day centered rolling list example:

Starting point

from datetime import date

import polars as pl

df = pl.DataFrame({
    'date': pl.date_range(
        start=date(2020, 1, 1),
        end=date(2020, 1, 7),
        eager=True,
    ),
}).with_row_count('row')
┌─────┬────────────┐
│ row ┆ date       │
│ --- ┆ ---        │
│ u32 ┆ date       │
╞═════╪════════════╡
│ 0   ┆ 2020-01-01 │
│ 1   ┆ 2020-01-02 │
│ 2   ┆ 2020-01-03 │
│ 3   ┆ 2020-01-04 │
│ 4   ┆ 2020-01-05 │
│ 5   ┆ 2020-01-06 │
│ 6   ┆ 2020-01-07 │
└─────┴────────────┘

Desired result

┌────────────┬───────────────────────┐
│ date       ┆ rows                  │
│ ---        ┆ ---                   │
│ date       ┆ list[u32]             │
╞════════════╪═══════════════════════╡
│ 2020-01-01 ┆ [0, 1, 2, 3]          │
│ 2020-01-02 ┆ [0, 1, 2, 3, 4]       │
│ 2020-01-03 ┆ [0, 1, 2, 3, 4, 5]    │
│ 2020-01-04 ┆ [0, 1, 2, 3, 4, 5, 6] │
│ 2020-01-05 ┆ [1, 2, 3, 4, 5, 6]    │
│ 2020-01-06 ┆ [2, 3, 4, 5, 6]       │
│ 2020-01-07 ┆ [3, 4, 5, 6]          │
└────────────┴───────────────────────┘

Current implementation

(
    df
    .rolling(
        index_column='date',
        period='7d',
        offset='-4d',
    )
    .agg(
        rows=pl.col('row'),
    )
)

pros:

  • period of 7 days is very intuitive

cons:

  • offset of -4 days is not intuitive

This issue implementation

(
    df
    .rolling_interval(
        by='date',
        length='7d',
        position='center',
        closed='both',
    )
    .agg(
        rows=pl.col('row'),
    )
)

pros:

  • seemingly intuitive 7 days length and center position

cons:

  • is this actually correct? or will it go 3.5 days back and 3.5 days forward? somewhat unclear/ambiguous

My suggestion

(
    df
    .rolling_interval(
        by='date',
        preceding='3d',
        following='3d',
        closed='both',
    )
    .agg(
        rows=pl.col('row'),
    )
)

pros:

  • very intuitive 3 days preceding and 3 days following

My inspiration (duckdb / sql)

duckdb.sql('''
select
    date,
    list(row) over (
        order by date
        range between
            interval 3 days preceding and
            interval 3 days following
    ) as rows
from df
''')

Summary

proposed function signature:

df.rolling_interval(
    by: ...,
    preceding: ... = None,
    following: ... = None,
    closed: ...,
)

advantages:

  • very intuitive and clear
  • current row always included (this is what rolling should enforce)
  • "center": use equal amount of preceding and following
  • "backward": use only preceding; following defaults to None/'0'
  • "forward": use only following; preceding defaults to None/'0'
  • use any uneven combination of preceding/following (they may not be negative! This avoids windows outside the current row's by value, which is an anti-pattern)
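As a rough check of the preceding/following idea, the following pure-Python sketch reproduces the "Desired result" table for the 7-day example. It is not Polars code: rolling_rows is a hypothetical helper, and both ends of the window are treated as closed.

```python
from datetime import date, timedelta


def rolling_rows(dates, preceding, following):
    """For each date d, list the row indices whose date lies in [d - preceding, d + following]."""
    return [
        [row for row, other in enumerate(dates) if d - preceding <= other <= d + following]
        for d in dates
    ]


dates = [date(2020, 1, 1) + timedelta(days=k) for k in range(7)]
rows = rolling_rows(dates, timedelta(days=3), timedelta(days=3))
for d, r in zip(dates, rows):
    print(d, r)
```

With non-negative preceding and following, the current row is always part of its own window, which is the property argued for above.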

Note on the position parameter

  • I don't like that in the current proposal a position > 0 or position <= -length leads to windows completely outside the current row
  • I think this is an anti-pattern and should be avoided
  • windows functions should always have the current row/index/date either at the start/end or somewhere in the window but never outside the window

Julian-J-S avatar Nov 06 '23 21:11 Julian-J-S

Generally onboard

First, a question - in the first example,

             v
[a, b, c, d, e, f, g, h, i, k]

[a, b, c, d, e]                # backward
      [c, d, e, f, g]          # center
            [e, f, g, h, i]    # forward
               [f, g, h, i, k] # position = 1
         [f, g, h, i, k]       # position = -1

is the last row a typo? Should it have been

         [d, e, f, g, h]       # position = -1

?

Second, for rolling_interval, I think @JulianCologne's suggestion about preceding and following is a good one. It avoids making false promises about length, and avoids awkwardness around t - 1month + 1month not necessarily round-tripping.

Regarding enforcing including the current row - I think that's orthogonal, could we keep that to a separate discussion please? It could be done (or not) with or without the current proposal

MarcoGorelli avatar Nov 07 '23 09:11 MarcoGorelli

@JulianCologne Enforcing that the current row is in the window is unnecessarily limiting. Being able to aggregate over historical or future records I think is valuable.

@MarcoGorelli Yes that is a typo, let me fix that.

@MarcoGorelli Unfortunately, it is not orthogonal. We did discuss preceding/following internally but considered it less flexible precisely because it always includes the current record.

and avoids awkwardness around t - 1month + 1month not necessarily round-tripping.

I don't see why that wouldn't roundtrip. The rolling_interval would simply work with durations, and does not round to whole days or anything like that. If you wish things to be rounded to dates you should do that before passing it into the by.

orlp avatar Nov 07 '23 14:11 orlp

An example of not round-tripping would be if you use _saturating and are at the end of a month:

2020-07-31 - '1mo_saturating' + '1mo_saturating'
= 2020-06-30 + '1mo_saturating'
= 2020-07-30
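The arithmetic above can be reproduced with a small standard-library sketch. add_months_saturating is a hypothetical helper that mimics the saturating behavior described here (clamp to the last valid day of the target month); it is not the Polars implementation.

```python
import calendar
from datetime import date


def add_months_saturating(d: date, months: int) -> date:
    """Shift d by whole months, clamping the day to the end of the target month."""
    index = d.year * 12 + (d.month - 1) + months
    year, month = divmod(index, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return d.replace(year=year, month=month, day=min(d.day, last_day))


start = date(2020, 7, 31)
back = add_months_saturating(start, -1)
there_and_back = add_months_saturating(back, 1)
print(back, there_and_back)  # 2020-06-30 2020-07-30
```

The clamping step loses the original day of the month, which is why subtracting and then adding one month does not return to 2020-07-31.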

Does it have to include the current record? For example

    .rolling_interval(
        by='date',
        preceding='0d',  # the default
        following='3d',
        closed='right',
    )

would exclude the current record. If negative duration were accepted by preceding and following, then you could get arbitrarily further away from the current record, e.g. preceding='-1d', following='3d'


With length and position, if you were to write

    .rolling_interval(
        by='date',
        length='1mo',
        position='-1mo_saturating',
        closed='right',
    )

then if 'date' is 2022-07-31, could you please show what you'd expect the window to be?

MarcoGorelli avatar Nov 07 '23 14:11 MarcoGorelli

Enforcing that the current row is in the window is unnecessarily limiting. Being able to aggregate over historical or future records I think is valuable

@orlp Could you please give a concrete example / use case? I am honestly really curious to know. Currently I would argue that this is an anti-pattern and that the core idea of rolling is to create groups surrounding or adjacent to the current value. But as @MarcoGorelli explained, it would still be possible to create windows outside the current row using negative preceding/following.

@orlp Could you explain how this would work:

.rolling_interval(
    by='date',  # date: 2020-01-31
    length= '1w' | '1mo',
    position='center',
    closed='both',
)

some questions / uncertainties

  • how should it "center"/"split" 1w to the left/right of the current row? 3.5 days is not possible with Date. The correct solution would probably be to use length='6d' + closed='both' which is confusing for the user who wants a 7day window
  • how should one split 1mo? Just take the length of the month of the current row? What if it is not even and you would get 0.5 days?

I do not have a good solution/idea for these problems with the proposed api/parameters but my idea of using preceding and following (which many people know using sql) would fix these and might be more intuitive to use (and implement)

# 7 days centered
.rolling_interval(
    by='date',
    preceding='3d',
    following='3d',
    closed='both',
)

# 1 week window starting 2 weeks in the past (window completely outside current row; imo anti-pattern but still possible)
.rolling_interval(
    by='date',
    preceding='2w',
    following='-1w',
    closed='both',
)

Julian-J-S avatar Nov 07 '23 15:11 Julian-J-S

@orlp Could you please give a concrete example / use case? I am honestly really curious to know.

@JulianCologne Here are some examples:

  • For each house on the market, find the average price if the house had 5-15 more m^2 of area based on other houses on the market.
  • For each day correlate the number of blog posts about Apple that day with the average AAPL stock price over a 7 day window exactly one month later.
  • Compute the average temperature for 7 days, exactly one year ago, and compute the difference with today.

As far as I'm concerned, allowing intervals that do not include the original value is non-negotiable, as it would be a completely arbitrary restriction disallowing efficient processing of useful queries like the above.


So I had a talk with @MarcoGorelli and we agreed that the current design of rolling_interval is problematic for supporting some calendar-based operations. For calendar dates a forwards window of length N might not be the same as a backwards window of length N starting N units later. Consider a window of 1 month ending at the 30th of March 2001. This would be (2001-02-28, 2001-03-30], whereas a window of 1 month starting at the 28th of February 2001 would be [2001-02-28, 2001-03-28). Even disregarding the closedness of the intervals, the dates are different.

So I would like to propose the following changes:

  1. Separate the function of position and offset. I propose we enforce position as a required argument, defaulting to "backward".

  2. To still allow full flexibility, including intervals that do not include the current value, we add a new argument offset. If the window is forward positioned, the interval will be [x + offset, (x + offset) + length]; if it is backward, it will be [(x + offset) - length, x + offset]; and if it is centered, it will be [(x + offset) - length / 2, (x + offset) + length / 2]. To ease @JulianCologne's concerns, position = "center" will not be allowed for calendar-based lengths, as it is ill-defined.

    The length must be non-negative, but the offset may be negative.

  3. To make intervals more natural I propose we change the default parameter of closed to None, which would indicate a default value of "left" for position = "forward", a value of "right" for position = "backward", and a value of "both" for position = "center".

Expr.rolling_interval(
    by: Expr,
    length: int | float | timedelta | str,
    position: str = "backward",
    *,
    offset: int | float | timedelta | str = 0,
    closed: ClosedInterval | None = None,
    min_samples: int = 0,
) → Self
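The three proposed changes can be summarized in a small sketch for plain numeric values. This is only an illustration of the rules as stated above, not an implementation: interval_bounds is a hypothetical helper, and calendar-based lengths are deliberately left out.

```python
def interval_bounds(x, length, position="backward", offset=0, closed=None):
    """Return (lower, upper, closed) for the window positioned on value x."""
    if length < 0:
        raise ValueError("length must be non-negative (offset may be negative)")
    anchor = x + offset
    if position == "forward":
        lower, upper, default_closed = anchor, anchor + length, "left"
    elif position == "backward":
        lower, upper, default_closed = anchor - length, anchor, "right"
    elif position == "center":
        lower, upper, default_closed = anchor - length / 2, anchor + length / 2, "both"
    else:
        raise ValueError(f"unknown position: {position!r}")
    # closed=None means "pick the natural default for this position".
    resolved = closed if closed is not None else default_closed
    return lower, upper, resolved


print(interval_bounds(100.0, 10.0))                         # (90.0, 100.0, 'right')
print(interval_bounds(100.0, 10.0, "forward", offset=5.0))  # (105.0, 115.0, 'left')
print(interval_bounds(100.0, 10.0, "center"))               # (95.0, 105.0, 'both')
```

The second call corresponds to the "5-15 more m^2" house example: a forward window of length 10 whose start is shifted by an offset of 5, so it does not contain the current value.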

orlp avatar Nov 07 '23 17:11 orlp

  • For each house on the market, find the average price if the house had 5-15 more m^2 of area based on other houses on the market.
  • For each day correlate the number of blog posts about Apple that day with the average AAPL stock price over a 7 day window exactly one month later.
  • Compute the average temperature for 7 days, exactly one year ago, and compute the difference with today.

Nice! Great examples, thanks a lot!


  1. Separate the function of position and offset. I propose we enforce position as a required argument, defaulting to "backward".

Good idea!


I think there are 2 general ways to create windows

  • start (offset) + size (length) + alignment (position)
  • start + end

Both have pros and cons and are more or less ergonomic depending on the concrete use case. The question is: does it make sense to have them both? Is it enough to have only one of them? Can they be combined?

One example of what start+end can do that offset+length+position cannot do is the following:

  • window: [t - 2 month, t + 1 month] for date 2023-01-31

start+end

  • start='-2mo', end='1mo': [2022-11-30, 2023-02-28] ("correct")

offset+length+position

  • offset='1mo', length='3mo', position='backward': [2022-11-28, 2023-02-28] ("wrong" start) but you could argue that this works:
  • offset='-2mo', length='3mo', position='forward': [2022-11-30, 2023-02-28] (correct!) but what about for 2023-04-30
  • offset='-2mo', length='3mo', position='forward': [2023-02-28, 2023-05-28] ("wrong" end)

This might be corner cases but still useful to know that there are use cases that cannot be implemented.

MAYBE it is possible to combine both ideas and add an optional end parameter (and rename offset to start)

Expr.rolling_interval(
    by: Expr,
    length: int | float | timedelta | str,
    position: str = "backward",
    *,
    start: int | float | timedelta | str = 0, # offset -> start
    end: ... = None,  # NEW
    closed: ClosedInterval | None = None,
    min_samples: int = 0,
) → Self

IF end is specified it will create a window from start to end otherwise it will use the start/length/position.


rolling questions:

  • how would "center" work with length of even values?
  • splitting position/offset also makes sense here, right?

rolling/rolling_interval questions:

  • does it make sense to rename position to direction/alignment
  • position sounds more like "WHERE" the window is and not "HOW" it is aligned?

In general I really like your proposed solution, it has some great ideas!

Julian-J-S avatar Nov 08 '23 09:11 Julian-J-S

@JulianCologne There is a subtle but large ergonomics issue with start/end: there is no sensible default value for closed. If used on a date column, closed="both" would make start="-1w", end=0 cover 8 days. closed="right" would fix that but would make start=0, end="1w" not include today.

One example of what start+end can do that offset+length+position cannot do is the following: window: [t - 2 month, t + 1 month] for date 2023-01-31

Honestly, what does a user even mean by the window [t - 2 month, t + 1 month]? What's the use case? I feel like you can't mark any particular behavior as 'correct', any answer that gets fairly close with a reasonable interpretation here is fair game. I personally don't consider this a strong argument in favor of start + end.

And start/end would make different things harder to express, like for example a backwards window of one week, one year ago. This would require start="1y - 1w", something we don't currently support.

MAYBE it is possible to combine both ideas and add an optional end parameter (and rename offset to start)

I really don't see how without introducing mutually exclusive parameters / really confusing behavior switching.

how would "center" work with length of even values?

It would specify in the docs that for even length values either the first or second half will be 1 larger. That is, the docs will specify exactly which of the two it is, I just haven't decided yet what is more logical. Perhaps I will retroactively decide after writing the code to see what's programmatically more natural, because at this point I see no reason to prefer one over the other.

splitting position/offset also makes sense here ([rolling]), right?

It probably does, yes. A forward window would be indices [offset, offset + len), a backward window indices (offset - len, offset], and a centered window probably [offset - len // 2, offset - len // 2 + len). This choice would thus mean centered even-sized windows would place the central point right before the element, e.g. a window of size 4:

             v
[a, b, c, d, e, f, g, h, i, k]
            | center
      [c, d, e, f]

does it make sense to rename position to direction/alignment

I did consider it. I don't really like direction because center isn't a direction. I'm unsure on alignment... perhaps. Will have to think about it some more.

orlp avatar Nov 08 '23 12:11 orlp

thanks @orlp for all the explanations.

One last discussion for now 😆 :

There is only one little quirk for me which is position='center'

On the one hand I love it because it is a great utility function. On the other hand I find a lot of ideas confusing:

For continuous data (float, datetime): e.g. area=100.0 (float):

  • length=10.0, position='backward' --> >90..=100
  • length=10.0, position='center' --> =95..=105
  • makes total sense

For discrete data (int, date): e.g. age=30 (int)

  • length=4, position='backward' --> >26..=30 --> [27, 28, 29, 30] (clear)

  • length=4, position='center' --> =28..=32 --> [28, 29, 30, 31, 32] ? Is this the expected result?

  • length=3, position='backward' --> >27..=30 --> [28, 29, 30] (clear)

  • length=3, position='center' --> ?? --> user might expect [29, 30, 31] 3 days (length) centered at 30

  • problem: might produce unexpected behaviour

    • "closed" is somewhat tricky for discrete data
    • the resulting values will always be 1 more than the length of the window!
    • '7d' centered will produce 8 days!
    • users enter the length but usually mean "how many days" they want

Julian-J-S avatar Nov 08 '23 13:11 Julian-J-S

@JulianCologne Perhaps the default should be closed="left" also for position="center". That way for discrete data the length is always respected by default. The user can always choose closed="both" if they know they have continuous data and it matters to them.

orlp avatar Nov 08 '23 14:11 orlp

@orlp honestly, I have no good idea atm. Need to think about that.

I am still wondering how the window gets "centered" for discrete data

  • age=30, length=4: is the intended window: age - length/2, age + length/2 --> 28-32 ?
  • age=30, length=3: how does this work: age - length/2, age + length/2 --> 28.5-31.5 ???
  • ALTERNATIVELY for discrete data only when centering you subtract 1 from length to account for the "center" position:
  • age=30, length=3: age - (length-1)/2, age + (length-1)/2 --> 29-31

As I said, I like the "center" idea but currently there is no solution for centering discrete values (dates, ints) with the expected result, which is a core use case for "center", right? I am thinking like:

rolling_interval(
    by = 'date',
    length = '1w',
    position = "center",
)

Users might expect this to work because I can center 1w around a single date by taking 3 days before, 3 days after and the current one which is 7.

Julian-J-S avatar Nov 08 '23 14:11 Julian-J-S

@JulianCologne The only problem in the original proposal is the default choice of closed="both" for center. As I already mentioned for rolling, a centered window with discrete values will be the range offset - len // 2 to offset - len // 2 + len, which has exactly len elements if the range is half-open (either left or right).

Users might expect this to work because I can center 1w around a single date by taking 3 days before, 3 days after and the current one which is 7.

Also as I already said, position = "center" will not be supported with calendar-based lengths, like week or day. If a user wishes to center that they must either use an offset, or use a non-calendar based length, such as 7 * 24 hours.

orlp avatar Nov 08 '23 16:11 orlp

thanks @orlp

I actually think closed='both' is a good choice. With center I expect by default a "perfect" centering. Example: int age=30

  • length=4 & length=5 -> [28, 29, 30, 31, 32]
  • feels very natural. You can think of 4 just splits into 2 to the left and right. And 5 as an odd number splits 2 left&right and 1 for the center

with closed='left' this would feel awkward to call this "centered" around 30.

  • length=4 & length=5 -> [28, 29, 30, 31]
  • problem 1) not centered
  • problem 2) with ranges you normally never get fewer elements than the length (length 5 but only 4 elements). Either you get the same amount with closed='left/right' or 1 more when closed='both'

Julian-J-S avatar Nov 09 '23 07:11 Julian-J-S

@JulianCologne I think you misunderstand. length=4 for integers centered around 30 would indeed be [28, 29, 30, 31], but length=5 would be [28, 29, 30, 31, 32] with closed="left" as the default. Please read the definition carefully (flooring division for integral types):

[offset - length / 2, offset - length / 2 + length)

For length = 5 around 30 that would be 30 - 5/2 = 28 and 30 - 5/2 + 5 = 33 thus giving the range [28, 33) which includes 28, 29, 30, 31, 32.

Yes it's a bit "weird" that for even lengths and integral data the window isn't truly centered, but I think respecting the user's choice of window length is far more important than it being perfectly centered.
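The definition above is easy to check for integer data with a couple of lines of Python. centered_window is a hypothetical helper; it uses floor division and a half-open range, i.e. closed="left".

```python
def centered_window(center: int, length: int) -> list[int]:
    """Integers in [center - length // 2, center - length // 2 + length)."""
    start = center - length // 2
    return list(range(start, start + length))


print(centered_window(30, 4))  # [28, 29, 30, 31]
print(centered_window(30, 5))  # [28, 29, 30, 31, 32]
```

In both cases the window contains exactly length elements; only the even case is off-center by one.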

orlp avatar Nov 09 '23 07:11 orlp

Curious whether this proposal would allow custom clocks. When using rolling_interval, could the by column be, for example, seconds from some epoch (as opposed to a Unix timestamp), say seconds from midnight? Or some other notion of clock entirely (for example, a clock based on the volume traded of an asset as opposed to time)?

kszlim avatar Jan 19 '24 22:01 kszlim

@kszlim As long as you create a column with a timestamp associated for each value as per your clock, rolling_interval should do that.

orlp avatar Jan 22 '24 11:01 orlp

@kszlim As long as you create a column with a timestamp associated for each value as per your clock, rolling_interval should do that.

@orlp I'm not sure if you understand what I mean, imagine you have just two columns a and b, and you want a rolling window that's based off the value of b (which is a monotonically increasing number but specifically isn't a time datatype.)

I.e. you have a table:

a  b
0  1
3  5
1  6
1  7
2  9
6  11

And you do:

pl.col("a").sum().rolling("b", period="3v").alias("value clock")

Would generate

a  b   value clock
0  1   0
3  5   3
1  6   4
1  7   5
2  9   4
6  11  8

Ie. The windows are sized by the value in your index column, but doesn't necessarily have to be of type timestamp.

It would also be nice if it worked with "b" expressed in terms of diffs instead of cum_sum.

kszlim avatar Jan 22 '24 17:01 kszlim

@kszlim Your example would be computed by pl.col("a").sum().rolling_interval("b", 3). rolling_interval would not require anything time-based, just a weakly monotonically increasing column of a type that you can add/subtract.
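For illustration, the table from the question can be reproduced in plain Python by exploiting the monotonicity requirement with binary search. This is a sketch, not Polars code: rolling_sum_by_value is a hypothetical helper, and it assumes a backward window with both ends closed, [b - 3, b], which is the interpretation that matches the values in that table.

```python
from bisect import bisect_left, bisect_right


def rolling_sum_by_value(by, values, length):
    """Sum `values` over the backward window [x - length, x] for each x in sorted `by`."""
    out = []
    for x in by:
        lo = bisect_left(by, x - length)  # first row with by >= x - length
        hi = bisect_right(by, x)          # one past the last row with by <= x
        out.append(sum(values[lo:hi]))
    return out


a = [0, 3, 1, 1, 2, 6]
b = [1, 5, 6, 7, 9, 11]
value_clock = rolling_sum_by_value(b, a, 3)
print(value_clock)  # [0, 3, 4, 5, 4, 8]
```

Nothing here depends on b being a time type; it only needs to be sorted and to support subtraction and comparison.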

orlp avatar Jan 22 '24 18:01 orlp

Ah, awesome, thanks!

kszlim avatar Jan 22 '24 19:01 kszlim