polars icon indicating copy to clipboard operation
polars copied to clipboard

Center parameter in rolling_sum doesn’t work with timedelta windows

Open vstolin opened this issue 3 years ago • 3 comments
trafficstars

What language are you using?

Python

Have you tried latest version of polars?

  • [yes]

What version of polars are you using?

0.13.46

What operating system are you using polars on?

CentoOS 7

What language version are you using

python 3.9

Describe your bug.

While using rolling_sum function I would expect that ‘window by time’ and ‘window by index’ columns would produce identical results, but this does not seem to be the case

What are the steps to reproduce the behavior?

import polars as ps
from datetime import datetime
print(ps.__version__)
df = ps.DataFrame({
    'dt': [datetime(2022, 1, 1, 1, 0), datetime(2022, 1, 1, 1, 1), datetime(2022, 1, 1, 1, 2), datetime(2022, 1, 1, 1, 3), datetime(2022, 1, 1, 1, 4)],
    'val': [1, 1, 1, 1, 1],
})
print(df)
df_roll = df.select([
    ps.col('val').rolling_sum(window_size='3m', by='dt', closed='both', center=True).alias('window by time'),
    ps.col('val').rolling_sum(window_size='3i', center=True).alias('window by index'),
])
print(df_roll)

vstolin avatar Jun 16 '22 18:06 vstolin

This is expected result. It is not window by 'index'` but by integer. That's the integer representation of the column.

ritchie46 avatar Jun 17 '22 06:06 ritchie46

Sorry for not being clear in my post It seems that 'center' parameter is not used in timedelta windows For example, I expected both first and last rows to be 2 when center=True

import polars as ps
from datetime import datetime
print(ps.__version__)
df = ps.DataFrame({
    'dt': [datetime(2022, 1, 1, 1, 0), datetime(2022, 1, 1, 1, 1), datetime(2022, 1, 1, 1, 2), datetime(2022, 1, 1, 1, 3), datetime(2022, 1, 1, 1, 4)],
    'val': [1, 1, 1, 1, 1],
})
print(df)
df_roll = df.select([
    ps.col('val').rolling_sum(window_size='3m', by='dt', closed='both', center=True).alias('center True'),
    ps.col('val').rolling_sum(window_size='3m', by='dt', closed='both', center=False).alias('center False'),
])
print(df_roll)

vstolin avatar Jun 17 '22 18:06 vstolin

I agree this is not the correct behaviour, and this can be seen quite easily by comparing with non -timedelta windows. In the example, all datetimes are spaced 1 minute apart:

df = pl.DataFrame({
    'dt': [datetime(2022, 1, 1, 1, 0), datetime(2022, 1, 1, 1, 1), datetime(2022, 1, 1, 1, 2), datetime(2022, 1, 1, 1, 3), datetime(2022, 1, 1, 1, 4)],
    'val': [1, 1, 1, 1, 1],
})

i.e.

┌─────────────────────┬─────┐
│ dt                  ┆ val │
│ ---                 ┆ --- │
│ datetime[μs]        ┆ i64 │
╞═════════════════════╪═════╡
│ 2022-01-01 01:00:00 ┆ 1   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2022-01-01 01:01:00 ┆ 1   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2022-01-01 01:02:00 ┆ 1   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2022-01-01 01:03:00 ┆ 1   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2022-01-01 01:04:00 ┆ 1   │
└─────────────────────┴─────┘

We can then define the following:

df_roll = df.select([
    pl.col('val').rolling_sum(window_size='3m', by='dt', closed='both', center=True).alias('dt - center True'),
    pl.col('val').rolling_sum(window_size='3m', by='dt', closed='both', center=False).alias('dt - center False'),
    pl.col('val').rolling_sum(window_size=3, closed='both', center=True).alias('int - center True'),
    pl.col('val').rolling_sum(window_size=3, closed='both', center=False).alias('int - center False'),
   ])
┌──────────────────┬───────────────────┬───────────────────┬────────────────────┐
│ dt - center True ┆ dt - center False ┆ int - center True ┆ int - center False │
│ ---              ┆ ---               ┆ ---               ┆ ---                │
│ i64              ┆ i64               ┆ i64               ┆ i64                │
╞══════════════════╪═══════════════════╪═══════════════════╪════════════════════╡
│ 1                ┆ 1                 ┆ null              ┆ null               │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2                ┆ 2                 ┆ 3                 ┆ null               │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3                ┆ 3                 ┆ 3                 ┆ 3                  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4                ┆ 4                 ┆ 3                 ┆ 3                  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4                ┆ 4                 ┆ null              ┆ 3                  │
└──────────────────┴───────────────────┴───────────────────┴────────────────────┘

It should be noted that due to closed='both', the value will be four for the date syntax. However, there are still two inconsistencies:

  1. if center=Truewould be respected, the first row should have a higher value for "dt - center True" than for "dt - center False", as it should not just look back, but also forward, and thus have a higher value.
  2. min_periods seems to not be respected when the string syntax is being used for window_size. The default is to set it to be the same as the window_size according to the docs, and this works indeed for int. However, for strings:
df_roll2 = df.select([
    pl.col('val').rolling_sum(window_size='3m', by='dt', closed='both', center=True).alias('default'),
    pl.col('val').rolling_sum(window_size='3m', by='dt', closed='both', center=True, min_periods=3).alias('3'),
    pl.col('val').rolling_sum(window_size='3m', by='dt', closed='both', center=True, min_periods=1).alias('1'),
   ])
┌─────────┬─────┬─────┐
│ default ┆ 3   ┆ 1   │
│ ---     ┆ --- ┆ --- │
│ i64     ┆ i64 ┆ i64 │
╞═════════╪═════╪═════╡
│ 1       ┆ 1   ┆ 1   │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2       ┆ 2   ┆ 2   │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3       ┆ 3   ┆ 3   │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4       ┆ 4   ┆ 4   │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4       ┆ 4   ┆ 4   │
└─────────┴─────┴─────┘

i.e. no impact at all.

zundertj avatar Jul 16 '22 17:07 zundertj

This is still relevant. Here's a sample test case:

from datetime import datetime

import polars as pl
from polars.testing import assert_frame_equal


def test_rolling_sum_temporal_center() -> None:
    df = pl.DataFrame(
        {
            "dt": [
                datetime(2022, 1, 1, 1, 0),
                datetime(2022, 1, 1, 1, 1),
                datetime(2022, 1, 1, 1, 2),
                datetime(2022, 1, 1, 1, 3),
                datetime(2022, 1, 1, 1, 4),
            ],
            "val": [1, 1, 1, 1, 1],
        }
    ).set_sorted("dt")

    result = df.select(
        pl.col("val").rolling_sum(window_size="3m", by="dt", closed="both", center=True)
    )
    expected = pl.Series("val", [None, 3, 3, 3, None]).to_frame()
    assert_frame_equal(result, expected)

stinodego avatar Oct 14 '23 13:10 stinodego

hadn't noticed this one, thanks for the triage - time to investigate 🔬

MarcoGorelli avatar Oct 14 '23 18:10 MarcoGorelli