polars
polars copied to clipboard
Center parameter in rolling_sum doesn’t work with timedelta windows
What language are you using?
Python
Have you tried latest version of polars?
- [yes]
What version of polars are you using?
0.13.46
What operating system are you using polars on?
CentoOS 7
What language version are you using
python 3.9
Describe your bug.
While using rolling_sum function I would expect that ‘window by time’ and ‘window by index’ columns would produce identical results, but this does not seem to be the case
What are the steps to reproduce the behavior?
import polars as ps
from datetime import datetime
print(ps.__version__)
df = ps.DataFrame({
'dt': [datetime(2022, 1, 1, 1, 0), datetime(2022, 1, 1, 1, 1), datetime(2022, 1, 1, 1, 2), datetime(2022, 1, 1, 1, 3), datetime(2022, 1, 1, 1, 4)],
'val': [1, 1, 1, 1, 1],
})
print(df)
df_roll = df.select([
ps.col('val').rolling_sum(window_size='3m', by='dt', closed='both', center=True).alias('window by time'),
ps.col('val').rolling_sum(window_size='3i', center=True).alias('window by index'),
])
print(df_roll)
This is expected result. It is not window by 'index'` but by integer. That's the integer representation of the column.
Sorry for not being clear in my post It seems that 'center' parameter is not used in timedelta windows For example, I expected both first and last rows to be 2 when center=True
import polars as ps
from datetime import datetime
print(ps.__version__)
df = ps.DataFrame({
'dt': [datetime(2022, 1, 1, 1, 0), datetime(2022, 1, 1, 1, 1), datetime(2022, 1, 1, 1, 2), datetime(2022, 1, 1, 1, 3), datetime(2022, 1, 1, 1, 4)],
'val': [1, 1, 1, 1, 1],
})
print(df)
df_roll = df.select([
ps.col('val').rolling_sum(window_size='3m', by='dt', closed='both', center=True).alias('center True'),
ps.col('val').rolling_sum(window_size='3m', by='dt', closed='both', center=False).alias('center False'),
])
print(df_roll)
I agree this is not the correct behaviour, and this can be seen quite easily by comparing with non -timedelta windows. In the example, all datetimes are spaced 1 minute apart:
df = pl.DataFrame({
'dt': [datetime(2022, 1, 1, 1, 0), datetime(2022, 1, 1, 1, 1), datetime(2022, 1, 1, 1, 2), datetime(2022, 1, 1, 1, 3), datetime(2022, 1, 1, 1, 4)],
'val': [1, 1, 1, 1, 1],
})
i.e.
┌─────────────────────┬─────┐
│ dt ┆ val │
│ --- ┆ --- │
│ datetime[μs] ┆ i64 │
╞═════════════════════╪═════╡
│ 2022-01-01 01:00:00 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2022-01-01 01:01:00 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2022-01-01 01:02:00 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2022-01-01 01:03:00 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2022-01-01 01:04:00 ┆ 1 │
└─────────────────────┴─────┘
We can then define the following:
df_roll = df.select([
pl.col('val').rolling_sum(window_size='3m', by='dt', closed='both', center=True).alias('dt - center True'),
pl.col('val').rolling_sum(window_size='3m', by='dt', closed='both', center=False).alias('dt - center False'),
pl.col('val').rolling_sum(window_size=3, closed='both', center=True).alias('int - center True'),
pl.col('val').rolling_sum(window_size=3, closed='both', center=False).alias('int - center False'),
])
┌──────────────────┬───────────────────┬───────────────────┬────────────────────┐
│ dt - center True ┆ dt - center False ┆ int - center True ┆ int - center False │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞══════════════════╪═══════════════════╪═══════════════════╪════════════════════╡
│ 1 ┆ 1 ┆ null ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 3 ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 3 ┆ 3 ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 4 ┆ 3 ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 4 ┆ null ┆ 3 │
└──────────────────┴───────────────────┴───────────────────┴────────────────────┘
It should be noted that due to closed='both', the value will be four for the date syntax. However, there are still two inconsistencies:
- if
center=Truewould be respected, the first row should have a higher value for "dt - center True" than for "dt - center False", as it should not just look back, but also forward, and thus have a higher value. min_periodsseems to not be respected when the string syntax is being used forwindow_size. The default is to set it to be the same as thewindow_sizeaccording to the docs, and this works indeed for int. However, for strings:
df_roll2 = df.select([
pl.col('val').rolling_sum(window_size='3m', by='dt', closed='both', center=True).alias('default'),
pl.col('val').rolling_sum(window_size='3m', by='dt', closed='both', center=True, min_periods=3).alias('3'),
pl.col('val').rolling_sum(window_size='3m', by='dt', closed='both', center=True, min_periods=1).alias('1'),
])
┌─────────┬─────┬─────┐
│ default ┆ 3 ┆ 1 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════════╪═════╪═════╡
│ 1 ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 2 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 3 ┆ 3 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 4 ┆ 4 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 4 ┆ 4 │
└─────────┴─────┴─────┘
i.e. no impact at all.
This is still relevant. Here's a sample test case:
from datetime import datetime
import polars as pl
from polars.testing import assert_frame_equal
def test_rolling_sum_temporal_center() -> None:
df = pl.DataFrame(
{
"dt": [
datetime(2022, 1, 1, 1, 0),
datetime(2022, 1, 1, 1, 1),
datetime(2022, 1, 1, 1, 2),
datetime(2022, 1, 1, 1, 3),
datetime(2022, 1, 1, 1, 4),
],
"val": [1, 1, 1, 1, 1],
}
).set_sorted("dt")
result = df.select(
pl.col("val").rolling_sum(window_size="3m", by="dt", closed="both", center=True)
)
expected = pl.Series("val", [None, 3, 3, 3, None]).to_frame()
assert_frame_equal(result, expected)
hadn't noticed this one, thanks for the triage - time to investigate 🔬