First rows get ignored by `group_by_dynamic` when using `offset`

Open michelbl opened this issue 5 months ago • 4 comments

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

from datetime import UTC, datetime, timedelta

import polars as pl

df = pl.DataFrame(
    data={
        "t": pl.Series(
            [
                datetime(2024, 3, 22, 3, 0, tzinfo=UTC),
                datetime(2024, 3, 22, 4, 0, tzinfo=UTC),
                datetime(2024, 3, 22, 5, 0, tzinfo=UTC),
                datetime(2024, 3, 22, 6, 0, tzinfo=UTC),
            ]
        ).dt.cast_time_unit("ms"),
        "v": [1, 10, 100, 1000],
    }
).set_sorted("t")

resampled = df.group_by_dynamic(
    index_column="t", every="1d", offset=timedelta(hours=5)
).agg(
    [
        pl.sum("v").alias("v"),
    ]
)

print(resampled)

Log output

I ran POLARS_VERBOSE=1 python polars_bug_report.py but did not see any output in stderr.

However on stdout I get the result of the aggregation:

shape: (1, 2)
┌─────────────────────────┬──────┐
│ t                       ┆ v    │
│ ---                     ┆ ---  │
│ datetime[ms, UTC]       ┆ i64  │
╞═════════════════════════╪══════╡
│ 2024-03-22 05:00:00 UTC ┆ 1100 │
└─────────────────────────┴──────┘

Issue description

When `group_by_dynamic` is called with the `offset` parameter, rows at the start of the data that fall before the first offset-shifted window boundary are dropped from the result.

In my example, I use a daily aggregation with an offset of 5 hours. The first rows fall between midnight and 5 AM, and they are not counted in the result.

Expected behavior

Every row should count:

shape: (1, 2)
┌─────────────────────────┬──────┐
│ t                       ┆ v    │
│ ---                     ┆ ---  │
│ datetime[ms, UTC]       ┆ i64  │
╞═════════════════════════╪══════╡
│ 2024-03-21 05:00:00 UTC ┆   11 │
│ 2024-03-22 05:00:00 UTC ┆ 1100 │
└─────────────────────────┴──────┘
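To make the expected bucketing explicit, here is a minimal pure-Python sketch of the window assignment I have in mind. It is not how Polars computes its windows internally; `window_start` and `sum_by_window` are hypothetical helper names used only for illustration, and the sketch assumes windows are aligned to the Unix epoch and then shifted by `offset`.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Assumption: windows are aligned to the Unix epoch, then shifted by `offset`.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(ts: datetime, every: timedelta, offset: timedelta) -> datetime:
    """Return the start of the offset-shifted window that contains `ts`."""
    # Undo the offset, floor to a whole multiple of `every`, re-apply the offset.
    shifted = ts - offset - EPOCH
    return EPOCH + (shifted // every) * every + offset


def sum_by_window(rows, every: timedelta, offset: timedelta) -> dict:
    """Sum the values of (timestamp, value) pairs per window start."""
    totals = defaultdict(int)
    for ts, value in rows:
        totals[window_start(ts, every, offset)] += value
    return dict(sorted(totals.items()))


# Same timestamps and values as in the reproducible example.
rows = [
    (datetime(2024, 3, 22, 3, 0, tzinfo=timezone.utc), 1),
    (datetime(2024, 3, 22, 4, 0, tzinfo=timezone.utc), 10),
    (datetime(2024, 3, 22, 5, 0, tzinfo=timezone.utc), 100),
    (datetime(2024, 3, 22, 6, 0, tzinfo=timezone.utc), 1000),
]

result = sum_by_window(rows, every=timedelta(days=1), offset=timedelta(hours=5))
for start, total in result.items():
    print(start.isoformat(), total)
# 2024-03-21T05:00:00+00:00 11
# 2024-03-22T05:00:00+00:00 1100
```

Under this rule the 03:00 and 04:00 rows land in the window that starts on 2024-03-21 at 05:00, which is the first row of the expected table above.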

Installed versions

--------Version info---------
Polars:               0.20.16
Index type:           UInt32
Platform:             Linux-6.5.0-26-generic-x86_64-with-glibc2.35
Python:               3.11.4 (main, Jun 26 2023, 15:13:33) [GCC 11.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2023.10.0
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.2
numpy:                1.26.2
openpyxl:             3.1.2
pandas:               2.1.3
pyarrow:              11.0.0
pydantic:             2.5.1
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.23
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

michelbl, Mar 22 '24 16:03