
dynamic_groupby descending/backwards in time

Dermotholmes opened this issue 2 years ago • 11 comments

Problem description

  • [x] I have checked that a similar issue does not exist.

Problem I'm looking to solve

I have some timeseries data that I'm looking to group into time-based intervals (in this example: 3 minutes); however, I want to create these groups working backwards in time.

For example, I want the groups to originate from the most recent row of data. This means a group would treat the last datapoint as its upper bound and include all previous rows within 3 minutes of it, looking back in time. The next group continues from there, another 3 minutes back in time, and so on.

My use case treats recent data as more important, so I want to ensure that any orphaned data (see items A and J in the examples below) ends up at the beginning (chronologically speaking) rather than at the end.

Current thinking

My current thinking is that I want groupby_dynamic, but in reverse/descending order.

My understanding is that groupby_dynamic only sets windows going forwards.

Example

Example dataframe:

┌─────────────────────┬───────┐
│ date                ┆ alpha │
│ ---                 ┆ ---   │
│ datetime[μs]        ┆ str   │
╞═════════════════════╪═══════╡
│ 2022-01-01 11:00:00 ┆ A     │
│ 2022-01-01 11:01:00 ┆ B     │
│ 2022-01-01 11:02:00 ┆ C     │
│ 2022-01-01 11:03:00 ┆ D     │
│ 2022-01-01 11:04:00 ┆ E     │
│ 2022-01-01 11:05:00 ┆ F     │
│ 2022-01-01 11:06:00 ┆ G     │
│ 2022-01-01 11:07:00 ┆ H     │
│ 2022-01-01 11:08:00 ┆ I     │
│ 2022-01-01 11:09:01 ┆ J     │
└─────────────────────┴───────┘
  • Note the seconds difference on the last row.

Currently applying groupby_dynamic to this produces:

df.groupby_dynamic('date', every='3m', closed='left', include_boundaries=True).agg([
    pl.all().exclude('date')
])
┌─────────────────────┬─────────────────────┬─────────────────────┬─────────────────┐
│ _lower_boundary     ┆ _upper_boundary     ┆ date                ┆ alpha           │
│ ---                 ┆ ---                 ┆ ---                 ┆ ---             │
│ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ list[str]       │
╞═════════════════════╪═════════════════════╪═════════════════════╪═════════════════╡
│ 2022-01-01 11:00:00 ┆ 2022-01-01 11:03:00 ┆ 2022-01-01 11:00:00 ┆ ["A", "B", "C"] │
│ 2022-01-01 11:03:00 ┆ 2022-01-01 11:06:00 ┆ 2022-01-01 11:03:00 ┆ ["D", "E", "F"] │
│ 2022-01-01 11:06:00 ┆ 2022-01-01 11:09:00 ┆ 2022-01-01 11:06:00 ┆ ["G", "H", "I"] │
│ 2022-01-01 11:09:00 ┆ 2022-01-01 11:12:00 ┆ 2022-01-01 11:09:00 ┆ ["J"]           │
└─────────────────────┴─────────────────────┴─────────────────────┴─────────────────┘

However my desired result is:

┌─────────────────────┬─────────────────────┬─────────────────────┬─────────────────┐
│ _lower_boundary     ┆ _upper_boundary     ┆ date                ┆ alpha           │
│ ---                 ┆ ---                 ┆ ---                 ┆ ---             │
│ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ list[str]       │
╞═════════════════════╪═════════════════════╪═════════════════════╪═════════════════╡
│ 2022-01-01 10:57:01 ┆ 2022-01-01 11:00:01 ┆ 2022-01-01 10:57:01 ┆ ["A"]           │
│ 2022-01-01 11:00:01 ┆ 2022-01-01 11:03:01 ┆ 2022-01-01 11:00:01 ┆ ["B", "C", "D"] │
│ 2022-01-01 11:03:01 ┆ 2022-01-01 11:06:01 ┆ 2022-01-01 11:03:01 ┆ ["E", "F", "G"] │
│ 2022-01-01 11:06:01 ┆ 2022-01-01 11:09:01 ┆ 2022-01-01 11:06:01 ┆ ["H", "I", "J"] │
└─────────────────────┴─────────────────────┴─────────────────────┴─────────────────┘

Note that in the desired result I have assumed closed='right'.

Key points

The key difference is that A is the item that gets orphaned into its own group when spilling over the cutoff, and the most recent items (H, I, J) within the interval always land in the last group.

Additional thoughts

  • I had thought about a workaround where I would calculate the appropriate start time based on the last datapoint, but it doesn't seem possible to supply an arbitrary start time to groupby_rolling as far as I can see. I'm guessing I can manage this with the offset option, but that also requires checking the first datapoint and then calculating an appropriate offset, so it adds extra steps I would ideally like to avoid, and it's not as declarative.
  • I also intend to use the by= parameter of groupby_dynamic in the real world, so writing a helper function to do this myself is not obvious: I don't see how it can be done, since I'd be relying on the inner workings of the native groupby_dynamic function.
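For what it's worth, the offset calculation described in the first bullet only needs a few lines of stdlib datetime arithmetic. A sketch under my own assumptions: `right_aligned_offset` is a hypothetical helper name, and the truncation below mimics `start_by='window'` only for `every` durations that divide a day evenly:

```python
from datetime import datetime, timedelta

def right_aligned_offset(first: datetime, last: datetime, every: timedelta) -> timedelta:
    """Negative offset that shifts the window grid so a boundary lands on `last`.

    Hypothetical helper; the truncation mimics start_by='window' only for
    `every` durations that divide a day evenly.
    """
    # start_by='window' snaps the first window start down to a multiple of `every`
    window_start = datetime.min + ((first - datetime.min) // every) * every
    # shift the whole grid left so that `last` falls exactly on a window boundary
    return ((last - window_start) % every) - every

# Example from this issue: every='3m', first row at 11:00:00, last row at 11:09:01
offset = right_aligned_offset(
    datetime(2022, 1, 1, 11, 0, 0),
    datetime(2022, 1, 1, 11, 9, 1),
    timedelta(minutes=3),
)
# offset == timedelta(seconds=-179), i.e. '-2m59s'
```

If I recall correctly, the offset argument of groupby_dynamic accepts a timedelta as well as a duration string, so the result can be passed straight through.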

Dermotholmes avatar Feb 04 '23 07:02 Dermotholmes

Not related to your feature request - I'm just wondering if this type of problem can be solved with .join_asof()?

start = df.get_column("date").min()
end   = df.get_column("date").max()

start = start.replace(second=end.second)

intervals = (
   pl.date_range(start, end, "3m")
     .to_frame("date")
     .with_row_count()
)

(df.join_asof(intervals, on="date", strategy="forward")
   .groupby("row_nr")
   .agg_list())
shape: (4, 3)
┌────────┬─────────────────────────────────────┬─────────────────┐
│ row_nr ┆ date                                ┆ alpha           │
│ ---    ┆ ---                                 ┆ ---             │
│ u32    ┆ list[datetime[μs]]                  ┆ list[str]       │
╞════════╪═════════════════════════════════════╪═════════════════╡
│ 2      ┆ [2022-01-01 11:04:00, 2022-01-01... ┆ ["E", "F", "G"] │
│ 3      ┆ [2022-01-01 11:07:00, 2022-01-01... ┆ ["H", "I", "J"] │
│ 1      ┆ [2022-01-01 11:01:00, 2022-01-01... ┆ ["B", "C", "D"] │
│ 0      ┆ [2022-01-01 11:00:00]               ┆ ["A"]           │
└────────┴─────────────────────────────────────┴─────────────────┘

Just a note: if you also provide code to generate your examples, it makes things much easier. One possible approach:

import io
import polars as pl

csv = """
date,alpha
2022-01-01T11:00:00.000000,A
2022-01-01T11:01:00.000000,B
2022-01-01T11:02:00.000000,C
2022-01-01T11:03:00.000000,D
2022-01-01T11:04:00.000000,E
2022-01-01T11:05:00.000000,F
2022-01-01T11:06:00.000000,G
2022-01-01T11:07:00.000000,H
2022-01-01T11:08:00.000000,I
2022-01-01T11:09:01.000000,J
"""

df = pl.read_csv(io.StringIO(csv), parse_dates=True)

cmdlineluser avatar Feb 04 '23 11:02 cmdlineluser

@Dermotholmes I also think so. Polars expressions like rolling_apply/rolling_mean all take a window looking backward, so it's odd that groupby_rolling/groupby_dynamic take a window looking forward.

drivenow avatar Feb 15 '23 15:02 drivenow

Thanks for your comment @drivenow. To avoid any confusion, I thought I'd visualise what I'm after, since the "forwards/backwards" language may be misleading.

The real difference in this request is about how the dynamic windows are positioned. Right now we can position the windows based on the first datapoint in the time series. However, I want to position the dynamic windows based on the last datapoint, ideally in a declarative way similar to how it's done today with the start_by='datapoint' argument.

[Diagram: windows anchored to the last datapoint instead of the first.] Note that by positioning the dynamic windows based on the last datapoint, individual windows capture different datapoints in a way that is more desirable for my use case (i.e. it optimises for always capturing the most recent datapoints within a given window).

Dermotholmes avatar Feb 15 '23 20:02 Dermotholmes

@Dermotholmes But it doesn't actually work... the window is still positioned by the first datapoint.

Using start_by='datapoint':

import io
import polars as pl

csv = """
date,group,value
2022-01-01,A,1
2022-01-01,B,1
2022-01-02,A,2
2022-01-02,B,2
2022-01-03,A,3
2022-01-03,B,3
2022-01-04,A,4
2022-01-04,B,4
2022-01-05,A,5
2022-01-05,B,5
"""
df = pl.read_csv(io.StringIO(csv), parse_dates=True)

df.groupby_dynamic(index_column="date", every="1d", period="3d",
    offset="0d", closed="left", by="group", start_by="datapoint").agg([
        pl.col("value"), pl.col("date").min().alias("date_min"), pl.col("date").max().alias("date_max")])

the output is:

shape: (10, 5)
┌───────┬────────────┬───────────┬────────────┬────────────┐
│ group ┆ date       ┆ value     ┆ date_min   ┆ date_max   │
│ ---   ┆ ---        ┆ ---       ┆ ---        ┆ ---        │
│ str   ┆ date       ┆ list[i64] ┆ date       ┆ date       │
╞═══════╪════════════╪═══════════╪════════════╪════════════╡
│ A     ┆ 2022-01-01 ┆ [1, 2, 3] ┆ 2022-01-01 ┆ 2022-01-03 │
│ A     ┆ 2022-01-02 ┆ [2, 3, 4] ┆ 2022-01-02 ┆ 2022-01-04 │
│ A     ┆ 2022-01-03 ┆ [3, 4, 5] ┆ 2022-01-03 ┆ 2022-01-05 │
│ A     ┆ 2022-01-04 ┆ [4, 5]    ┆ 2022-01-04 ┆ 2022-01-05 │
│ ...   ┆ ...        ┆ ...       ┆ ...        ┆ ...        │
│ B     ┆ 2022-01-02 ┆ [2, 3, 4] ┆ 2022-01-02 ┆ 2022-01-04 │
│ B     ┆ 2022-01-03 ┆ [3, 4, 5] ┆ 2022-01-03 ┆ 2022-01-05 │
│ B     ┆ 2022-01-04 ┆ [4, 5]    ┆ 2022-01-04 ┆ 2022-01-05 │
│ B     ┆ 2022-01-05 ┆ [5]       ┆ 2022-01-05 ┆ 2022-01-05 │
└───────┴────────────┴───────────┴────────────┴────────────┘

Using start_by='window':

df.groupby_dynamic(index_column="date", every="1d", period="3d",
    offset="0d", closed="left", by="group", start_by="window").agg([
        pl.col("value"), pl.col("date").min().alias("date_min"), pl.col("date").max().alias("date_max")])

the output is:

shape: (10, 5)
┌───────┬────────────┬───────────┬────────────┬────────────┐
│ group ┆ date       ┆ value     ┆ date_min   ┆ date_max   │
│ ---   ┆ ---        ┆ ---       ┆ ---        ┆ ---        │
│ str   ┆ date       ┆ list[i64] ┆ date       ┆ date       │
╞═══════╪════════════╪═══════════╪════════════╪════════════╡
│ A     ┆ 2022-01-01 ┆ [1, 2, 3] ┆ 2022-01-01 ┆ 2022-01-03 │
│ A     ┆ 2022-01-02 ┆ [2, 3, 4] ┆ 2022-01-02 ┆ 2022-01-04 │
│ A     ┆ 2022-01-03 ┆ [3, 4, 5] ┆ 2022-01-03 ┆ 2022-01-05 │
│ A     ┆ 2022-01-04 ┆ [4, 5]    ┆ 2022-01-04 ┆ 2022-01-05 │
│ A     ┆ 2022-01-05 ┆ [5]       ┆ 2022-01-05 ┆ 2022-01-05 │
│ ...   ┆ ...        ┆ ...       ┆ ...        ┆ ...        │
│ B     ┆ 2022-01-02 ┆ [2, 3, 4] ┆ 2022-01-02 ┆ 2022-01-04 │
│ B     ┆ 2022-01-03 ┆ [3, 4, 5] ┆ 2022-01-03 ┆ 2022-01-05 │
│ B     ┆ 2022-01-04 ┆ [4, 5]    ┆ 2022-01-04 ┆ 2022-01-05 │
│ B     ┆ 2022-01-05 ┆ [5]       ┆ 2022-01-05 ┆ 2022-01-05 │
└───────┴────────────┴───────────┴────────────┴────────────┘

I want groupby_dynamic to return the following result. How can I do that?

shape: (10, 5)
┌───────┬────────────┬───────────┬────────────┬────────────┐
│ group ┆ date       ┆ value     ┆ date_min   ┆ date_max   │
│ ---   ┆ ---        ┆ ---       ┆ ---        ┆ ---        │
│ str   ┆ date       ┆ list[i64] ┆ date       ┆ date       │
╞═══════╪════════════╪═══════════╪════════════╪════════════╡
│ A     ┆ 2022-01-01 ┆ [1]       ┆ 2022-01-01 ┆ 2022-01-01 │
│ A     ┆ 2022-01-02 ┆ [1, 2]    ┆ 2022-01-01 ┆ 2022-01-02 │
│ A     ┆ 2022-01-03 ┆ [1, 2, 3] ┆ 2022-01-01 ┆ 2022-01-03 │
│ A     ┆ 2022-01-04 ┆ [2, 3, 4] ┆ 2022-01-02 ┆ 2022-01-04 │
│ A     ┆ 2022-01-05 ┆ [3, 4, 5] ┆ 2022-01-03 ┆ 2022-01-05 │
│ ...   ┆ ...        ┆ ...       ┆ ...        ┆ ...        │
│ B     ┆ 2022-01-02 ┆ [1, 2]    ┆ 2022-01-01 ┆ 2022-01-02 │
│ B     ┆ 2022-01-03 ┆ [1, 2, 3] ┆ 2022-01-01 ┆ 2022-01-03 │
│ B     ┆ 2022-01-04 ┆ [2, 3, 4] ┆ 2022-01-02 ┆ 2022-01-04 │
│ B     ┆ 2022-01-05 ┆ [3, 4, 5] ┆ 2022-01-03 ┆ 2022-01-05 │
└───────┴────────────┴───────────┴────────────┴────────────┘

drivenow avatar Feb 17 '23 13:02 drivenow

@drivenow You can use offset="-2d", but you'll also get [4, 5] and [5].

cmdlineluser avatar Feb 19 '23 11:02 cmdlineluser

@cmdlineluser I did that, but the date column is not what I want: I want 2022-01-03 labelled with value [1, 2, 3], not 2022-01-01.

(df.groupby_dynamic(index_column="date", every="1d", period="3d",
                    offset="-2d", closed="left", by="group", start_by="window")
   .agg([pl.col("value"), pl.col("date").min().alias("date_min"),
         pl.col("date").max().alias("date_max")]))
┌───────┬────────────┬───────────┬────────────┬────────────┐
│ group ┆ date       ┆ value     ┆ date_min   ┆ date_max   │
│ ---   ┆ ---        ┆ ---       ┆ ---        ┆ ---        │
│ str   ┆ date       ┆ list[i64] ┆ date       ┆ date       │
╞═══════╪════════════╪═══════════╪════════════╪════════════╡
│ A     ┆ 2021-12-30 ┆ [1]       ┆ 2022-01-01 ┆ 2022-01-01 │
│ A     ┆ 2021-12-31 ┆ [1, 2]    ┆ 2022-01-01 ┆ 2022-01-02 │
│ A     ┆ 2022-01-01 ┆ [1, 2, 3] ┆ 2022-01-01 ┆ 2022-01-03 │
│ A     ┆ 2022-01-02 ┆ [2, 3, 4] ┆ 2022-01-02 ┆ 2022-01-04 │
│ ...   ┆ ...        ┆ ...       ┆ ...        ┆ ...        │
│ B     ┆ 2022-01-02 ┆ [2, 3, 4] ┆ 2022-01-02 ┆ 2022-01-04 │
│ B     ┆ 2022-01-03 ┆ [3, 4, 5] ┆ 2022-01-03 ┆ 2022-01-05 │
│ B     ┆ 2022-01-04 ┆ [4, 5]    ┆ 2022-01-04 ┆ 2022-01-05 │
│ B     ┆ 2022-01-05 ┆ [5]       ┆ 2022-01-05 ┆ 2022-01-05 │
└───────┴────────────┴───────────┴────────────┴────────────┘

I think we could provide an origin parameter, like pandas resample has, to choose which timestamp labels the group:

>>> index = pd.date_range('1/1/2000', periods=9, freq='T')
>>> series = pd.Series(range(9), index=index)
>>> series
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64
>>> series.resample('3T', origin = 'start').sum()
2000-01-01 00:00:00     3
2000-01-01 00:03:00    12
2000-01-01 00:06:00    21
Freq: 3T, dtype: int64
>>> series.resample('3T', origin = 'end').sum()
2000-01-01 00:02:00     3
2000-01-01 00:05:00    12
2000-01-01 00:08:00    21
Freq: 3T, dtype: int64
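To make the proposed origin='end' semantics concrete, here is a small pure-Python sketch of end-anchored binning. `bin_end_aligned` is a hypothetical name (not an existing polars or pandas API); the bins are right-closed, matching pandas' default when origin='end':

```python
from datetime import datetime, timedelta

def bin_end_aligned(times, every):
    """Group timestamps into right-closed bins (label - every, label]
    anchored at the last timestamp, i.e. pandas' origin='end' behaviour.
    Hypothetical helper, not an existing library API.
    """
    end = max(times)
    bins = {}
    for t in times:
        # count whole intervals between t and the end, then step back from the end
        label = end - ((end - t) // every) * every
        bins.setdefault(label, []).append(t)
    return bins

# The nine-minute series from the pandas example above:
times = [datetime(2000, 1, 1, 0, minute) for minute in range(9)]
bins = bin_end_aligned(times, timedelta(minutes=3))
# labels are 00:02, 00:05 and 00:08, each holding three timestamps,
# matching the origin='end' output above
```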

drivenow avatar Feb 27 '23 09:02 drivenow

Hey @Dermotholmes

you can get your desired output with offset='-2m59s':

In [15]: df = pl.from_repr("""\
    ...: ┌─────────────────────┬───────┐
    ...: │ date                ┆ alpha │
    ...: │ ---                 ┆ ---   │
    ...: │ datetime[μs]        ┆ str   │
    ...: ╞═════════════════════╪═══════╡
    ...: │ 2022-01-01 11:00:00 ┆ A     │
    ...: │ 2022-01-01 11:01:00 ┆ B     │
    ...: │ 2022-01-01 11:02:00 ┆ C     │
    ...: │ 2022-01-01 11:03:00 ┆ D     │
    ...: │ 2022-01-01 11:04:00 ┆ E     │
    ...: │ 2022-01-01 11:05:00 ┆ F     │
    ...: │ 2022-01-01 11:06:00 ┆ G     │
    ...: │ 2022-01-01 11:07:00 ┆ H     │
    ...: │ 2022-01-01 11:08:00 ┆ I     │
    ...: │ 2022-01-01 11:09:01 ┆ J     │
    ...: └─────────────────────┴───────┘
    ...: """).with_columns(pl.col('date').set_sorted())

In [16]: df.groupby_dynamic('date', every='3m', closed='right', include_boundaries=True, offset='-2m59s').agg(
    ...:     pl.all().exclude('date')
    ...: )
Out[16]:
shape: (4, 4)
┌─────────────────────┬─────────────────────┬─────────────────────┬─────────────────┐
│ _lower_boundary     ┆ _upper_boundary     ┆ date                ┆ alpha           │
│ ---                 ┆ ---                 ┆ ---                 ┆ ---             │
│ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ list[str]       │
╞═════════════════════╪═════════════════════╪═════════════════════╪═════════════════╡
│ 2022-01-01 10:57:01 ┆ 2022-01-01 11:00:01 ┆ 2022-01-01 10:57:01 ┆ ["A"]           │
│ 2022-01-01 11:00:01 ┆ 2022-01-01 11:03:01 ┆ 2022-01-01 11:00:01 ┆ ["B", "C", "D"] │
│ 2022-01-01 11:03:01 ┆ 2022-01-01 11:06:01 ┆ 2022-01-01 11:03:01 ┆ ["E", "F", "G"] │
│ 2022-01-01 11:06:01 ┆ 2022-01-01 11:09:01 ┆ 2022-01-01 11:06:01 ┆ ["H", "I", "J"] │
└─────────────────────┴─────────────────────┴─────────────────────┴─────────────────┘

MarcoGorelli avatar Jun 20 '23 10:06 MarcoGorelli

I think this matches your expected output exactly - not sure much else needs doing here, so closing for now - but thanks for the report!

MarcoGorelli avatar Jun 25 '23 18:06 MarcoGorelli

I also think it would be a useful feature to have built in. Often you want to include the most recent data - with the full window size - but don't care so much about truncated or partial windows at the start of the dataset.

I am currently using the following wrapper function, which achieves "right alignment" via the offset trick, but it's a bit convoluted. It would be nice to just have an anchor or align parameter to say whether windows are anchored to the first or last datapoint.

I would also like a parameter to exclude groups that contain only subsets of data in other groups. In the normal group_by_dynamic these are (potentially) the last few windows. I've added that to the function below too.

from datetime import timedelta
from typing import Iterable

import polars as pl
from polars import DataFrame
from polars.type_aliases import IntoExpr

# Note: str_to_col is a small helper defined further down in this thread.

def group_by_dynamic_right_aligned(
        df: DataFrame,
        index_column: IntoExpr,
        *,
        every: str | timedelta,
        period: str | timedelta | None = None,
        include_boundaries: bool = False,
        by: IntoExpr | Iterable[IntoExpr] | None = None,
        check_sorted: bool = True,
        include_windows_ending_after_last_index: bool = False
):
    """
    Wrapper for polars group_by_dynamic that aligns the windows such that the last window ends on the last date/datetime in the data.
    Consequently, the first window may have a shorter date range than the others.
    Windows are labelled by their right (end) date (inclusive).
    Set include_windows_ending_after_last_index=True to include windows that extend beyond the last date and therefore contain only subsets of the last full window.
    Refer to group_by_dynamic for other parameters.

    Example:

     .. code-block:: python
        df = pl.DataFrame(pl.date_range(date(2023,1,1), date(2023,1,12), interval='1d', eager=True))
        print(df)

        group_by_dynamic_right_aligned(df, 'date', every='3d', period='5d').agg(
            start=pl.col.date.min(),
            end=pl.col.date.max(),
            n=pl.col.date.count(),
        )

    returns:

    ```
    ┌────────────┬────────────┬────────────┬─────┐
    │ date       ┆ start      ┆ end        ┆ n   │
    │ ---        ┆ ---        ┆ ---        ┆ --- │
    │ date       ┆ date       ┆ date       ┆ u32 │
    ╞════════════╪════════════╪════════════╪═════╡
    │ 2023-01-03 ┆ 2023-01-01 ┆ 2023-01-03 ┆ 3   │
    │ 2023-01-06 ┆ 2023-01-02 ┆ 2023-01-06 ┆ 5   │
    │ 2023-01-09 ┆ 2023-01-05 ┆ 2023-01-09 ┆ 5   │
    │ 2023-01-12 ┆ 2023-01-08 ┆ 2023-01-12 ┆ 5   │
    └────────────┴────────────┴────────────┴─────┘
    ```
    """
    # First pass at the groups, with no offset
    labels_series = (df
      .group_by_dynamic(index_column, every=every, period=period, include_boundaries=include_boundaries, by=by, check_sorted=check_sorted, closed='right', label='right', start_by='window')
      .agg()
      .select(index_column)
      .to_series(0)
    )

    max_date = df.select(index_column).to_series(0).max()
    end_of_first_window_extending_beyond_data = labels_series.filter(labels_series >= max_date)[0]
    # The negative offset to shift windows by such that the last window ends exactly on max_date
    offset = max_date - end_of_first_window_extending_beyond_data

    # Redo the group_by_dynamic with the offset, so we get a window ending exactly on max_date
    groups = df.group_by_dynamic(index_column, every=every, period=period, by=by, closed='right', label='right', start_by='window', offset=offset)

    if include_windows_ending_after_last_index:
        return groups
    
    # Monkey patch the agg function to filter out the groups that extend beyond the last date, which contain only subsets of the last full window
    def wrapped_agg(
        self,
        *aggs: IntoExpr | Iterable[IntoExpr],
        **named_aggs: IntoExpr,
    ) -> DataFrame:
        return groups.__class__.agg(self, *aggs, **named_aggs).filter(str_to_col(index_column) <= max_date)
    
    groups.agg = wrapped_agg.__get__(groups, groups.__class__)
    return groups

sebwills avatar Nov 24 '23 12:11 sebwills

return groups.__class__.agg(self, *aggs, **named_aggs).filter(str_to_col(index_column) <= max_date)

Thanks for sharing an implementation of this feature! ~~But what's the str_to_col function?~~ Edit: change str_to_col to pl.col

marcmk6 avatar Jan 07 '24 16:01 marcmk6

Oh, right, sorry. I have a str_to_col which (from memory) is something like

def str_to_col(column):
    return pl.col(column) if isinstance(column, str) else column

So the caller can either pass in column names, or Expr's. (Is there a way of handling that in polars already?)

sebwills avatar Jan 07 '24 16:01 sebwills

I vote for this enhancement 👍

mutecamel avatar Feb 18 '24 05:02 mutecamel

This would be an excellent feature to have.

I can do everything I want to as a LazyFrame, except group_by(..., order="reverse"), so I have to call .collect().

Right now my code looks like:

        clean_df = clean_df.sort(by="Date").collect()

        dfs = []
        for name, data in clean_df.group_by_dynamic(index_column="Date", every="1mo"):
            dfs.insert(0, data)

        for df in dfs:
            work_done = self.transform_month(df, symbol_data)
            if work_done:
                return work_done

        return False

In my case, working in reverse helps because oftentimes you only need to update the most recent data. This is why the short-circuit on work_done exists.

Thankfully the data I'm processing is small enough I can afford to .collect() so it's not a huge issue on my end.
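As an aside, the insert-at-front loop above can also be expressed with reversed() over the materialised groups. A generic stdlib sketch: `newest_first` is my own name, and `grouped` stands for anything that yields (name, data) pairs in chronological order, such as the group_by_dynamic iterator:

```python
def newest_first(grouped):
    """Return (name, data) pairs newest-first from a chronologically ordered grouping."""
    # group_by_dynamic yields windows in ascending time order,
    # so reversing the materialised list puts the most recent window first
    return list(reversed(list(grouped)))

# Works with any iterable of (name, data) pairs:
groups = [("2024-01", "jan"), ("2024-02", "feb"), ("2024-03", "mar")]
# newest_first(groups) -> [("2024-03", "mar"), ("2024-02", "feb"), ("2024-01", "jan")]
```

This still requires materialising the groups (so it doesn't remove the .collect()), but it avoids the quadratic cost of repeated dfs.insert(0, ...).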

ddouglas87 avatar Jul 31 '24 10:07 ddouglas87