tablesaw icon indicating copy to clipboard operation
tablesaw copied to clipboard

Add rolling window result centering

Open XePeleato opened this issue 3 years ago • 6 comments

Pandas-like center parameter; allows centering the result values instead of having a bunch of NaNs at the beginning of the column.

Thanks for contributing.

Description

It adds the option to center the result of a RollingColumn instead of having all the missing values at the beginning, like pandas does.

Testing

Did you add a unit test? Yes.

XePeleato avatar Mar 17 '21 20:03 XePeleato

so this puts n/a at front and back? I'm curious when you might use this

benmccann avatar Mar 17 '21 21:03 benmccann

(I deleted my old comment in order to write a more detailed one)

I use this centering when working with date(time)s indices so that the label is set to the midpoint of the rolling window, here's an example:

Let's say we have this Table

LocalDate base = LocalDate.of(2020, 01, 01);
DateColumn index = DateColumn.create("index", IntStream.range(0, 92).boxed().map(base::plusDays));
DoubleColumn data = DoubleColumn.create("data", IntStream.range(0, 92).asDoubleStream());

Table table = Table.create(index, data);
   index     |  data  |
-----------------------
 2020-01-01  |     0  |
 2020-01-02  |     1  |
 2020-01-03  |     2  |
 2020-01-04  |     3  |
 2020-01-05  |     4  |
        ...  |   ...  |
 2020-03-28  |    87  |
 2020-03-29  |    88  |
 2020-03-30  |    89  |
 2020-03-31  |    90  |
 2020-04-01  |    91  |

Now let's calculate a 7-Day rolling mean for this table as usual

table.addColumns(data.rolling(7).calc(AggregateFunctions.mean));

   index     |  data  |  data 7-period Mean  |
----------------------------------------------
 2020-01-01  |     0  |                      |
 2020-01-02  |     1  |                      |
 2020-01-03  |     2  |                      |
 2020-01-04  |     3  |                      |
 2020-01-05  |     4  |                      |
 2020-01-06  |     5  |                      |
 2020-01-07  |     6  |                   3  |
 2020-01-08  |     7  |                   4  |
 2020-01-09  |     8  |                   5  |
 2020-01-10  |     9  |                   6  |
        ...  |   ...  |                 ...  |
 2020-03-29  |    88  |                  85  |
 2020-03-30  |    89  |                  86  |
 2020-03-31  |    90  |                  87  |
 2020-04-01  |    91  |                  88  |

As you can see, the mean corresponding to the period 2020-01-01 to 2020-01-07 is set to the last value of the interval (2020-01-07). Now let's do the same centering the value:

   index     |  data  |  data 7-period Mean  |
----------------------------------------------
 2020-01-01  |     0  |                      |
 2020-01-02  |     1  |                      |
 2020-01-03  |     2  |                      |
 2020-01-04  |     3  |                   3  |
 2020-01-05  |     4  |                   4  |
 2020-01-06  |     5  |                   5  |
 2020-01-07  |     6  |                   6  |
        ...  |   ...  |                 ...  |
 2020-03-26  |    85  |                  85  |
 2020-03-27  |    86  |                  86  |
 2020-03-28  |    87  |                  87  |
 2020-03-29  |    88  |                  88  |
 2020-03-30  |    89  |                      |
 2020-03-31  |    90  |                      |
 2020-04-01  |    91  |                      |

Now the same value is set to the central point of the interval, this can be useful when working with dates, specially if this data is going to be subject to interpolation, where the results would be way more accurate this way.

I hope I made it clear enough, I'll be happy to answer any other question, Thank you for your time!

XePeleato avatar Mar 18 '21 00:03 XePeleato

Uhh. This PR looks messed up now. It says 79 files changed

benmccann avatar Apr 08 '21 16:04 benmccann

You're right, I mistakenly pushed other changes to master, it should be fine now.

XePeleato avatar Apr 11 '21 12:04 XePeleato

Can you give a more concrete use case? It's not clear to me why you might want to use this. In the example dataset you shared, you're utilizing data from the future, which makes it useless for prediction, so I'm not sure what you'd use it for. Perhaps there are use cases I'm not aware of though

benmccann avatar Apr 28 '21 22:04 benmccann

I'm personally not a big fan of adding more code complication to handle specific use cases that can already be handled rather simply. In this case you could use lead or lag with windowSize / 2

benmccann avatar May 04 '21 15:05 benmccann