tablesaw
tablesaw copied to clipboard
Add rolling window result centering
Pandas-like center parameter; allows centering the result values instead of having a bunch of NaNs at the beginning of the column.
Thanks for contributing.
- [x] Tick to sign-off your agreement to the Developer Certificate of Origin (DCO) 1.1
Description
It adds the option to center the result of a RollingColumn instead of having all the missing values at the beginning, like pandas does.
Testing
Did you add a unit test? Yes.
so this puts n/a at front and back? I'm curious when you might use this
(I deleted my old comment in order to write a more detailed one)
I use this centering when working with date(time)s indices so that the label is set to the midpoint of the rolling window, here's an example:
Let's say we have this Table
LocalDate base = LocalDate.of(2020, 01, 01);
DateColumn index = DateColumn.create("index", IntStream.range(0, 92).boxed().map(base::plusDays));
DoubleColumn data = DoubleColumn.create("data", IntStream.range(0, 92).asDoubleStream());
Table table = Table.create(index, data);
index | data |
-----------------------
2020-01-01 | 0 |
2020-01-02 | 1 |
2020-01-03 | 2 |
2020-01-04 | 3 |
2020-01-05 | 4 |
... | ... |
2020-03-28 | 87 |
2020-03-29 | 88 |
2020-03-30 | 89 |
2020-03-31 | 90 |
2020-04-01 | 91 |
Now let's calculate a 7-Day rolling mean for this table as usual
table.addColumns(data.rolling(7).calc(AggregateFunctions.mean));
index | data | data 7-period Mean |
----------------------------------------------
2020-01-01 | 0 | |
2020-01-02 | 1 | |
2020-01-03 | 2 | |
2020-01-04 | 3 | |
2020-01-05 | 4 | |
2020-01-06 | 5 | |
2020-01-07 | 6 | 3 |
2020-01-08 | 7 | 4 |
2020-01-09 | 8 | 5 |
2020-01-10 | 9 | 6 |
... | ... | ... |
2020-03-29 | 88 | 85 |
2020-03-30 | 89 | 86 |
2020-03-31 | 90 | 87 |
2020-04-01 | 91 | 88 |
As you can see, the mean corresponding to the period 2020-01-01 to 2020-01-07 is set to the last value of the interval (2020-01-07). Now let's do the same centering the value:
index | data | data 7-period Mean |
----------------------------------------------
2020-01-01 | 0 | |
2020-01-02 | 1 | |
2020-01-03 | 2 | |
2020-01-04 | 3 | 3 |
2020-01-05 | 4 | 4 |
2020-01-06 | 5 | 5 |
2020-01-07 | 6 | 6 |
... | ... | ... |
2020-03-26 | 85 | 85 |
2020-03-27 | 86 | 86 |
2020-03-28 | 87 | 87 |
2020-03-29 | 88 | 88 |
2020-03-30 | 89 | |
2020-03-31 | 90 | |
2020-04-01 | 91 | |
Now the same value is set to the central point of the interval, this can be useful when working with dates, specially if this data is going to be subject to interpolation, where the results would be way more accurate this way.
I hope I made it clear enough, I'll be happy to answer any other question, Thank you for your time!
Uhh. This PR looks messed up now. It says 79 files changed
You're right, I mistakenly pushed other changes to master, it should be fine now.
Can you give a more concrete use case? It's not clear to me why you might want to use this. In the example dataset you shared, you're utilizing data from the future, which makes it useless for prediction, so I'm not sure what you'd use it for. Perhaps there are use cases I'm not aware of though
I'm personally not a big fan of adding more code complication to handle specific use cases that can already be handled rather simply. In this case you could use lead
or lag
with windowSize / 2