tsibble
tsibble copied to clipboard
Experiment with roll_by family
slide_by(), stretch_by(), tile_by() which integrates with dplyr's mutate(), summarise(), etc.
While the data.frame method for these functions should operate on row indices, it is possible for the tbl_ts method to also support time-aware inputs such as durations (#130).
The main issue I’m concerning is the mable object once modeling the slided tsibble, model can’t be lazily evaluated.
I talked to Hadley about this. He thinks it's a potentially good approach, more like a lazily evaluated grouped data frame. He also mentioned the tidymodels bootstrapping approach https://davisvaughan.github.io/strapgod/articles/dplyr-support.html
That sounds promising.
There will need to be more work than I had hoped to get mutate(), filter(), etc. to behave as expected - so a new class with dplyr methods will be required.
While I think this functionality is useful, I want to wait to see more developments with slurrr (#143) before working on this further.
The original motivating issue with the window "id" being part of the key structure has now been resolved in fabletools by supporting disjoint hierarchies.
Further, if slurrr is the package location going further, this functionality should belong in that package. Maybe @DavisVaughan can consider it's implementation/use, and tsibble can later add a tbl_ts method that supports time aware sliding inputs.
oh, i think slide_by should stay in tsibble, no conflicts with slurrr.
On Thu, 15 Aug 2019 at 16:31, mitchelloharawild [email protected] wrote:
While I think this functionality is useful, I want to wait to see more developments with slurrr (#143 https://github.com/tidyverts/tsibble/issues/143) before working on this further.
The original motivating issue with the window "id" being part of the key structure has now been resolved in fabletools by supporting disjoint hierarchies.
Further, if slurrr is the package location going further, this functionality should belong in that package. Maybe @DavisVaughan https://github.com/DavisVaughan can consider it's implementation/use, and tsibble can later add a tbl_ts method that supports time aware sliding inputs.
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub https://github.com/tidyverts/tsibble/issues/137?email_source=notifications&email_token=AASXVGE6RBGVGM7EEBPUILTQETZ2NA5CNFSM4IHAT4LKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4K67YA#issuecomment-521531360, or mute the thread https://github.com/notifications/unsubscribe-auth/AASXVGFXDY7VRWQAOYWLZKLQETZ2NANCNFSM4IHAT4LA .
Oooh I had an idea for this too yesterday. It does use the lazy grouping feature of dplyr, but would require a new data frame subclass.
I'm not quite convinced that it would be worth it over a version of slide() that has an .index argument, which I mention in #143. They would essentially compute the same thing, but the size of the slide() result would be vec_size(.x) and the size of the slide_by result would be vec_size(unique(.index)).
I think it would be pretty hard (not impossible) to get another df subclass that would be able to 1) add and remove virtual groups correctly 2) be able to subclass on top of tsibble or a data frame, and know how to drop back to a pure tsibble after some kind of summary operation where the virtual groups are materialized. I think its much "easier" to let slide() be unaware of dplyr, but still do what you'd want here.
This is just a partial example of how it might look.
> pedestrian_small
# A tsibble: 240 x 6 [1h] <Australia/Melbourne>
# Key: Sensor [1]
Sensor Date_Time Date Time Count ym
<chr> <dttm> <date> <int> <int> <mth>
1 Birrarung Marr 2015-01-01 00:00:00 2015-01-01 0 1630 2015 Jan
2 Birrarung Marr 2015-01-01 01:00:00 2015-01-01 1 826 2015 Jan
3 Birrarung Marr 2015-01-01 02:00:00 2015-01-01 2 567 2015 Jan
4 Birrarung Marr 2015-01-01 03:00:00 2015-01-01 3 264 2015 Jan
5 Birrarung Marr 2015-01-01 04:00:00 2015-01-01 4 139 2015 Jan
6 Birrarung Marr 2015-01-01 05:00:00 2015-01-01 5 77 2015 Jan
7 Birrarung Marr 2015-01-01 06:00:00 2015-01-01 6 44 2015 Jan
8 Birrarung Marr 2015-01-01 07:00:00 2015-01-01 7 56 2015 Jan
9 Birrarung Marr 2015-01-01 08:00:00 2015-01-01 8 113 2015 Jan
10 Birrarung Marr 2015-01-01 09:00:00 2015-01-01 9 166 2015 Jan
# … with 230 more rows
>
> unique(pedestrian_small$ym)
[1] "2015 Jan" "2015 Feb" "2015 Mar" "2015 Apr" "2015 May" "2015 Jun" "2015 Jul" "2015 Aug" "2015 Sep" "2015 Oct"
[11] "2015 Nov" "2015 Dec" "2016 Jan" "2016 Feb" "2016 Mar" "2016 Apr" "2016 May" "2016 Jun" "2016 Jul" "2016 Aug"
[21] "2016 Sep" "2016 Oct" "2016 Nov" "2016 Dec"
>
> pedestrian_small_slide_grouped <- slide_by(pedestrian_small, ym, before = 1, after = 1, step = 1)
> pedestrian_small_slide_grouped
# A tibble: 240 x 6
# Groups: .start, .stop [24]
Sensor Date_Time Date Time Count ym
<chr> <dttm> <date> <int> <int> <mth>
1 Birrarung Marr 2015-01-01 00:00:00 2015-01-01 0 1630 2015 Jan
2 Birrarung Marr 2015-01-01 01:00:00 2015-01-01 1 826 2015 Jan
3 Birrarung Marr 2015-01-01 02:00:00 2015-01-01 2 567 2015 Jan
4 Birrarung Marr 2015-01-01 03:00:00 2015-01-01 3 264 2015 Jan
5 Birrarung Marr 2015-01-01 04:00:00 2015-01-01 4 139 2015 Jan
6 Birrarung Marr 2015-01-01 05:00:00 2015-01-01 5 77 2015 Jan
7 Birrarung Marr 2015-01-01 06:00:00 2015-01-01 6 44 2015 Jan
8 Birrarung Marr 2015-01-01 07:00:00 2015-01-01 7 56 2015 Jan
9 Birrarung Marr 2015-01-01 08:00:00 2015-01-01 8 113 2015 Jan
10 Birrarung Marr 2015-01-01 09:00:00 2015-01-01 9 166 2015 Jan
# … with 230 more rows
>
> pedestrian_small_slide_grouped %>%
+ summarise(
+ ym_start = first(ym),
+ ym_end = last(ym),
+ MonthlyCount = mean(Count)
+ )
# A tibble: 24 x 5
# Groups: .start [23]
.start .stop ym_start ym_end MonthlyCount
<date> <date> <mth> <mth> <dbl>
1 2015-01-01 2015-02-01 2015 Jan 2015 Feb 227.
2 2015-01-01 2015-03-01 2015 Jan 2015 Mar 180.
3 2015-02-01 2015-04-01 2015 Feb 2015 Apr 112.
4 2015-03-01 2015-05-01 2015 Mar 2015 May 147.
5 2015-04-01 2015-06-01 2015 Apr 2015 Jun 165.
6 2015-05-01 2015-07-01 2015 May 2015 Jul 160.
7 2015-06-01 2015-08-01 2015 Jun 2015 Aug 120.
8 2015-07-01 2015-09-01 2015 Jul 2015 Sep 130.
9 2015-08-01 2015-10-01 2015 Aug 2015 Oct 141
10 2015-09-01 2015-11-01 2015 Sep 2015 Nov 168.
# … with 14 more rows
>
> pedestrian_small_slide_along <- mutate(
+ pedestrian_small,
+ col = slide_along_impl(Count, ~mean(.x), .index = ym, .ptype = dbl(), .before = 1, .after = 1)
+ )
>
> pedestrian_small_slide_along
# A tsibble: 240 x 7 [1h] <Australia/Melbourne>
# Key: Sensor [1]
Sensor Date_Time Date Time Count ym col
<chr> <dttm> <date> <int> <int> <mth> <dbl>
1 Birrarung Marr 2015-01-01 00:00:00 2015-01-01 0 1630 2015 Jan 227.
2 Birrarung Marr 2015-01-01 01:00:00 2015-01-01 1 826 2015 Jan 227.
3 Birrarung Marr 2015-01-01 02:00:00 2015-01-01 2 567 2015 Jan 227.
4 Birrarung Marr 2015-01-01 03:00:00 2015-01-01 3 264 2015 Jan 227.
5 Birrarung Marr 2015-01-01 04:00:00 2015-01-01 4 139 2015 Jan 227.
6 Birrarung Marr 2015-01-01 05:00:00 2015-01-01 5 77 2015 Jan 227.
7 Birrarung Marr 2015-01-01 06:00:00 2015-01-01 6 44 2015 Jan 227.
8 Birrarung Marr 2015-01-01 07:00:00 2015-01-01 7 56 2015 Jan 227.
9 Birrarung Marr 2015-01-01 08:00:00 2015-01-01 8 113 2015 Jan 227.
10 Birrarung Marr 2015-01-01 09:00:00 2015-01-01 9 166 2015 Jan 227.
# … with 230 more rows
>
> # group by ym and slice(1) to get it down to the right number of rows
> pedestrian_small_slide_along %>%
+ group_by(ym) %>%
+ slice(1)
# A tsibble: 24 x 7 [1h] <Australia/Melbourne>
# Key: Sensor [1]
# Groups: ym [24]
Sensor Date_Time Date Time Count ym col
<chr> <dttm> <date> <int> <int> <mth> <dbl>
1 Birrarung Marr 2015-01-01 00:00:00 2015-01-01 0 1630 2015 Jan 227.
2 Birrarung Marr 2015-02-01 00:00:00 2015-02-01 0 178 2015 Feb 180.
3 Birrarung Marr 2015-03-01 00:00:00 2015-03-01 0 69 2015 Mar 112.
4 Birrarung Marr 2015-04-01 00:00:00 2015-04-01 0 25 2015 Apr 147.
5 Birrarung Marr 2015-05-01 00:00:00 2015-05-01 0 13 2015 May 165.
6 Birrarung Marr 2015-06-01 00:00:00 2015-06-01 0 9 2015 Jun 160.
7 Birrarung Marr 2015-07-01 00:00:00 2015-07-01 0 21 2015 Jul 120.
8 Birrarung Marr 2015-08-01 00:00:00 2015-08-01 0 86 2015 Aug 130.
9 Birrarung Marr 2015-09-01 00:00:00 2015-09-01 0 12 2015 Sep 141
10 Birrarung Marr 2015-10-01 00:00:00 2015-10-01 0 22 2015 Oct 168.
# … with 14 more rows
Another option is a dplyr verb slide-arize() that would internally alter the virtual groups. This would mean there is no need for a subclass, which I'm all for. I'm sure it could be refined a bit to work right with tsibble objects. Notice how much it simplifies the example from the window function vignette
(it breaks the yearmonth class too when binding groups together, but maybe that can be fixed)
set.seed(123)
date <- as.Date(c("2019-01-01", "2019-01-02", "2019-02-01", "2019-02-02", "2019-02-03", "2019-03-01", "2019-04-01", "2019-04-02"))
idx <- yearmonth(date)
tib <- tibble(
g = c("A", "A", "A", "B", "B", "B", "B", "B"),
x = rnorm(length(g)),
date = date,
idx = idx
)
tib
#> # A tibble: 8 x 4
#> g x date idx
#> <chr> <dbl> <date> <mth>
#> 1 A -0.560 2019-01-01 2019 Jan
#> 2 A -0.230 2019-01-02 2019 Jan
#> 3 A 1.56 2019-02-01 2019 Feb
#> 4 B 0.0705 2019-02-02 2019 Feb
#> 5 B 0.129 2019-02-03 2019 Feb
#> 6 B 1.72 2019-03-01 2019 Mar
#> 7 B 0.461 2019-04-01 2019 Apr
#> 8 B -1.27 2019-04-02 2019 Apr
tib %>%
slidearize(y = mean(x), .index = idx, .before = 1)
#> # A tibble: 4 x 2
#> idx y
#> <mth> <dbl>
#> 1 2019 Jan -0.395
#> 2 2019 Feb 0.194
#> 3 2019 Mar 0.868
#> 4 2019 Apr 0.304
tib %>%
group_by(g) %>%
slidearize(y = mean(x), .index = idx, .before = 1)
#> # A tibble: 5 x 3
#> # Groups: g [2]
#> g idx y
#> <chr> <date> <dbl>
#> 1 A 2019-01-01 -0.395
#> 2 A 2019-02-01 0.256
#> 3 B 2019-02-01 -0.395
#> 4 B 2019-03-01 0.256
#> 5 B 2019-04-01 0.586
as_tibble(pedestrian) %>%
mutate(ym = yearmonth(Date)) %>%
slidearize(Count = mean(Count), .index = ym, .before = 1, .after = 1)
#> # A tibble: 24 x 2
#> ym Count
#> <mth> <dbl>
#> 1 2015 Jan 559.
#> 2 2015 Feb 626.
#> 3 2015 Mar 658.
#> 4 2015 Apr 678.
#> 5 2015 May 637.
#> 6 2015 Jun 651.
#> 7 2015 Jul 627.
#> 8 2015 Aug 635.
#> 9 2015 Sep 646.
#> 10 2015 Oct 661.
#> # … with 14 more rows
as_tibble(pedestrian) %>%
mutate(ym = yearmonth(Date)) %>%
group_by(Sensor) %>%
slidearize(Count = mean(Count), .index = ym, .before = 1, .after = 1)
#> # A tibble: 95 x 3
#> # Groups: Sensor [4]
#> Sensor ym Count
#> <chr> <date> <dbl>
#> 1 Birrarung Marr 2015-01-01 592.
#> 2 Birrarung Marr 2015-02-01 634.
#> 3 Birrarung Marr 2015-03-01 546.
#> 4 Birrarung Marr 2015-04-01 554.
#> 5 Birrarung Marr 2015-05-01 397.
#> 6 Birrarung Marr 2015-06-01 429.
#> 7 Birrarung Marr 2015-07-01 390.
#> 8 Birrarung Marr 2015-08-01 398.
#> 9 Birrarung Marr 2015-09-01 392.
#> 10 Birrarung Marr 2015-10-01 539.
#> # … with 85 more rows
pedestrian %>%
mutate(ym = yearmonth(Date_Time)) %>%
tidyr::nest(data = c(-Sensor, -ym)) %>%
group_by(Sensor) %>%
mutate(Monthly_MA = slide_dbl(data, ~ mean(.$Count, na.rm = TRUE), .size = 3, .align = "center", .bind = TRUE
))
#> # A tibble: 95 x 4
#> # Groups: Sensor [4]
#> Sensor ym data Monthly_MA
#> <chr> <mth> <list> <dbl>
#> 1 Birrarung Marr 2015 Jan <tsibble [744 × 4]> NA
#> 2 Birrarung Marr 2015 Feb <tsibble [672 × 4]> 634.
#> 3 Birrarung Marr 2015 Mar <tsibble [744 × 4]> 546.
#> 4 Birrarung Marr 2015 Apr <tsibble [720 × 4]> 554.
#> 5 Birrarung Marr 2015 May <tsibble [144 × 4]> 397.
#> 6 Birrarung Marr 2015 Jun <tsibble [720 × 4]> 429.
#> 7 Birrarung Marr 2015 Jul <tsibble [744 × 4]> 390.
#> 8 Birrarung Marr 2015 Aug <tsibble [744 × 4]> 398.
#> 9 Birrarung Marr 2015 Sep <tsibble [720 × 4]> 392.
#> 10 Birrarung Marr 2015 Oct <tsibble [119 × 4]> 539.
#> # … with 85 more rows
make_virtual_group_info <- function(index, rows, before, after, step, sym) {
index <- vec_slice(index, rows)
index_split <- vec_split_id(index)
endpoints <- locate_endpoints(before, after, step, index_split$key) # the only special thing
new_rows <- purrr::map2(endpoints$starts, endpoints$stops, ~vec_c(!!!vec_slice(index_split$id, seq2(.x, .y))))
tibble(!!sym := index_split$key, !!expr(.rows) := new_rows)
}
slidearize <- function(.data, ...) {
UseMethod("slidearize")
}
slidearize.tbl_df <- function(.data, ..., .index, .before = 0L, .after = 0L, .step = 1L) {
.index_chr <- tidyselect::vars_select(names(.data), !!enquo(.index))
.index_sym <- sym(.index_chr)
.index <- dplyr::pull(.data, .index_chr)
group_info <- make_virtual_group_info(
.index,
vec_seq_along(.index),
before = .before,
after = .after,
step = .step,
sym = .index_sym
)
.data <- new_grouped_df(.data, group_info)
summarise(.data, ...)
}
slidearize.grouped_df <- function(.data, ..., .index, .before = 0L, .after = 0L, .step = 1L) {
.index_chr <- tidyselect::vars_select(names(.data), !!enquo(.index))
.index_sym <- sym(.index_chr)
.index <- dplyr::pull(.data, .index_chr)
group_info <- dplyr::group_data(.data)
group_rows <- group_info[[".rows"]]
group_info[[".rows"]] <- lapply(
group_rows,
make_virtual_group_info,
index = .index,
before = .before,
after = .after,
step = .step,
sym = .index_sym
)
group_info <- tidyr::unnest(group_info, cols = .rows)
attr(.data, "groups") <- group_info
summarise(.data, ...)
}
If we were to have a family of roll_by() functions I'd expect them to work with mutate(), filter(), etc. To achieve this a subclass would be required unfortunately (upon confirming this, I put the idea on hold for a bit).
While slidarise() is easy to make, it feels limited to me. Almost like a shortcut for data %>% slide_by() %>% summarise().
For instance, a common use of rolling windows is to compare the raw data with a rolling mean.
library(tsibble)
library(dplyr)
library(ggplot2)
tsibbledata::gafa_stock %>%
group_by(Symbol) %>%
mutate(Close_MA = slide_dbl(Close, mean, .size = 7, .align = "c"))
#> # A tsibble: 5,032 x 9 [!]
#> # Key: Symbol [4]
#> # Groups: Symbol [4]
#> Symbol Date Open High Low Close Adj_Close Volume Close_MA
#> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AAPL 2014-01-02 79.4 79.6 78.9 79.0 67.0 58671200 NA
#> 2 AAPL 2014-01-03 79.0 79.1 77.2 77.3 65.5 98116900 NA
#> 3 AAPL 2014-01-06 76.8 78.1 76.2 77.7 65.9 103152700 NA
#> 4 AAPL 2014-01-07 77.8 78.0 76.8 77.1 65.4 79302300 77.4
#> 5 AAPL 2014-01-08 77.0 77.9 77.0 77.6 65.8 64632400 77.0
#> 6 AAPL 2014-01-09 78.1 78.1 76.5 76.6 65.0 69787200 77.1
#> 7 AAPL 2014-01-10 77.1 77.3 75.9 76.1 64.5 76244000 77.4
#> 8 AAPL 2014-01-13 75.7 77.5 75.7 76.5 64.9 94623200 77.7
#> 9 AAPL 2014-01-14 76.9 78.1 76.8 78.1 66.1 83140400 77.6
#> 10 AAPL 2014-01-15 79.1 80.0 78.8 79.6 67.5 97909700 77.9
#> # … with 5,022 more rows

The theoretical equivalent with roll_by would be
library(tsibble)
library(dplyr)
library(ggplot2)
tsibbledata::gafa_stock %>%
group_by(Symbol) %>%
slide_by(.size = 7, .align = "c") %>%
mutate(Close_MA = mean(Close))
Any differences in dimension are padded according to .fill = NA. Of course things like .align are up for improvement (as done in slurrr). If using summarise() then the dimension would be reduced to the number of rolling windows.