tsibble Experiment with roll

trafficstars

slide_by(), stretch_by(), tile_by() which integrates with dplyr's mutate(), summarise(), etc.

Jul 26 '19 05:07 mitchelloharawild

While the data.frame method for these functions should operate on row indices, it is possible for the tbl_ts method to also support time-aware inputs such as durations (#130).

Jul 26 '19 09:07 mitchelloharawild

The main issue I’m concerning is the mable object once modeling the slided tsibble, model can’t be lazily evaluated.

Jul 27 '19 01:07 earowang

I talked to Hadley about this. He thinks it's a potentially good approach, more like a lazily evaluated grouped data frame. He also mentioned the tidymodels bootstrapping approach https://davisvaughan.github.io/strapgod/articles/dplyr-support.html

Jul 31 '19 13:07 earowang

That sounds promising. There will need to be more work than I had hoped to get mutate(), filter(), etc. to behave as expected - so a new class with dplyr methods will be required.

Aug 01 '19 02:08 mitchelloharawild

While I think this functionality is useful, I want to wait to see more developments with slurrr (#143) before working on this further.

The original motivating issue with the window "id" being part of the key structure has now been resolved in fabletools by supporting disjoint hierarchies.

Further, if slurrr is the package location going further, this functionality should belong in that package. Maybe @DavisVaughan can consider it's implementation/use, and tsibble can later add a tbl_ts method that supports time aware sliding inputs.

Aug 15 '19 06:08 mitchelloharawild

oh, i think slide_by should stay in tsibble, no conflicts with slurrr.

On Thu, 15 Aug 2019 at 16:31, mitchelloharawild [email protected] wrote:

While I think this functionality is useful, I want to wait to see more developments with slurrr (#143 https://github.com/tidyverts/tsibble/issues/143) before working on this further.

The original motivating issue with the window "id" being part of the key structure has now been resolved in fabletools by supporting disjoint hierarchies.

Further, if slurrr is the package location going further, this functionality should belong in that package. Maybe @DavisVaughan https://github.com/DavisVaughan can consider it's implementation/use, and tsibble can later add a tbl_ts method that supports time aware sliding inputs.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/tidyverts/tsibble/issues/137?email_source=notifications&email_token=AASXVGE6RBGVGM7EEBPUILTQETZ2NA5CNFSM4IHAT4LKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4K67YA#issuecomment-521531360, or mute the thread https://github.com/notifications/unsubscribe-auth/AASXVGFXDY7VRWQAOYWLZKLQETZ2NANCNFSM4IHAT4LA .

Aug 15 '19 06:08 earowang

Oooh I had an idea for this too yesterday. It does use the lazy grouping feature of dplyr, but would require a new data frame subclass.

I'm not quite convinced that it would be worth it over a version of slide() that has an .index argument, which I mention in #143. They would essentially compute the same thing, but the size of the slide() result would be vec_size(.x) and the size of the slide_by result would be vec_size(unique(.index)).

I think it would be pretty hard (not impossible) to get another df subclass that would be able to 1) add and remove virtual groups correctly 2) be able to subclass on top of tsibble or a data frame, and know how to drop back to a pure tsibble after some kind of summary operation where the virtual groups are materialized. I think its much "easier" to let slide() be unaware of dplyr, but still do what you'd want here.

This is just a partial example of how it might look.

> pedestrian_small
# A tsibble: 240 x 6 [1h] <Australia/Melbourne>
# Key:       Sensor [1]
   Sensor         Date_Time           Date        Time Count       ym
   <chr>          <dttm>              <date>     <int> <int>    <mth>
 1 Birrarung Marr 2015-01-01 00:00:00 2015-01-01     0  1630 2015 Jan
 2 Birrarung Marr 2015-01-01 01:00:00 2015-01-01     1   826 2015 Jan
 3 Birrarung Marr 2015-01-01 02:00:00 2015-01-01     2   567 2015 Jan
 4 Birrarung Marr 2015-01-01 03:00:00 2015-01-01     3   264 2015 Jan
 5 Birrarung Marr 2015-01-01 04:00:00 2015-01-01     4   139 2015 Jan
 6 Birrarung Marr 2015-01-01 05:00:00 2015-01-01     5    77 2015 Jan
 7 Birrarung Marr 2015-01-01 06:00:00 2015-01-01     6    44 2015 Jan
 8 Birrarung Marr 2015-01-01 07:00:00 2015-01-01     7    56 2015 Jan
 9 Birrarung Marr 2015-01-01 08:00:00 2015-01-01     8   113 2015 Jan
10 Birrarung Marr 2015-01-01 09:00:00 2015-01-01     9   166 2015 Jan
# … with 230 more rows
> 
> unique(pedestrian_small$ym)
 [1] "2015 Jan" "2015 Feb" "2015 Mar" "2015 Apr" "2015 May" "2015 Jun" "2015 Jul" "2015 Aug" "2015 Sep" "2015 Oct"
[11] "2015 Nov" "2015 Dec" "2016 Jan" "2016 Feb" "2016 Mar" "2016 Apr" "2016 May" "2016 Jun" "2016 Jul" "2016 Aug"
[21] "2016 Sep" "2016 Oct" "2016 Nov" "2016 Dec"
> 
> pedestrian_small_slide_grouped <- slide_by(pedestrian_small, ym, before = 1, after = 1, step = 1)
> pedestrian_small_slide_grouped
# A tibble: 240 x 6
# Groups:   .start, .stop [24]
   Sensor         Date_Time           Date        Time Count       ym
   <chr>          <dttm>              <date>     <int> <int>    <mth>
 1 Birrarung Marr 2015-01-01 00:00:00 2015-01-01     0  1630 2015 Jan
 2 Birrarung Marr 2015-01-01 01:00:00 2015-01-01     1   826 2015 Jan
 3 Birrarung Marr 2015-01-01 02:00:00 2015-01-01     2   567 2015 Jan
 4 Birrarung Marr 2015-01-01 03:00:00 2015-01-01     3   264 2015 Jan
 5 Birrarung Marr 2015-01-01 04:00:00 2015-01-01     4   139 2015 Jan
 6 Birrarung Marr 2015-01-01 05:00:00 2015-01-01     5    77 2015 Jan
 7 Birrarung Marr 2015-01-01 06:00:00 2015-01-01     6    44 2015 Jan
 8 Birrarung Marr 2015-01-01 07:00:00 2015-01-01     7    56 2015 Jan
 9 Birrarung Marr 2015-01-01 08:00:00 2015-01-01     8   113 2015 Jan
10 Birrarung Marr 2015-01-01 09:00:00 2015-01-01     9   166 2015 Jan
# … with 230 more rows
> 
> pedestrian_small_slide_grouped %>%
+   summarise(
+     ym_start = first(ym),
+     ym_end = last(ym),
+     MonthlyCount = mean(Count)
+   )
# A tibble: 24 x 5
# Groups:   .start [23]
   .start     .stop      ym_start   ym_end MonthlyCount
   <date>     <date>        <mth>    <mth>        <dbl>
 1 2015-01-01 2015-02-01 2015 Jan 2015 Feb         227.
 2 2015-01-01 2015-03-01 2015 Jan 2015 Mar         180.
 3 2015-02-01 2015-04-01 2015 Feb 2015 Apr         112.
 4 2015-03-01 2015-05-01 2015 Mar 2015 May         147.
 5 2015-04-01 2015-06-01 2015 Apr 2015 Jun         165.
 6 2015-05-01 2015-07-01 2015 May 2015 Jul         160.
 7 2015-06-01 2015-08-01 2015 Jun 2015 Aug         120.
 8 2015-07-01 2015-09-01 2015 Jul 2015 Sep         130.
 9 2015-08-01 2015-10-01 2015 Aug 2015 Oct         141 
10 2015-09-01 2015-11-01 2015 Sep 2015 Nov         168.
# … with 14 more rows
> 
> pedestrian_small_slide_along <- mutate(
+   pedestrian_small,
+   col = slide_along_impl(Count, ~mean(.x), .index = ym, .ptype = dbl(), .before = 1, .after = 1)
+ )
> 
> pedestrian_small_slide_along
# A tsibble: 240 x 7 [1h] <Australia/Melbourne>
# Key:       Sensor [1]
   Sensor         Date_Time           Date        Time Count       ym   col
   <chr>          <dttm>              <date>     <int> <int>    <mth> <dbl>
 1 Birrarung Marr 2015-01-01 00:00:00 2015-01-01     0  1630 2015 Jan  227.
 2 Birrarung Marr 2015-01-01 01:00:00 2015-01-01     1   826 2015 Jan  227.
 3 Birrarung Marr 2015-01-01 02:00:00 2015-01-01     2   567 2015 Jan  227.
 4 Birrarung Marr 2015-01-01 03:00:00 2015-01-01     3   264 2015 Jan  227.
 5 Birrarung Marr 2015-01-01 04:00:00 2015-01-01     4   139 2015 Jan  227.
 6 Birrarung Marr 2015-01-01 05:00:00 2015-01-01     5    77 2015 Jan  227.
 7 Birrarung Marr 2015-01-01 06:00:00 2015-01-01     6    44 2015 Jan  227.
 8 Birrarung Marr 2015-01-01 07:00:00 2015-01-01     7    56 2015 Jan  227.
 9 Birrarung Marr 2015-01-01 08:00:00 2015-01-01     8   113 2015 Jan  227.
10 Birrarung Marr 2015-01-01 09:00:00 2015-01-01     9   166 2015 Jan  227.
# … with 230 more rows
> 
> # group by ym and slice(1) to get it down to the right number of rows
> pedestrian_small_slide_along %>%
+   group_by(ym) %>%
+   slice(1)
# A tsibble: 24 x 7 [1h] <Australia/Melbourne>
# Key:       Sensor [1]
# Groups:    ym [24]
   Sensor         Date_Time           Date        Time Count       ym   col
   <chr>          <dttm>              <date>     <int> <int>    <mth> <dbl>
 1 Birrarung Marr 2015-01-01 00:00:00 2015-01-01     0  1630 2015 Jan  227.
 2 Birrarung Marr 2015-02-01 00:00:00 2015-02-01     0   178 2015 Feb  180.
 3 Birrarung Marr 2015-03-01 00:00:00 2015-03-01     0    69 2015 Mar  112.
 4 Birrarung Marr 2015-04-01 00:00:00 2015-04-01     0    25 2015 Apr  147.
 5 Birrarung Marr 2015-05-01 00:00:00 2015-05-01     0    13 2015 May  165.
 6 Birrarung Marr 2015-06-01 00:00:00 2015-06-01     0     9 2015 Jun  160.
 7 Birrarung Marr 2015-07-01 00:00:00 2015-07-01     0    21 2015 Jul  120.
 8 Birrarung Marr 2015-08-01 00:00:00 2015-08-01     0    86 2015 Aug  130.
 9 Birrarung Marr 2015-09-01 00:00:00 2015-09-01     0    12 2015 Sep  141 
10 Birrarung Marr 2015-10-01 00:00:00 2015-10-01     0    22 2015 Oct  168.
# … with 14 more rows

Aug 15 '19 12:08 DavisVaughan

Another option is a dplyr verb slide-arize() that would internally alter the virtual groups. This would mean there is no need for a subclass, which I'm all for. I'm sure it could be refined a bit to work right with tsibble objects. Notice how much it simplifies the example from the window function vignette

(it breaks the yearmonth class too when binding groups together, but maybe that can be fixed)

set.seed(123)

date <- as.Date(c("2019-01-01", "2019-01-02", "2019-02-01", "2019-02-02", "2019-02-03", "2019-03-01", "2019-04-01", "2019-04-02"))
idx <- yearmonth(date)

tib <- tibble(
  g = c("A", "A", "A", "B", "B", "B", "B", "B"),
  x = rnorm(length(g)),
  date = date,
  idx = idx
)

tib
#> # A tibble: 8 x 4
#>   g           x date            idx
#>   <chr>   <dbl> <date>        <mth>
#> 1 A     -0.560  2019-01-01 2019 Jan
#> 2 A     -0.230  2019-01-02 2019 Jan
#> 3 A      1.56   2019-02-01 2019 Feb
#> 4 B      0.0705 2019-02-02 2019 Feb
#> 5 B      0.129  2019-02-03 2019 Feb
#> 6 B      1.72   2019-03-01 2019 Mar
#> 7 B      0.461  2019-04-01 2019 Apr
#> 8 B     -1.27   2019-04-02 2019 Apr

tib %>%
  slidearize(y = mean(x), .index = idx, .before = 1)
#> # A tibble: 4 x 2
#>        idx      y
#>      <mth>  <dbl>
#> 1 2019 Jan -0.395
#> 2 2019 Feb  0.194
#> 3 2019 Mar  0.868
#> 4 2019 Apr  0.304

tib %>%
  group_by(g) %>%
  slidearize(y = mean(x), .index = idx, .before = 1)
#> # A tibble: 5 x 3
#> # Groups:   g [2]
#>   g     idx             y
#>   <chr> <date>      <dbl>
#> 1 A     2019-01-01 -0.395
#> 2 A     2019-02-01  0.256
#> 3 B     2019-02-01 -0.395
#> 4 B     2019-03-01  0.256
#> 5 B     2019-04-01  0.586

as_tibble(pedestrian) %>%
  mutate(ym = yearmonth(Date)) %>%
  slidearize(Count = mean(Count), .index = ym, .before = 1, .after = 1)
#> # A tibble: 24 x 2
#>          ym Count
#>       <mth> <dbl>
#>  1 2015 Jan  559.
#>  2 2015 Feb  626.
#>  3 2015 Mar  658.
#>  4 2015 Apr  678.
#>  5 2015 May  637.
#>  6 2015 Jun  651.
#>  7 2015 Jul  627.
#>  8 2015 Aug  635.
#>  9 2015 Sep  646.
#> 10 2015 Oct  661.
#> # … with 14 more rows

as_tibble(pedestrian) %>%
  mutate(ym = yearmonth(Date)) %>%
  group_by(Sensor) %>%
  slidearize(Count = mean(Count), .index = ym, .before = 1, .after = 1)
#> # A tibble: 95 x 3
#> # Groups:   Sensor [4]
#>    Sensor         ym         Count
#>    <chr>          <date>     <dbl>
#>  1 Birrarung Marr 2015-01-01  592.
#>  2 Birrarung Marr 2015-02-01  634.
#>  3 Birrarung Marr 2015-03-01  546.
#>  4 Birrarung Marr 2015-04-01  554.
#>  5 Birrarung Marr 2015-05-01  397.
#>  6 Birrarung Marr 2015-06-01  429.
#>  7 Birrarung Marr 2015-07-01  390.
#>  8 Birrarung Marr 2015-08-01  398.
#>  9 Birrarung Marr 2015-09-01  392.
#> 10 Birrarung Marr 2015-10-01  539.
#> # … with 85 more rows

pedestrian %>%
  mutate(ym = yearmonth(Date_Time)) %>%
  tidyr::nest(data = c(-Sensor, -ym)) %>%
  group_by(Sensor) %>%
  mutate(Monthly_MA = slide_dbl(data, ~ mean(.$Count, na.rm = TRUE), .size = 3, .align = "center", .bind = TRUE
  ))
#> # A tibble: 95 x 4
#> # Groups:   Sensor [4]
#>    Sensor               ym data                Monthly_MA
#>    <chr>             <mth> <list>                   <dbl>
#>  1 Birrarung Marr 2015 Jan <tsibble [744 × 4]>        NA 
#>  2 Birrarung Marr 2015 Feb <tsibble [672 × 4]>       634.
#>  3 Birrarung Marr 2015 Mar <tsibble [744 × 4]>       546.
#>  4 Birrarung Marr 2015 Apr <tsibble [720 × 4]>       554.
#>  5 Birrarung Marr 2015 May <tsibble [144 × 4]>       397.
#>  6 Birrarung Marr 2015 Jun <tsibble [720 × 4]>       429.
#>  7 Birrarung Marr 2015 Jul <tsibble [744 × 4]>       390.
#>  8 Birrarung Marr 2015 Aug <tsibble [744 × 4]>       398.
#>  9 Birrarung Marr 2015 Sep <tsibble [720 × 4]>       392.
#> 10 Birrarung Marr 2015 Oct <tsibble [119 × 4]>       539.
#> # … with 85 more rows

make_virtual_group_info <- function(index, rows, before, after, step, sym) {
  index <- vec_slice(index, rows)
  index_split <- vec_split_id(index)
  endpoints <- locate_endpoints(before, after, step, index_split$key) # the only special thing
  new_rows <- purrr::map2(endpoints$starts, endpoints$stops, ~vec_c(!!!vec_slice(index_split$id, seq2(.x, .y))))
  tibble(!!sym := index_split$key, !!expr(.rows) := new_rows)
}

slidearize <- function(.data, ...) {
  UseMethod("slidearize")
}

slidearize.tbl_df <- function(.data, ..., .index, .before = 0L, .after = 0L, .step = 1L) {
  .index_chr <- tidyselect::vars_select(names(.data), !!enquo(.index))
  .index_sym <- sym(.index_chr)
  .index <- dplyr::pull(.data, .index_chr)

  group_info <- make_virtual_group_info(
    .index,
    vec_seq_along(.index),
    before = .before,
    after = .after,
    step = .step,
    sym = .index_sym
  )

  .data <- new_grouped_df(.data, group_info)
  summarise(.data, ...)
}

slidearize.grouped_df <- function(.data, ..., .index, .before = 0L, .after = 0L, .step = 1L) {
  .index_chr <- tidyselect::vars_select(names(.data), !!enquo(.index))
  .index_sym <- sym(.index_chr)
  .index <- dplyr::pull(.data, .index_chr)
  group_info <- dplyr::group_data(.data)
  group_rows <- group_info[[".rows"]]

  group_info[[".rows"]] <- lapply(
    group_rows,
    make_virtual_group_info,
    index = .index,
    before = .before,
    after = .after,
    step = .step,
    sym = .index_sym
  )

  group_info <- tidyr::unnest(group_info, cols = .rows)

  attr(.data, "groups") <- group_info

  summarise(.data, ...)
}

Aug 15 '19 15:08 DavisVaughan

If we were to have a family of roll_by() functions I'd expect them to work with mutate(), filter(), etc. To achieve this a subclass would be required unfortunately (upon confirming this, I put the idea on hold for a bit).

While slidarise() is easy to make, it feels limited to me. Almost like a shortcut for data %>% slide_by() %>% summarise().

For instance, a common use of rolling windows is to compare the raw data with a rolling mean.

library(tsibble)
library(dplyr)
library(ggplot2)
tsibbledata::gafa_stock %>% 
  group_by(Symbol) %>% 
  mutate(Close_MA = slide_dbl(Close, mean, .size = 7, .align = "c"))

#> # A tsibble: 5,032 x 9 [!]
#> # Key:       Symbol [4]
#> # Groups:    Symbol [4]
#>    Symbol Date        Open  High   Low Close Adj_Close    Volume Close_MA
#>    <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>     <dbl>    <dbl>
#>  1 AAPL   2014-01-02  79.4  79.6  78.9  79.0      67.0  58671200     NA  
#>  2 AAPL   2014-01-03  79.0  79.1  77.2  77.3      65.5  98116900     NA  
#>  3 AAPL   2014-01-06  76.8  78.1  76.2  77.7      65.9 103152700     NA  
#>  4 AAPL   2014-01-07  77.8  78.0  76.8  77.1      65.4  79302300     77.4
#>  5 AAPL   2014-01-08  77.0  77.9  77.0  77.6      65.8  64632400     77.0
#>  6 AAPL   2014-01-09  78.1  78.1  76.5  76.6      65.0  69787200     77.1
#>  7 AAPL   2014-01-10  77.1  77.3  75.9  76.1      64.5  76244000     77.4
#>  8 AAPL   2014-01-13  75.7  77.5  75.7  76.5      64.9  94623200     77.7
#>  9 AAPL   2014-01-14  76.9  78.1  76.8  78.1      66.1  83140400     77.6
#> 10 AAPL   2014-01-15  79.1  80.0  78.8  79.6      67.5  97909700     77.9
#> # … with 5,022 more rows

The theoretical equivalent with roll_by would be

library(tsibble)
library(dplyr)
library(ggplot2)
tsibbledata::gafa_stock %>% 
  group_by(Symbol) %>% 
  slide_by(.size = 7, .align = "c") %>% 
  mutate(Close_MA = mean(Close))

Any differences in dimension are padded according to .fill = NA. Of course things like .align are up for improvement (as done in slurrr). If using summarise() then the dimension would be reduced to the number of rolling windows.

Aug 15 '19 23:08 mitchelloharawild

tsibble
tsibble copied to clipboard

Experiment with roll_by family

tsibble tsibble copied to clipboard

Experiment with roll_by family

tsibble
tsibble copied to clipboard