implement `future.apply` in `forecast` as it is in `model`
Using `fable` at any kind of scale gets painful if computation of forecasts isn't also parallelized. I've got a possible implementation I'll PR tomorrow.
A PR for this would be great, thanks!
My suggestion for this would be to drop the use of `mutate_at()` and replace it with an internal util function which iterates over the mable with parallel support. There is a common pattern throughout the package where functions applied to a mable need to operate over each model, and a more general (and parallelisable) helper function for this would be great (for example, `generate()` should also be possible in parallel).
I've got to pull a couple of errant commits out, but I'll send it along momentarily. I reversed the order of the `pivot_longer` and changed the `mutate_at` to `mapply` to mirror the `future.apply` implementation.
I'm happy to make a util function that does the iteration piece. It would essentially start from the point at which the data has been `pivot_longer`'ed and iterate across it with either `mapply` or `future_mapply`, given that they take the same arguments (other than the `future`-specific ones). Take everything that's passed in additionally and place it in `MoreArgs`, then select the method based on the availability of `future`.
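A minimal sketch of that dispatch idea (the function name and signature here are placeholders; the actual implementations later in this thread supersede it):

```r
# Sketch only: collect extra arguments into MoreArgs and pick the backend
# based on whether {future} is attached.
iterate_models <- function(.f, ..., MoreArgs = list()) {
  if ("package:future" %in% search()) {
    future.apply::future_mapply(.f, ..., MoreArgs = MoreArgs, SIMPLIFY = FALSE)
  } else {
    mapply(.f, ..., MoreArgs = MoreArgs, SIMPLIFY = FALSE)
  }
}
```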
This is an 80-core machine trying to churn through forecasts for the models it fit between 5:45pm and 6pm. It's still at it, 4 hours later. This is a pretty strong motivator, as I'll be explaining the bill later. :)
I think it could be safer / more efficient if the internal util also handled the `pivot_longer` bit, to avoid possible variable name conflicts arising from the pivoting process.
Makes sense, and shouldn't be any tougher. It won't `pivot_wider` to return to the previous structure, because `forecast()` specifically needs it not to do so. I can't think of an instance where it would need to, but I can add a param to return it to the original structure, if you'd like.
An example of returning to the previous structure is when the operation returns a `<mable>`. For example, `refit(<mable>, <tsibble>)` returns a `<mable>` of the same structure. The same applies for `stream(<mable>, <tsibble>)`, which adds/extends the underlying data.
Makes sense. Should be straightforward enough.
To make modeling and forecasting more memory efficient for `tsibble`s with many keys (and this should probably go for any `fabletools` process), it might make sense to shrink the environment of each worker in the `future_mapply` calls. This means we'll need to figure out which packages, like `fable.prophet`, might be attached and will need to be available on the workers. Obviously, if we can generally rely on the idea that the namespace will begin with `fable.`, then it's pretty easy. I don't know if that's something you've thought about and feel strongly about, but it seems reasonable enough. Otherwise, we might need to crawl across the functions within the call to find all of their namespaces and add them that way, which seems a bit more fraught with peril.
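As a rough sketch of the namespace-prefix idea (`fable_packages()` and `parallel_over_models()` are hypothetical names; `future.packages` is a real argument of the `future.apply` functions):

```r
# Sketch only: find attached packages whose names begin with "fable" and
# ship just those to the workers.
fable_packages <- function() {
  attached <- sub("^package:", "", grep("^package:", search(), value = TRUE))
  grep("^fable", attached, value = TRUE)
}

parallel_over_models <- function(.f, ...) {
  future.apply::future_mapply(
    FUN = .f, ...,
    SIMPLIFY = FALSE,
    future.globals = FALSE,
    future.packages = fable_packages()  # load fable extensions on each worker
  )
}
```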
Just following along as I use fable for millions of time series. If this is in the newest version of the GitHub release, I'd be happy to help test; otherwise, just saying I appreciate the work here!
On further thought and a bug report, converting to longer before parallelising isn't appropriate because the column structure defines the way in which the forecasts are handled (such as reconciliation). I'll have a go at adding the conditionally parallel `mapply` function.
Look at what's in my branch right now. There was an issue with how `new_data` was working, but I have a version that I have been testing all afternoon and is working for it. I didn't understand how `key_data` was being used, and had messed that portion up in terms of how it got pushed to `forecast` along with the `new_data`.
Using what's currently in the `parallel_forecasting` branch, here's an example of parallel modeling and forecasting. I'm sorry it's so long, but I needed something with regressors, and I wanted to make sure I included `fable.prophet` in there to make sure the `capacity` and `floor` for logistic growth work properly.
```r
devtools::install_github("davidtedfordholt/fabletools", ref = "parallel_forecasting")

library(tsibbledata)
library(tsibble)
library(dplyr)
library(tidyr)
library(fable)
library(fabletools)
library(fable.prophet)
library(future)
library(future.apply)

options(future.fork.enable = TRUE, future.rng.onMisuse = "ignore")

data <- gafa_stock %>%
  select(Volume, Open) %>%
  fill_gaps(Volume = 0, .full = TRUE) %>%
  group_by_key() %>%
  fill(Open) %>%
  ungroup()

train <- data %>%
  filter(Date < as.Date("2018-01-01")) %>%
  mutate(capacity = max(Open) * 4,
         floor = min(Open) * .25) %>%
  ungroup()

test <- train %>%
  new_data(n = 365) %>%
  left_join(select(data, -Open), by = c("Date", "Symbol")) %>%
  left_join(train %>%
              as_tibble() %>%
              group_by(Symbol) %>%
              summarise(capacity = last(capacity),
                        floor = last(floor)),
            by = "Symbol")

# SEQUENTIAL --------------------------
plan(sequential)

start_sequential_modeling <- Sys.time()
models <-
  train %>%
  model(
    arima = ARIMA(Open),
    tslm = TSLM(Open ~ Volume + season()),
    naive = NAIVE(Open),
    prophet = prophet(
      Open ~
        Volume +
        growth(type = "logistic", capacity = capacity, floor = floor)))
end_sequential_modeling <- Sys.time()
print(paste("Sequential modeling took", format(end_sequential_modeling - start_sequential_modeling)))

forecasts <- forecast(models, new_data = test)
print(paste("Sequential forecasting took", format(Sys.time() - end_sequential_modeling)))

# MULTICORE --------------------------
plan(multicore)

start_multicore_modeling <- Sys.time()
models_multicore <-
  train %>%
  model(
    arima = ARIMA(Open),
    tslm = TSLM(Open ~ Volume + season()),
    naive = NAIVE(Open),
    prophet = prophet(
      Open ~
        Volume +
        growth(type = "logistic", capacity = capacity, floor = floor)))
end_multicore_modeling <- Sys.time()
print(paste("Multicore modeling took", format(end_multicore_modeling - start_multicore_modeling)))

forecasts_multicore <- forecast(models_multicore, new_data = test)
print(paste("Multicore forecasting took", format(Sys.time() - end_multicore_modeling)))

# MULTISESSION --------------------------
plan(multisession)

start_multisession_modeling <- Sys.time()
models_multisession <-
  train %>%
  model(
    arima = ARIMA(Open),
    tslm = TSLM(Open ~ Volume + season()),
    naive = NAIVE(Open),
    prophet = prophet(
      Open ~
        Volume +
        growth(type = "logistic", capacity = capacity, floor = floor)))
end_multisession_modeling <- Sys.time()
print(paste("Multisession modeling took", format(end_multisession_modeling - start_multisession_modeling)))

forecasts_multisession <- forecast(models_multisession, new_data = test)
print(paste("Multisession forecasting took", format(Sys.time() - end_multisession_modeling)))
```
Output for me was:
[1] "Sequential modeling took 6.105713 secs"
[1] "Sequential forecasting took 3.273358 secs"
[1] "Multicore modeling took 2.228881 secs"
[1] "Multicore forecasting took 1.384231 secs"
[1] "Multisession modeling took 14.54192 secs"
[1] "Multisession forecasting took 1.457515 secs"
Obviously, the `multisession` overhead for modeling is a bit painful, but that should drop on a larger dataset. I just wanted to make sure it works even when the workers aren't sharing the current session's environment.
Thanks, looks promising. Do you know if the data transfer overhead for multicore is lower than multisession? I think substantial gains will be had by reducing object sizes here.
I get similar speed-up:
[1] "Sequential modeling took 8.717845 secs"
[1] "Sequential forecasting took 3.696366 secs"
[1] "Multicore modeling took 3.74861 secs"
[1] "Multicore forecasting took 2.048967 secs"
[1] "Multisession modeling took 11.08729 secs"
[1] "Multisession forecasting took 2.403781 secs"
So the underlying issue with the overhead is that `multisession` actually creates another R session, copies the environment into it, then launches the process. `multicore` works within the current session. `multicore` is a bit unstable inside RStudio (hence me setting the option to allow forked processes), but works far better when you're running it in a session from the command line. You can reduce the overhead for `multisession` by scoping what has to be shipped to each worker, including which packages have to be loaded. That's the slowest thing, most likely.
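To illustrate the scoping point (a toy example, not from the branch):

```r
library(future)
library(future.apply)

plan(multisession, workers = 2)

# With future.globals = FALSE, the calling environment is not searched,
# serialized, and copied to each worker, which is where much of the
# multisession cost hides for large sessions.
res <- future_mapply(
  function(x, y) x + y, 1:8, 9:16,
  SIMPLIFY = FALSE,
  future.globals = FALSE
)

plan(sequential)
```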
Don't hate me for being super hacky, but throw this in after you build `data`.
```r
data <- data %>%
  bind_rows(mutate(data, Symbol = paste(Symbol, "_1"))) %>%
  bind_rows(mutate(data, Symbol = paste(Symbol, "_2"))) %>%
  bind_rows(mutate(data, Symbol = paste(Symbol, "_3"))) %>%
  bind_rows(mutate(data, Symbol = paste(Symbol, "_4"))) %>%
  bind_rows(mutate(data, Symbol = paste(Symbol, "_5"))) %>%
  bind_rows(mutate(data, Symbol = paste(Symbol, "_6"))) %>%
  bind_rows(mutate(data, Symbol = paste(Symbol, "_7")))
```
My results were:
[1] "Sequential modeling took 46.61608 secs"
[1] "Sequential forecasting took 21.95314 secs"
[1] "Multicore modeling took 14.79503 secs"
[1] "Multicore forecasting took 26.5846 secs"
[1] "Multisession modeling took 1.087287 mins"
[1] "Multisession forecasting took 7.000597 secs"
It's an interesting question about the overhead. I need to run it on something much larger.
I extended the data out to 60 keys. Also, I killed off Chrome and a bunch of other things to free up more cores.
[1] "Sequential modeling took 1.396766 mins"
[1] "Sequential forecasting took 37.79279 secs"
[1] "Multicore modeling took 26.57042 secs"
[1] "Multicore forecasting took 14.76845 secs"
[1] "Multisession modeling took 38.57623 secs"
[1] "Multisession forecasting took 11.76835 secs"
I added a second key to them and it still worked just fine. Tomorrow morning, I'm going to test it on the process for work, which takes about 2 hours sequentially, but the machine has 80 cores that are used during modeling but not during forecasting. Yet. It's a bear.
Sounds great. I think there is some low-hanging fruit for performance improvements which I'll try to get today.
Here's the base of an `mapply` util function. I'm interested in extending it to use `progressr`, if available, to give a progress bar.
```r
versatile_mapply <- function(...) {
  if (is_attached("package:future")) {
    require_package("future.apply")
    # require_package() only checks availability, so call via the namespace
    # rather than assuming future.apply is attached
    future.apply::future_mapply(..., future.globals = FALSE)
  } else {
    mapply(...)
  }
}
```
Anything written as an `mapply` should be able to have this swapped in.
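For example, a toy mapply-style call would run sequentially by default and in parallel once a `future` plan is set (note that `is_attached()` and `require_package()` are fabletools internals, so run this in that context or swap in `"package:future" %in% search()` and `requireNamespace()`):

```r
# Assumes versatile_mapply() from above is defined.
slow_add <- function(x, y) { Sys.sleep(0.01); x + y }

versatile_mapply(slow_add, 1:100, 101:200)   # sequential: future not attached

library(future)
plan(multisession)
versatile_mapply(slow_add, 1:100, 101:200)   # now dispatched to workers
plan(sequential)
```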
I agree that adding `progressr` reporting here is appropriate.
This is what I'm working with at the moment.
```r
mapply_maybe_parallel <- function(.f, ..., MoreArgs = list(), SIMPLIFY = FALSE) {
  if (is_attached("package:future")) {
    require_package("future.apply")
    future.apply::future_mapply(
      FUN = .f,
      ...,
      MoreArgs = MoreArgs,
      SIMPLIFY = SIMPLIFY,
      future.globals = FALSE
    )
  } else {
    mapply(
      FUN = .f,
      ...,
      MoreArgs = MoreArgs,
      SIMPLIFY = SIMPLIFY
    )
  }
}
```
```r
mable_apply <- function(.data, .f, ..., names_to = ".model") {
  mv <- mable_vars(.data)
  kv <- key_vars(.data)

  # Compute .f for all models in the mable
  result <- lapply(
    as_tibble(.data)[mv],
    mapply_maybe_parallel,
    .f = .f,
    .data,
    MoreArgs = dots_list(...)
  )
  num_rows <- lapply(result, vapply, nrow, integer(1L))

  # Assume the same tsibble structure for all outputs
  first_result <- result[[1]][[1]]
  result <- vec_rbind(!!!lapply(result, function(x) vec_rbind(!!!x)))

  # Compute the .model label
  model <- rep.int(mv, vapply(num_rows, sum, integer(1L)))

  # Repeat the key structure as needed
  .data <- .data[rep.int(seq_along(num_rows[[1]]), num_rows[[1]]), kv]

  # Combine into a single table
  .data <- bind_cols(.data, !!names_to := model, result)
  if (is_tsibble(first_result)) {
    .data <- build_tsibble(
      .data, key = c(kv, names_to, key_vars(first_result)),
      index = index_var(first_result), index2 = !!index2(first_result),
      ordered = is_ordered(first_result),
      interval = interval(first_result))
  }
  return(.data)
}
```
I like it. Here's something for the progress bar (because sleep is silly). Obviously, you're paying a lot more attention to the additional arguments getting passed through, which is great.
```r
possibly_fast_mapply <- function(FUN, ...) {
  if (is_attached("package:future")) {
    require_package("future.apply")
    future.apply::future_mapply(
      FUN = FUN, ...,
      SIMPLIFY = FALSE,
      future.globals = FALSE)
  } else {
    mapply(
      FUN = FUN, ...,
      SIMPLIFY = FALSE)
  }
}

versatile_mapply <- function(FUN, ...) {
  if (is_attached("package:progressr")) {
    require_package("progressr")
    # check to find the longest iterant, to define the progress bar
    # (names2() treats unnamed iterants as "", so they are kept)
    dots <- dots_list(...)
    iterants <- dots[!names2(dots) %in% c("MoreArgs", "SIMPLIFY", "USE.NAMES")]
    len <- 1:max(unlist(lapply(iterants, length)), na.rm = TRUE)
    with_progress({
      p <- progressor(along = len)
      out <- possibly_fast_mapply(
        FUN = function(...) {
          p()  # tick the progress bar on each iteration
          FUN(...)
        }, ...)
    })
  } else {
    out <- possibly_fast_mapply(FUN = FUN, ...)  # FUN must be passed on here
  }
  out
}
```
I was trying it out on `nothing_in_particular <- versatile_mapply(FUN = sum, x = 1:10000, y = 2:10001)`. I really like progress bars. Steal and adapt, if you like.
I definitely appreciate your refusal to play their `FUN` games, though. `.f` is so much more pleasant.
As a note, the issue in mitchelloharawild/fable.prophet#17 continues. It doesn't happen with `multicore`, since the environment is shared, but does with `multisession`. This means that running a `fable.prophet` model in RStudio (without that option change) will likely fail to fit.
One way to solve this would be to attach the holidays to the `tsibble` being modeled. Another would be to codify the way holidays work in general, rather than specifically for `fable.prophet`, and build in a check for them in the environment, ensuring that they are passed through to the workers. A third is, as you suggest in the other issue, carefully parsing through the formula for objects that might need to be added to the `future.globals` argument in the `future_mapply()` call.
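A sketch of that third option might look something like this (`fit_one`, `specs`, and `holidays` are hypothetical stand-ins for the per-model worker, the model specifications, and an object referenced in a formula but absent from the data):

```r
# Sketch only: explicitly ship a formula-referenced object to the workers
# instead of relying on automatic global detection.
fit_models_parallel <- function(fit_one, specs, holidays) {
  future.apply::future_mapply(
    FUN = fit_one,
    specs,
    SIMPLIFY = FALSE,
    future.globals = list(holidays = holidays)  # named list of extra globals
  )
}
```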
It's an edge case and can be handled through a warning, but it seems like `TSLM` and `fasster`, as well as causal and econometric models, might benefit from a standardized holiday framework that is consistent across models.
Long term, I hope for a better holiday framework, likely powered by {almanac} and representing holidays as part of the data object. However, I do expect the need for non-data variables in model specifications, and so detecting and appropriately passing them through to the worker nodes is essential.
I assume that integrating `almanac` would look like a function that takes in a recurrence bundle and creates columns on the `tsibble` corresponding to the bundle, perhaps dragging things forward and backward in time?
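Something like this, perhaps (a rough sketch; `add_schedule_column()` is a hypothetical helper, and the `recur_on_*()` builder names have changed between almanac versions, so treat them as illustrative):

```r
library(almanac)
library(dplyr)

# Build a recurrence schedule for Christmas Day.
on_christmas <- yearly() %>%
  recur_on_month_of_year(12) %>%
  recur_on_day_of_month(25)

# Hypothetical helper: flag index dates that fall on the schedule as a new
# column, which models could then pick up as an ordinary regressor.
add_schedule_column <- function(.data, index, rschedule, name) {
  mutate(.data, !!name := alma_in({{ index }}, rschedule))
}

# e.g. train <- add_schedule_column(train, Date, on_christmas, "is_christmas")
```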
Perhaps, when parsing `specials`, object specifications that aren't variables in the `tsibble` need to be stored in a common place. That way they can be passed as a named list to the `future.globals` argument of `future_mapply`.
I don't know if you're using it, @mitchelloharawild, but `options(future.globals.onReference = "warning")` is helpful.