modeltime: Error with modeltime_fit_workflowset()

I am getting the following error: Error in analysis(x): object 'splits' not found

Splits and Features:

splits <- initial_time_split(
  data_final_tbl
  , prop = 0.8
  , cumulative = TRUE
)

# Features ----------------------------------------------------------------

recipe_base <- recipe(value ~ ., data = training(splits))

recipe_date <- recipe_base %>%
  step_timeseries_signature(date_col) %>%
  step_rm(matches("(iso$)|(xts$)|(hour)|(min)|(sec)|(am.pm)")) %>%
  step_normalize(contains("index.num"), contains("date_col_year"))

recipe_fourier <- recipe_date %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  step_fourier(date_col, period = 365/12, K = 1) %>%
  step_YeoJohnson(value, limits = c(0,1))

recipe_fourier_final <- recipe_fourier %>%
  step_nzv(all_predictors())

recipe_pca <- recipe_base %>%
  step_timeseries_signature(date_col) %>%
  step_rm(matches("(iso$)|(xts$)|(hour)|(min)|(sec)|(am.pm)")) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  step_normalize(value) %>%
  step_fourier(date_col, period = 365/52, K = 1) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_pca(all_numeric_predictors(), threshold = .95)

recipe_num_only <- recipe_pca %>%
  step_rm(-value, -all_numeric_predictors())

Make the model_spec and fit the workflow set:

# XGBoost -----------------------------------------------------------------

model_spec_boost <- boost_tree(
  mode  = "regression",
  mtry  = round(sqrt(ncol(training(splits)) - 1), 0),
  trees = round(sqrt(nrow(training(splits)) - 1), 0),
  min_n = round(sqrt(ncol(training(splits)) - 1), 0),
  tree_depth = round(sqrt(ncol(training(splits)) - 1), 0),
  learn_rate = 0.3,
  loss_reduction = 0.01
) %>%
  set_engine("xgboost")

wfsets <- workflow_set(
  preproc = list(
    base          = recipe_base,
    date          = recipe_date,
    fourier       = recipe_fourier,
    fourier_final = recipe_fourier_final,
    pca           = recipe_pca,
    num_only_pca  = recipe_num_only
  ),
  models = list(
    model_spec_boost
  ),
  cross = TRUE
)

parallel_start(n_cores)
wf_fits <- wfsets %>% 
  modeltime_fit_workflowset(
    data = training(splits)
    , control = control_fit_workflowset(
      allow_par = TRUE
      , verbose = TRUE
    )
  )
parallel_stop()

This gives the error:

Using existing parallel backend with 5 clusters (cores)...
 Beginning Parallel Loop | 0.005 seconds
Model 1 Error: Error in analysis(x): object 'splits' not found

Model 2 Error: Error in analysis(x): object 'splits' not found

Model 3 Error: Error in analysis(x): object 'splits' not found

Model 4 Error: Error in analysis(x): object 'splits' not found

Model 5 Error: Error in analysis(x): object 'splits' not found

Model 6 Error: Error in analysis(x): object 'splits' not found

 Finishing parallel backend. Clusters are remaining open. | 2.909 seconds
 Close clusters by running: `parallel_stop()`.
 Total time | 2.909 seconds


-- Model Failure Report ------------------------------------
# A tibble: 6 x 2
  .model_id .model
      <int> <list>
1         1 <NULL>
2         2 <NULL>
3         3 <NULL>
4         4 <NULL>
5         5 <NULL>
6         6 <NULL>

Some models failed during fitting: modeltime_fit_workflowset():
- Model 1: Is NULL.
- Model 2: Is NULL.
- Model 3: Is NULL.
- Model 4: Is NULL.
- Model 5: Is NULL.
- Model 6: Is NULL.

Action: Review any error messages.
-- End Model Failure Report --------------------------------

Yet when I do the following:

model_spec_boost <- boost_tree(
  mode  = "regression",
  mtry  = round(sqrt(ncol(training(splits)) - 1), 0),
  trees = round(sqrt(nrow(training(splits)) - 1), 0),
  min_n = round(sqrt(ncol(training(splits)) - 1), 0),
  tree_depth = round(sqrt(ncol(training(splits)) - 1), 0),
  learn_rate = 0.3,
  loss_reduction = 0.01
) %>%
  set_engine("xgboost")

wflw_fit_xgboost <- workflow() %>%
  add_recipe(recipe_num_only) %>%
  add_model(model_spec_boost) %>%
  fit(training(splits))

mdl_tbl <- modeltime_table(wflw_fit_xgboost)

calibration_tbl <- mdl_tbl %>%
  modeltime_calibrate(new_data = testing(splits))

calibration_tbl %>%
  modeltime_forecast(
    new_data = testing(splits)
    , actual_data = data_final_tbl
  ) %>%
  plot_modeltime_forecast(
    .conf_interval_show = FALSE
  )

I get a plot

Aug 05 '21 15:08 spsanderson

Hi @spsanderson ,

If you are getting this error, it is possibly because some variable in your base recipe is in a form that XGBoost does not accept. Because the other recipes build on the base recipe, the problem carries over to all of them. For example, if you have a date field or a factor, you should remove the date and convert the factor to dummy variables.

Here is an example reproducing your problem:

splits <- initial_time_split(
    m4_monthly
    , prop = 0.8
    , cumulative = TRUE
)

recipe_base_bad <- recipe(value ~ ., data = training(splits))

recipe_base_ok <- recipe(value ~ ., data = training(splits)) %>%
    step_rm(date) %>%
    step_dummy(all_nominal_predictors(), one_hot = TRUE)



model_spec_boost <- boost_tree(
    mode  = "regression",
    mtry  = round(sqrt(ncol(training(splits)) - 1), 0),
    trees = round(sqrt(nrow(training(splits)) - 1), 0),
    min_n = round(sqrt(ncol(training(splits)) - 1), 0),
    tree_depth = round(sqrt(ncol(training(splits)) - 1), 0),
    learn_rate = 0.3,
    loss_reduction = 0.01
) %>%
    set_engine("xgboost")



wfsets <- workflow_set(
    preproc = list(
        base          = recipe_base_ok  
    ),
    models = list(
        model_spec_boost
    ),
    cross = TRUE
)


wf_fits <- wfsets %>% 
    modeltime_fit_workflowset(
        data = training(splits)
        , control = control_fit_workflowset(
            allow_par = FALSE
            , verbose = TRUE
        )
    )

Hope it helps

Aug 06 '21 12:08 AlbertoAlmuinha

@AlbertoAlmuinha I expect that four of the models will fail, since they contain a date feature and some have other non-numeric features. But the recipe I am using in the second example, recipe_num_only, works on its own in a modeltime workflow and not inside modeltime_fit_workflowset(), which is confusing to me. The error itself is also confusing: it says it literally cannot find my splits object.

Aug 06 '21 13:08 spsanderson

@spsanderson It's difficult to say without a reprex to play a bit with the data. Yeah, the error description is not the best one, but that part is difficult to control...

Aug 06 '21 15:08 AlbertoAlmuinha

data_tbl.xlsx juiced_recipe.xlsx

Please see attached data to help

Aug 06 '21 15:08 spsanderson

Which recipe did you use to create the attached Excel file? It doesn't match any of the recipes in the first message.

Aug 06 '21 16:08 AlbertoAlmuinha

recipe_num_only

Aug 06 '21 16:08 spsanderson

I don't get that result with recipe_num_only. The recipe in the first message is:

recipe_num_only <- recipe_pca %>%
  step_rm(-value, -all_numeric_predictors())

In the last step you are keeping a "value" column, but there is no value column in the attached files, so I'm missing something here.

Aug 06 '21 16:08 AlbertoAlmuinha

juiced_recipe.xlsx

Sorry, here you go; the column is there now.

The data_tbl is my original data.

Aug 06 '21 16:08 spsanderson

@spsanderson @mdancho84

We definitely need to take a look at this because something is going on. It should work (and in fact, if you run it sequentially, it works correctly), but I think the parallel functionality is not working correctly for some reason. We need to check this.

For the moment, you can set allow_par = FALSE in control_fit_workflowset() to make it work.

Aug 06 '21 17:08 AlbertoAlmuinha

@AlbertoAlmuinha thanks I will try it right now and let you know

That did it.

Aug 06 '21 17:08 spsanderson

@spsanderson OK, I found the problem. Right now, when fitting in parallel, you can't define the model spec in terms of other objects in your session (splits in this case):

model_spec_boost <- boost_tree(
    mode  = "regression",
    mtry  = round(sqrt(ncol(training(splits)) - 1), 0),
    trees = round(sqrt(nrow(training(splits)) - 1), 0),
    min_n = round(sqrt(ncol(training(splits)) - 1), 0),
    tree_depth = round(sqrt(ncol(training(splits)) - 1), 0),
    learn_rate = 0.3,
    loss_reduction = 0.01
) %>%
    set_engine("xgboost")
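
A quick way to see why the spec depends on splits: as far as I know, parsnip stores model arguments as unevaluated expressions (quosures) and only evaluates them at fit time. The sketch below assumes that behavior, and that the captured arguments live in the spec's args element. It builds a spec without splits even existing, then lists the variables the mtry expression still refers to.

```r
library(parsnip)

# `splits` is deliberately NOT defined here: the spec can still be created,
# because the argument expression is captured rather than evaluated.
spec <- boost_tree(
  mode = "regression",
  mtry = round(sqrt(ncol(training(splits)) - 1), 0)
)

# Pull the captured expression out of the quosure and list the variables
# it refers to (function names are excluded by all.vars()).
mtry_expr <- rlang::quo_get_expr(spec$args$mtry)
free_vars <- all.vars(mtry_expr)

print(mtry_expr)
print(free_vars)
```

Any name that shows up in free_vars has to exist wherever the model is eventually fitted, and that includes parallel workers.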

Variables such as splits are not sent to the worker nodes where the computation is performed, so the fit fails there because it cannot find them. If you replace the expressions with literal numbers, everything works correctly:

model_spec_boost <- boost_tree(
    mode  = "regression",
    mtry  = 1,
    trees = 8,
    min_n = 1,
    tree_depth = 1,
    learn_rate = 0.3,
    loss_reduction = 0.01
) %>%
    set_engine("xgboost")
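
If you would rather keep the data-driven values than hard-code them, one option is to compute them first and splice the results into the spec with rlang's !! operator, so that the spec holds plain numbers and no reference to splits. This is only a sketch under assumptions: data_final_tbl below is a made-up stand-in for the real data, n_col and n_tree are names chosen for illustration, and it relies on parsnip arguments supporting !! injection.

```r
library(rsample)
library(parsnip)

# Hypothetical stand-in for the real data_final_tbl.
data_final_tbl <- data.frame(
  date_col = seq(as.Date("2015-01-01"), by = "month", length.out = 60),
  value    = as.numeric(1:60)
)

splits    <- initial_time_split(data_final_tbl, prop = 0.8)
train_tbl <- training(splits)

# Evaluate the settings now, in the main session, where `splits` exists.
n_col  <- round(sqrt(ncol(train_tbl) - 1), 0)
n_tree <- round(sqrt(nrow(train_tbl) - 1), 0)

# `!!` injects the computed values, so no variable lookup is needed later.
model_spec_boost <- boost_tree(
  mode           = "regression",
  mtry           = !!n_col,
  trees          = !!n_tree,
  min_n          = !!n_col,
  tree_depth     = !!n_col,
  learn_rate     = 0.3,
  loss_reduction = 0.01
)
model_spec_boost <- set_engine(model_spec_boost, "xgboost")

print(model_spec_boost)
```

Passing n_col without !! would probably not be enough, since the spec would then refer to n_col by name and the workers would not have that object either.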

Maybe we can find a way to improve this.

Regards

Aug 06 '21 18:08 AlbertoAlmuinha

Ahhh, OK. I don't think a fix for this is necessary; it would be better to compute those settings outside of the model spec.

Aug 06 '21 18:08 spsanderson

What do you think about this, @mdancho84? We could add an "export" argument to modeltime_fit_workflowset() that takes an object (or a named list if multiple objects are required) and exports it to the nodes. The implementation would be quite easy.

Or do you prefer to leave things as they are?
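
For reference, this is roughly what exporting an object to the nodes looks like with base R's parallel package. It only illustrates the mechanism and is not modeltime's implementation: the cluster is created by hand, and splits here is a hypothetical stand-in list.

```r
library(parallel)

# Hypothetical stand-in for the `splits` object from the thread.
splits <- list(n_train = 48)

cl <- makeCluster(2)

# Before exporting, the workers' global environments have no `splits`,
# so evaluating it there raises "object 'splits' not found".
before <- tryCatch(
  clusterEvalQ(cl, splits$n_train),
  error = function(e) conditionMessage(e)
)

# Copy `splits` from this session into each worker's global environment.
clusterExport(cl, varlist = "splits", envir = environment())
after <- clusterEvalQ(cl, splits$n_train)

stopCluster(cl)

print(before)
print(after)
```

An export option would presumably do something similar for the user, so that objects referenced inside a model spec are available on every worker.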

Aug 06 '21 19:08 AlbertoAlmuinha

It's a unique case but we can add an exports arg inside of the control objects. That might help.

Aug 23 '21 20:08 mdancho84
