caretEnsemble caretList produces incorrect resample data

I'm using the following code to train multiple caret models, and it looks like caretList is duplicating rows in the resample table.

fitControl3 <- trainControl(
  method='cv',
  number=5,
  savePredictions=TRUE,
  classProbs=TRUE,
  index=createResample(train_sub$target, 5),
  summaryFunction=twoClassSummary
)

model.list3 <- caretList(
  train_sub$target ~ ., 
  preProcess=NULL,
  data = train_sub,
  metric='ROC',
  trControl= fitControl3,
  tuneList=list(
    glmBoost=caretModelSpec(method='glmboost', tuneGrid=expand.grid(mstop=seq(1900, 2000, by=100),prune=c('no'))),
    glm=caretModelSpec(method='glm'),
    pls=caretModelSpec(method='pls',  tuneGrid=expand.grid(ncomp=c(20))),
    xgbtree=caretModelSpec(method='xgbTree', tuneGrid=expand.grid(eta=c(0.01), 
                                                                  max_depth=c(9), 
                                                                  nrounds=c(3000))),
    rf1=caretModelSpec(method='parRF',  ntree=100, tuneGrid=expand.grid(mtry=c(12, 14, 18)))
  )
)
save(model.list3, file='xgb_rf_glmb_glm_pls_cv_5_all.RData')

If I load the R object 'xgb_rf_glmb_glm_pls_cv_5_all.RData', here is what I see: the glmboost model has two rows per resample, while glm (like all other models in the list) has one.

> model.list3[1]$glmBoost
Boosted Generalized Linear Model 

7262 samples
 333 predictors
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 7262, 7262, 7262, 7262, 7262 
Resampling results across tuning parameters:

  mstop  ROC        Sens       Spec       ROC SD       Sens SD      Spec SD   
  1900   0.7188322  0.9542160  0.1867477  0.009241549  0.008581122  0.01179033
  2000   0.7187255  0.9536293  0.1877144  0.009214250  0.008817720  0.01259558

Tuning parameter 'prune' was held constant at a value of no
ROC was used to select the optimal model using  the largest value.
The final values used for the model were mstop = 1900 and prune = no. 
> model.list3[1]$glmBoost$resample
         ROC      Sens      Spec  Resample
1  0.7224517 0.9657869 0.1810897 Resample1
2  0.7224517 0.9657869 0.1810897 Resample1
3  0.7330123 0.9598236 0.1977671 Resample2
4  0.7330123 0.9598236 0.1977671 Resample2
5  0.7079159 0.9531936 0.1803543 Resample3
6  0.7079159 0.9531936 0.1803543 Resample3
7  0.7190849 0.9419862 0.2019386 Resample4
8  0.7190849 0.9419862 0.2019386 Resample4
9  0.7116962 0.9502896 0.1725888 Resample5
10 0.7116962 0.9502896 0.1725888 Resample5
> model.list3[2]$glm$resample
        ROC      Sens      Spec  Resample
1 0.6972258 0.9130010 0.2948718 Resample1
2 0.7146589 0.9255267 0.2599681 Resample2
3 0.6906046 0.9107752 0.2753623 Resample3
4 0.7061589 0.9006883 0.2714055 Resample4
5 0.7074587 0.9107143 0.2774958 Resample5
>

Obviously I can't run caretEnsemble with model.list3. It (understandably) gives this error:

Error in check_bestpreds_resamples(modelLibrary) : 
  Component models do not have the same re-sampling strategies
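
For anyone trying to confirm the same symptom, here is a minimal base R sketch. The two data frames are transcribed from the output above (ROC column only), and `has_duplicate_resamples` is a hypothetical helper name, not part of caret or caretEnsemble.

```r
# Resample tables transcribed from the console output above (ROC column only).
glmboost_resample <- data.frame(
  ROC = c(0.7224517, 0.7224517, 0.7330123, 0.7330123, 0.7079159,
          0.7079159, 0.7190849, 0.7190849, 0.7116962, 0.7116962),
  Resample = rep(paste0("Resample", 1:5), each = 2),
  stringsAsFactors = FALSE
)
glm_resample <- data.frame(
  ROC = c(0.6972258, 0.7146589, 0.6906046, 0.7061589, 0.7074587),
  Resample = paste0("Resample", 1:5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when any resample label occurs more than once.
has_duplicate_resamples <- function(resample_df) {
  anyDuplicated(resample_df$Resample) > 0
}

has_duplicate_resamples(glmboost_resample)  # TRUE
has_duplicate_resamples(glm_resample)       # FALSE

# The extra rows are exact copies: dropping duplicates leaves one row per label.
nrow(unique(glmboost_resample))  # 5
```

On a fitted list, the same helper could be applied to each model's `$resample` element. This only inspects the table; it is not a fix for the error.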

Sep 12 '15 20:09 farbodr

Can you try the same procedure with the same data, but use separate X and Y vectors instead of the formula interface? Looking at this I have a suspicion it is something with the formula interface, which we admittedly don't test very well in our unit tests.
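
A rough sketch of what the X/Y call could look like, under a few assumptions: the toy `train_sub` below is made up (the original data was not posted), the model list is cut down to `glm` and `rpart` to keep it quick, and `x`/`y` are assumed to be forwarded by `caretList` to `caret::train`. It also uses `createFolds(..., returnTrain = TRUE)` so the indices are true CV folds; the original used `createResample`, which draws bootstrap samples.

```r
# Hypothetical toy stand-in for train_sub; the original data was not posted.
set.seed(42)
n <- 200
train_sub <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
train_sub$target <- factor(
  ifelse(train_sub$x1 + rnorm(n) > 0, "yes", "no"),
  levels = c("no", "yes")
)

# Separate predictors (X) from the outcome (Y) instead of using a formula.
X <- train_sub[, setdiff(names(train_sub), "target"), drop = FALSE]
Y <- train_sub$target

# Only attempt the fit when the packages are available.
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("caretEnsemble", quietly = TRUE)) {
  library(caret)
  library(caretEnsemble)

  ctrl <- trainControl(
    method = "cv",
    number = 5,
    savePredictions = "final",
    classProbs = TRUE,
    index = createFolds(Y, k = 5, returnTrain = TRUE),
    summaryFunction = twoClassSummary
  )

  # Assumption: x and y are passed through to caret::train.
  models <- caretList(
    x = X, y = Y,
    metric = "ROC",
    trControl = ctrl,
    methodList = c("glm", "rpart")
  )

  # Each model is expected to report one row per fold.
  print(lapply(models, function(m) table(m$resample$Resample)))
}
```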

Sep 16 '15 13:09 jknowles

Good catch @jknowles. @farbodr the formula interface is really sub-optimal. Try the X/Y interface instead.

Sep 16 '15 13:09 zachmayer

@farbodr Does this issue occur if you use the X/Y interface and caretEnsemble 2.0.0 from CRAN?

Feb 16 '16 17:02 zachmayer

I haven't but will give it a try this weekend.

Feb 16 '16 17:02 farbodr

I couldn't find my original example, so I used another one, and X/Y still produces the same error. The interesting thing is that if I remove glmboost from the model list, the problem goes away. I can put something together with a smaller data set and post it here if that helps.

Feb 28 '16 20:02 farbodr

Try a caret::train model on your data, using method='glmboost'.

I've had problems with that model in the past.
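
Something along these lines might isolate it. This is only a sketch on made-up toy data, with small `mstop` values to keep it fast; `resample_counts` is a hypothetical helper, and the fit runs only if caret and mboost are installed.

```r
# Hypothetical helper: count how many rows each resample label has.
resample_counts <- function(fit) {
  table(fit$resample$Resample)
}

# Hypothetical toy data; swap in your own X and Y.
set.seed(1)
n <- 200
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
Y <- factor(ifelse(X$x1 + rnorm(n) > 0, "yes", "no"), levels = c("no", "yes"))

if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("mboost", quietly = TRUE)) {
  library(caret)

  ctrl <- trainControl(
    method = "cv",
    number = 5,
    classProbs = TRUE,
    summaryFunction = twoClassSummary
  )

  fit <- train(
    x = X, y = Y,
    method = "glmboost",
    metric = "ROC",
    trControl = ctrl,
    tuneGrid = expand.grid(mstop = c(50, 100), prune = "no")
  )

  # More than one row per label would reproduce the duplication outside caretList.
  print(resample_counts(fit))
}
```

If the counts come back as 2 per resample here, the duplication would be coming from the glmboost fit in caret rather than from caretList.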

Feb 28 '16 22:02 zachmayer

I am also getting this bug, but switching to X/Y instead of the formula interface breaks random forest with the error: Error in predict.randomForest(modelFit, newdata, type = "prob") : missing values in newdata.

caretEnsemble running only an rf model works fine through the formula interface.

Mar 10 '16 00:03 JasonCEC

Run anyNA(X) and anyNA(Y)
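
A small base R sketch of that check, on made-up data with one missing value planted in `x2` (the names `X`, `Y`, `X_clean`, and `Y_clean` are placeholders):

```r
# Hypothetical predictors and outcome with one missing value planted in x2.
X <- data.frame(x1 = c(1.2, 0.4, 3.1), x2 = c(0.5, NA, 2.2))
Y <- factor(c("no", "yes", "no"))

anyNA(X)  # TRUE
anyNA(Y)  # FALSE

# Count missing values per column to find the offending predictors.
na_per_column <- colSums(is.na(X))
na_per_column[na_per_column > 0]

# Keep only rows that are complete in X and have a non-missing outcome.
keep <- complete.cases(X) & !is.na(Y)
X_clean <- X[keep, , drop = FALSE]
Y_clean <- Y[keep]
```

Dropping incomplete rows is only one option; whether it is appropriate depends on the data.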

Mar 10 '16 00:03 zachmayer

I am facing the same issue, any updates?

Jan 15 '18 04:01 jashshah
