caretEnsemble
caretList produces incorrect resample data
I'm using the following code to train multiple caret models, and it looks like caretList is duplicating rows in the resample data.
fitControl3 <- trainControl(
  method='cv',
  number=5,
  savePredictions=TRUE,
  classProbs=TRUE,
  index=createResample(train_sub$target, 5),
  summaryFunction=twoClassSummary
)
model.list3 <- caretList(
  train_sub$target ~ .,
  preProcess=NULL,
  data = train_sub,
  metric='ROC',
  trControl= fitControl3,
  tuneList=list(
    glmBoost=caretModelSpec(method='glmboost', tuneGrid=expand.grid(mstop=seq(1900, 2000, by=100),prune=c('no'))),
    glm=caretModelSpec(method='glm'),
    pls=caretModelSpec(method='pls', tuneGrid=expand.grid(ncomp=c(20))),
    xgbtree=caretModelSpec(method='xgbTree', tuneGrid=expand.grid(eta=c(0.01),
                                                                  max_depth=c(9),
                                                                  nrounds=c(3000))),
    rf1=caretModelSpec(method='parRF', ntree=100, tuneGrid=expand.grid(mtry=c(12, 14, 18)))
  )
)
save(model.list3, file='xgb_rf_glmb_glm_pls_cv_5_all.RData')
If I load the R object 'xgb_rf_glmb_glm_pls_cv_5_all.RData', here is what I see for the glmboost model versus glm (and all the other models in the list):
> model.list3[1]$glmBoost
Boosted Generalized Linear Model
7262 samples
333 predictors
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 7262, 7262, 7262, 7262, 7262
Resampling results across tuning parameters:
mstop ROC Sens Spec ROC SD Sens SD Spec SD
1900 0.7188322 0.9542160 0.1867477 0.009241549 0.008581122 0.01179033
2000 0.7187255 0.9536293 0.1877144 0.009214250 0.008817720 0.01259558
Tuning parameter 'prune' was held constant at a value of no
ROC was used to select the optimal model using the largest value.
The final values used for the model were mstop = 1900 and prune = no.
> model.list3[1]$glmBoost$resample
ROC Sens Spec Resample
1 0.7224517 0.9657869 0.1810897 Resample1
2 0.7224517 0.9657869 0.1810897 Resample1
3 0.7330123 0.9598236 0.1977671 Resample2
4 0.7330123 0.9598236 0.1977671 Resample2
5 0.7079159 0.9531936 0.1803543 Resample3
6 0.7079159 0.9531936 0.1803543 Resample3
7 0.7190849 0.9419862 0.2019386 Resample4
8 0.7190849 0.9419862 0.2019386 Resample4
9 0.7116962 0.9502896 0.1725888 Resample5
10 0.7116962 0.9502896 0.1725888 Resample5
> model.list3[2]$glm$resample
ROC Sens Spec Resample
1 0.6972258 0.9130010 0.2948718 Resample1
2 0.7146589 0.9255267 0.2599681 Resample2
3 0.6906046 0.9107752 0.2753623 Resample3
4 0.7061589 0.9006883 0.2714055 Resample4
5 0.7074587 0.9107143 0.2774958 Resample5
>
Obviously I can't run the caretEnsemble method with model.list3. It (understandably) gives this error:
Error in check_bestpreds_resamples(modelLibrary) :
Component models do not have the same re-sampling strategies
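For anyone wanting to confirm the symptom before calling caretEnsemble, here is a minimal sketch of a check for repeated resample labels. `has_duplicated_resamples` is a hypothetical helper (it is not part of caret or caretEnsemble), and `mock_list` is stand-in data shaped like the two printouts above, not the real models.

```r
# Hypothetical helper: flags models whose $resample table repeats a label.
# Assumes each element carries a $resample data frame with a Resample column.
has_duplicated_resamples <- function(model_list) {
  vapply(
    model_list,
    function(m) anyDuplicated(m$resample$Resample) > 0,
    logical(1)
  )
}

# Stand-in objects mimicking the printouts above (ROC values abbreviated).
mock_list <- list(
  glmBoost = list(resample = data.frame(
    ROC = rep(c(0.72, 0.73, 0.71, 0.72, 0.71), each = 2),
    Resample = rep(paste0("Resample", 1:5), each = 2)
  )),
  glm = list(resample = data.frame(
    ROC = c(0.70, 0.71, 0.69, 0.71, 0.71),
    Resample = paste0("Resample", 1:5)
  ))
)

dup_flags <- has_duplicated_resamples(mock_list)
print(dup_flags)
```

On the real object the call would presumably be `has_duplicated_resamples(model.list3)`, assuming every fitted model exposes `$resample` the way the printouts show.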
Can you try the same procedure with the same data, but use separate X and Y vectors instead of the formula interface? Looking at this I have a suspicion it is something with the formula interface, which we admittedly don't test very well in our unit tests.
Good catch @jknowles. @farbodr the formula interface is really sub-optimal. Try the X/Y interface instead.
@farbodr Does this issue occur if you use the X/Y interface and caretEnsemble 2.0.0 from CRAN?
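A rough sketch of what the X/Y call could look like, in case it helps. The `iris` subset is stand-in data (the `train_sub` data from the report is not available), the model list is cut down to `glm` and `rpart` to keep it light, and the fitting step only runs if the listed packages are installed. It relies on `caretList` forwarding `x` and `y` to `caret::train`, which is my understanding of the interface rather than something confirmed in this thread.

```r
# Stand-in two-class data; replace with your own data frame.
df <- droplevels(subset(iris, Species != "setosa"))
names(df)[names(df) == "Species"] <- "target"

# Split predictors and outcome instead of passing a formula.
X <- df[, setdiff(names(df), "target"), drop = FALSE]
Y <- df$target

# Only fit if everything needed is installed (pROC is checked because
# twoClassSummary is assumed to need it for the ROC metric).
pkgs <- c("caret", "caretEnsemble", "pROC", "rpart")
if (all(vapply(pkgs, requireNamespace, logical(1), quietly = TRUE))) {
  set.seed(1)
  ctrl <- caret::trainControl(
    method = "cv",
    number = 5,
    savePredictions = "final",
    classProbs = TRUE,
    index = caret::createResample(Y, 5),  # same index call as the report
    summaryFunction = caret::twoClassSummary
  )
  models <- caretEnsemble::caretList(
    x = X, y = Y,
    metric = "ROC",
    trControl = ctrl,
    methodList = c("glm", "rpart")
  )
  # With consistent resampling, every model should show the same row count.
  print(lapply(models, function(m) nrow(m$resample)))
}
```

If the row counts differ between models, that would point at the same mismatch that `check_bestpreds_resamples` complains about.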
I haven't but will give it a try this weekend.
I couldn't find my original example, so I used another one, and X/Y still produces the same error. The interesting thing is that if I remove glmboost from the model list, the problem goes away. I can put something together with a smaller data set and post it here if that helps.
Try a caret::train model on your data, using method='glmboost'.
I've had problems with that model in the past.
I am also getting this bug, but switching to X/Y instead of the formula interface breaks random forest with the error: Error in predict.randomForest(modelFit, newdata, type = "prob") : missing values in newdata.
caretEnsemble running only an rf model works fine through the formula interface.
Run anyNA(X) and anyNA(Y)
(Reply by email to Jason Cohen's comment above, Mar 9, 2016, 7:15 PM.)
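To expand a little on the anyNA suggestion: a small base-R sketch that also shows which columns hold the missing values and how the incomplete rows could be dropped before training. `X` and `Y` here are made-up stand-ins, and dropping rows is just one option (imputation may suit your data better).

```r
# Stand-in predictors and outcome with a few missing values injected.
X <- data.frame(a = c(1, 2, NA, 4), b = c(0.5, NA, 1.5, 2.5), c = 1:4)
Y <- factor(c("no", "yes", "no", NA), levels = c("no", "yes"))

# The quick checks suggested above.
print(anyNA(X))
print(anyNA(Y))

# Which predictor columns contain NAs, and how many.
na_per_column <- colSums(is.na(X))
print(na_per_column[na_per_column > 0])

# Keep only rows where every predictor and the outcome are present.
complete <- complete.cases(X) & !is.na(Y)
X_complete <- X[complete, , drop = FALSE]
Y_complete <- Y[complete]
```

`X_complete` and `Y_complete` could then be passed to the X/Y interface in place of the original objects.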
I am facing the same issue, any updates?