caretEnsemble
caretEnsemble copied to clipboard
Multi-class classification greedy optimization
i see that branch Dev has some more progress regarding multi-class classification ensemble stacking but unfortunately it is not yet done. do you plan on implementing this and/or could you point me in the right direction so i might be able to finish it? I don't seem to understand what the problem/holdup is (no offense intended)
Currently jknowles is leading the charge on this. If you would like to add code, please feel free to submit pull requests on the dev branch.
@thomaskern you can probably start by modifying the package to have a 3rd type of model type to identify -- "multiclass" or something like that. This would require modifying the checkModels... function in the utilities.R
After you do that, you just need to add methods for the "multiclass" type to the stack, ensemble, and predict functions and write some unit tests to make sure it is performing as you like.
What optimization function would you use in the multi-class classification setting?
@jknowles I am less worried about the R code and more worried about the optimization function ;) I have no OF in mind and was hoping for some pointers. Have you given this any thought by chance?
I have not. That's the reason I skipped multiclass models for now. It seemed non-trival to write a greedy selection function for them.
I think you should still be able to use another caret model for the ensembling though.
On Fri, Feb 28, 2014 at 2:56 PM, Thomas [email protected] wrote:
@jknowles https://github.com/jknowles I am less worried about the R code and more worried about the optimization function ;) I have no OF in mind and was hoping for some pointers. Have you given this any thought by chance?
Reply to this email directly or view it on GitHubhttps://github.com/zachmayer/caretEnsemble/issues/8#issuecomment-36388211 .
This is also post 1.0. Once we have a working, stable release on CRAN, we'll figure out the multi-class RMSE and AUC optimization functions.
We also need to add multiclass support to caretList and caretStack (caretStack might be easier to do), etc.
Is there any quick way to hack the multiclass classification with greedy optimization? I am participating a kaggle competition...
The quickest way would be to use caretStack with glmnet as the stacker.
In general, I find myself using caretStack more than caretList. It's a lot more flexible, and if you use method='glm' you'll get almost the exact same results as with a greedy ensemble.
@ajing The real key is the optimization function. I haven't even thought of a hack that will work here.... Once someone's written the multiclass optimization function work, we can work on multiclass greedy ensembling.
In methodList=c('rpart', 'nnet'), could I specify a new model as a variable? Because, for xgboost, caret hasn't include that yet, but they can specific a variable modelInfo for xgboost for train function. Could I do something like methodList = c(modelInfo, 'nnet')?
We don't support that yet, but it's definitely on the to-do list.
For now, you can fit the model separately with a call to caret::train, and then add it to the list. To quote from the vignette (please do read it!):
Finally, you should note that caretList does not support custom caret models. Fitting those models are beyond the scope of this vignette, but if you do so, you can manually add them to the model list (e.g.
model_list_big[['my_custom_model']] <- my_custom_model). Just be sure to use the same re-sampling indexes in trControl as you use in the caretList models!
Sorry to ask the question again... For multiclass classification, is that possible to build the ensemble model? I got the error for caretStack with glmnet also...
stack_model <- caretStack(model_list, method='glmnet')
Error in check_caretList_model_types(list_of_models) : Not yet implemented for multiclass problems
Darn. I thought I'd fixed that.
Ok, multi class for caretStack should be easier implement than for caretEnsemble. I'll take a look.
No promises on when I can get to it!
For adding a new customized model, In the example: "Just be sure to use the same re-sampling indexes in trControl as you use in the caretList models!"
What portion I need to change in modelInfo to use the same resampling indexes?
I use the modelInfo of xgboost provided by topepo.
Lets say x is your other model (the one you're ensembling xgboost with):
new_train_control <- trainControl(index=x$control$index, indexOut=x$control$indexOut)
Use the new control to fit your xgboost model.
Thanks! I was thinking about still using caretEnsemble framework. Can I do something like:
my_control <- trainControl(
method='boot',
number=25,
classProbs=TRUE,
index=createResample(training$Class, 25),
summaryFunction=twoClassSummary
)
model_list <- caretList(
Class~., data=training,
trControl=my_control,
tuneList=list(
rf1=caretModelSpec(method='rf', tuneGrid=data.frame(.mtry=2)),
xgb = caretModelSpec(method=modelInfo, tuneLength = 3)
)
As explained in the vignette: "you should note that caretList does not support custom caret models"
You have to fit the second model with a second call to train. E.g.:
my_control <- trainControl(
method='boot',
number=25,
classProbs=TRUE,
index=createResample(training$Class, 25),
summaryFunction=twoClassSummary
)
model_list <- caretList(
Class~., data=training,
trControl=my_control,
tuneList=list(
rf1=caretModelSpec(method='rf', tuneGrid=data.frame(.mtry=2))
)
new_model <- train(
Class~.,
data=training,
trControl=my_control,
method=modelInfo,
tuneLength = 3)
model_list$xgb <- new_model
Thanks for the code! Yes, it works now.
Good! Glad to help
Sorry to bother again.. I got the following error, because some assumption about method variable..
model_list$xgb <- new_model greedy_ensemble <- caretEnsemble(model_list) Error in paste("Model '", method, "' is not in the ", "set of existing models", : object 'method' not found
crap. that's a bug. If you can make a simple test case, I'll add it to the queue to fix.
This could be a test case:
training = data.frame(matrix(runif(120),nrow=20,ncol=6), Class = factor(sample(c("Yes", "No"), 20, replace=T)))
my_control <- trainControl(
method='boot',
number=5,
savePredictions=TRUE,
classProbs=TRUE,
index=createResample(training$Class, 5),
summaryFunction=twoClassSummary
)
model_list <- caretList(
Class~., data=training,
trControl=my_control,
methodList=c('glm', 'rpart')
)
new_model <- train(Class~., data=training, trControl=my_control, method=modelInfo, metric = "ROC", tuneGrid = expand.grid(eta = 0.01,max_depth = 5, nrounds=100))
model_list$xgb <- new_model
greedy_ensemble <- caretEnsemble(model_list)
modelInfo Is
modelInfo <- list(label = "eXtreme Gradient Boosting",
library = c("xgboost", "plyr"),
type = c("Regression", "Classification"),
parameters = data.frame(parameter = c('nrounds', 'max_depth', 'eta'),
class = rep("numeric", 3),
label = c('# Boosting Iterations', 'Max Tree Depth',
'Shrinkage')),
grid = function(x, y, len = NULL) expand.grid(max_depth = seq(1, len),
nrounds = floor((1:len) * 50),
eta = .3),
loop = function(grid) {
loop <- ddply(grid, c("eta", "max_depth"),
function(x) c(nrounds = max(x$nrounds)))
submodels <- vector(mode = "list", length = nrow(loop))
for(i in seq(along = loop$nrounds)) {
index <- which(grid$max_depth == loop$max_depth[i] &
grid$eta == loop$eta[i])
trees <- grid[index, "nrounds"]
submodels[[i]] <- data.frame(nrounds = trees[trees != loop$nrounds[i]])
}
list(loop = loop, submodels = submodels)
},
fit = function(x, y, wts, param, lev, last, classProbs, ...) {
if(is.factor(y)) {
if(length(lev) == 2) {
y <- ifelse(y == lev[1], 1, 0)
dat <- xgb.DMatrix(as.matrix(x), label = y)
out <- xgb.train(list(eta = param$eta,
max_depth = param$max_depth),
data = dat,
nrounds = param$nrounds,
objective = "binary:logistic",
...)
} else {
y <- as.numeric(y) - 1
dat <- xgb.DMatrix(as.matrix(x), label = y)
out <- xgb.train(list(eta = param$eta,
max_depth = param$max_depth),
data = dat,
num_class = length(lev),
nrounds = param$nrounds,
objective = "multi:softprob",
...)
}
} else {
dat <- xgb.DMatrix(as.matrix(x), label = y)
out <- xgb.train(list(eta = param$eta,
max_depth = param$max_depth),
data = dat,
nrounds = param$nrounds,
objective = "reg:linear",
...)
}
out
},
predict = function(modelFit, newdata, submodels = NULL) {
newdata <- xgb.DMatrix(as.matrix(newdata))
out <- predict(modelFit, newdata)
if(modelFit$problemType == "Classification") {
if(length(modelFit$obsLevels) == 2) {
out <- ifelse(out >= .5,
modelFit$obsLevels[1],
modelFit$obsLevels[2])
} else {
out <- matrix(out, ncol = length(modelFit$obsLevels), byrow = TRUE)
out <- modelFit$obsLevels[apply(out, 1, which.max)]
}
}
if(!is.null(submodels)) {
tmp <- vector(mode = "list", length = nrow(submodels) + 1)
tmp[[1]] <- out
for(j in seq(along = submodels$nrounds)) {
tmp_pred <- predict(modelFit, newdata, ntreelimit = submodels$nrounds[j])
if(modelFit$problemType == "Classification") {
if(length(modelFit$obsLevels) == 2) {
tmp_pred <- ifelse(tmp_pred >= .5,
modelFit$obsLevels[1],
modelFit$obsLevels[2])
} else {
tmp_pred <- matrix(tmp_pred, ncol = length(modelFit$obsLevels), byrow = TRUE)
tmp_pred <- modelFit$obsLevels[apply(tmp_pred, 1, which.max)]
}
}
tmp[[j+1]] <- tmp_pred
}
out <- tmp
}
out
},
prob = function(modelFit, newdata, submodels = NULL) {
newdata <- xgb.DMatrix(as.matrix(newdata))
out <- predict(modelFit, newdata)
if(length(modelFit$obsLevels) == 2) {
out <- cbind(out, 1 - out)
colnames(out) <- modelFit$obsLevels
} else {
out <- matrix(out, ncol = length(modelFit$obsLevels), byrow = TRUE)
colnames(out) <- modelFit$obsLevels
}
out <- as.data.frame(out)
if(!is.null(submodels)) {
tmp <- vector(mode = "list", length = nrow(submodels) + 1)
tmp[[1]] <- out
for(j in seq(along = submodels$nrounds)) {
tmp_pred <- predict(modelFit, newdata, ntreelimit = submodels$nrounds[j])
if(length(modelFit$obsLevels) == 2) {
tmp_pred <- cbind(tmp_pred, 1 - tmp_pred)
colnames(tmp_pred) <- modelFit$obsLevels
} else {
tmp_pred <- matrix(tmp_pred, ncol = length(modelFit$obsLevels), byrow = TRUE)
colnames(tmp_pred) <- modelFit$obsLevels
}
tmp_pred <- as.data.frame(tmp_pred)
tmp[[j+1]] <- tmp_pred
}
out <- tmp
}
out
},
predictors = function(x, ...) {
imp <- xgb.importance(x$xNames, model = x)
x$xNames[x$xNames %in% imp$Feature]
},
varImp = function(object, numTrees = NULL, ...) {
imp <- xgb.importance(x$xNames, model = x)
imp <- as.data.frame(imp)[, 1:2]
rownames(imp) <- as.character(imp[,1])
imp <- imp[,2,drop = FALSE]
colnames(imp) <- "Overall"
imp
},
levels = function(x) x$obsLevels,
tags = c("Tree-Based Model", "Boosting", "Ensemble Model", "Implicit Feature Selection"),
sort = function(x) {
# This is a toss-up, but the # trees probably adds
# complexity faster than number of splits
x[order(x$nrounds, x$max_depth, x$eta),]
})
Is there any recommended way to add a new caret model to existing model_list?
Like, I have a model_list, and I want to add a new model:
I can do
model9 <- train(X[train,], Y[train], method='gam', trControl=myControl)
Then,
model_list$gam = model9
Is this right?
Yup!
@ajing I had the same issue. I have a dirty workaround that works for me. The function 'makePredObsMatrix' in the caretEnsemble package is throwing the error. Below is the source code of the function with the initial two lines commented (these are just checks. I assume if you know what you're doing, uncommenting this is fine - perhaps @zachmayer can throw more light). I sourced the modified function and added the following lines to my script. It worked. Hope it helps.
source('/media/3TB/kag/caretEnsemble-master/R/helper_functions.R')
assignInNamespace("makePredObsMatrix",makePredObsMatrix, ns="caretEnsemble")
####### ######modified the function in helper_function.R file from caretEnsemble repo
makePredObsMatrix <- function(list_of_models){
#caretList Checks
#check_caretList_classes(list_of_models)
#check_caretList_model_types(list_of_models)
#Make a list of models
modelLibrary <- extractBestPreds(list_of_models)
model_names <- names(modelLibrary)
#Model library checks
check_bestpreds_resamples(modelLibrary) #Re-write with data.table?
check_bestpreds_indexes(modelLibrary) #Re-write with data.table?
check_bestpreds_obs(modelLibrary) #Re-write with data.table?
check_bestpreds_preds(modelLibrary) #Re-write with data.table?
#Extract model type (class or reg)
type <- extractModelTypes(list_of_models)
#Add names column
for(i in seq_along(modelLibrary)){
set(modelLibrary[[i]], j="modelname", value=names(modelLibrary)[[i]])
}
#Remove parameter columns
keep <- Reduce(intersect, lapply(modelLibrary, names))
for(i in seq_along(modelLibrary)){
rem <- setdiff(names(modelLibrary[[i]]), keep)
if(length(rem) > 0){
for(r in rem){
set(modelLibrary[[i]], j=r, value=NULL)
}
}
}
modelLibrary <- rbindlist(modelLibrary, fill=TRUE)
#For classification models that produce probs, use the probs as preds
#Otherwise, just use class predictions
if (type=="Classification"){
positive <- as.character(unique(modelLibrary$obs)[2]) #IMPROVE THIS!
pos <- as.numeric(modelLibrary[[positive]])
good_pos_values <- which(is.finite(pos))
set(modelLibrary, j="pred", value=as.numeric(modelLibrary[["pred"]]))
set(modelLibrary, i=good_pos_values, j="pred", value=modelLibrary[good_pos_values,positive,with=FALSE])
}
#Reshape wide for meta-modeling
modelLibrary <- data.table::dcast.data.table(
modelLibrary,
obs + rowIndex + Resample ~ modelname,
value.var = "pred"
)
#Return
return(list(obs=modelLibrary$obs, preds=as.matrix(modelLibrary[,model_names,with=FALSE]), type=type))
}
@tsuresh83 Could you make a PR on github with the new version of your function so I can see what you changed? If the fix looks good to me and @jknowles I'll merge it into master.
@zachmayer I am afraid this is not a fix. From what I've seen the actual fix is the addition of an entry to the list of models in 'models.RData' from the caret package. What I suggested is only a stop gap measure until the actual fix.
Has there been any progress or workarounds developed for using caretStack for multi-class classfication?
Nope, but I welcome pull requests!
At the moment we're in the middle of some code cleanup and bug fixes, which will probably be followed by a CRAN release. Then I'll figure out how to get caretList and caretStack to work with multiclass.
Getting caretEnsemble to work with multiclass will be harder.
Thanks for the quick response @zachmayer! By the way, what is your reference material for the implementation of Stacking? Are you using "Elements of Statistical Learning" or something?