caretEnsemble icon indicating copy to clipboard operation
caretEnsemble copied to clipboard

Multi-class classification greedy optimization

Open thomaskern opened this issue 11 years ago • 61 comments
trafficstars

i see that branch Dev has some more progress regarding multi-class classification ensemble stacking but unfortunately it is not yet done. do you plan on implementing this and/or could you point me in the right direction so i might be able to finish it? I don't seem to understand what the problem/holdup is (no offense intended)

thomaskern avatar Feb 27 '14 23:02 thomaskern

Currently jknowles is leading the charge on this. If you would like to add code, please feel free to submit pull requests on the dev branch.

zachmayer avatar Feb 28 '14 16:02 zachmayer

@thomaskern you can probably start by modifying the package to have a 3rd type of model type to identify -- "multiclass" or something like that. This would require modifying the checkModels... function in the utilities.R

After you do that, you just need to add methods for the "multiclass" type to the stack, ensemble, and predict functions and write some unit tests to make sure it is performing as you like.

What optimization function would you use in the multi-class classification setting?

jknowles avatar Feb 28 '14 16:02 jknowles

@jknowles I am less worried about the R code and more worried about the optimization function ;) I have no OF in mind and was hoping for some pointers. Have you given this any thought by chance?

thomaskern avatar Feb 28 '14 19:02 thomaskern

I have not. That's the reason I skipped multiclass models for now. It seemed non-trival to write a greedy selection function for them.

I think you should still be able to use another caret model for the ensembling though.

On Fri, Feb 28, 2014 at 2:56 PM, Thomas [email protected] wrote:

@jknowles https://github.com/jknowles I am less worried about the R code and more worried about the optimization function ;) I have no OF in mind and was hoping for some pointers. Have you given this any thought by chance?

Reply to this email directly or view it on GitHubhttps://github.com/zachmayer/caretEnsemble/issues/8#issuecomment-36388211 .

zachmayer avatar Feb 28 '14 19:02 zachmayer

This is also post 1.0. Once we have a working, stable release on CRAN, we'll figure out the multi-class RMSE and AUC optimization functions.

zachmayer avatar Nov 13 '14 20:11 zachmayer

We also need to add multiclass support to caretList and caretStack (caretStack might be easier to do), etc.

zachmayer avatar Dec 12 '14 15:12 zachmayer

Is there any quick way to hack the multiclass classification with greedy optimization? I am participating a kaggle competition...

ajing avatar May 17 '15 15:05 ajing

The quickest way would be to use caretStack with glmnet as the stacker.

In general, I find myself using caretStack more than caretList. It's a lot more flexible, and if you use method='glm' you'll get almost the exact same results as with a greedy ensemble.

Take a look at the vignette for some examples

zachmayer avatar May 19 '15 19:05 zachmayer

@ajing The real key is the optimization function. I haven't even thought of a hack that will work here.... Once someone's written the multiclass optimization function work, we can work on multiclass greedy ensembling.

zachmayer avatar May 19 '15 19:05 zachmayer

In methodList=c('rpart', 'nnet'), could I specify a new model as a variable? Because, for xgboost, caret hasn't include that yet, but they can specific a variable modelInfo for xgboost for train function. Could I do something like methodList = c(modelInfo, 'nnet')?

ajing avatar May 21 '15 20:05 ajing

We don't support that yet, but it's definitely on the to-do list.

For now, you can fit the model separately with a call to caret::train, and then add it to the list. To quote from the vignette (please do read it!):

Finally, you should note that caretList does not support custom caret models. Fitting those models are beyond the scope of this vignette, but if you do so, you can manually add them to the model list (e.g. model_list_big[['my_custom_model']] <- my_custom_model). Just be sure to use the same re-sampling indexes in trControl as you use in the caretList models!

zachmayer avatar May 21 '15 20:05 zachmayer

Sorry to ask the question again... For multiclass classification, is that possible to build the ensemble model? I got the error for caretStack with glmnet also...

stack_model <- caretStack(model_list, method='glmnet')
Error in check_caretList_model_types(list_of_models) : Not yet implemented for multiclass problems

ajing avatar May 21 '15 23:05 ajing

Darn.  I thought I'd fixed that.

Ok, multi class for caretStack should be easier implement than for caretEnsemble.  I'll take a look.

No promises on when I can get to it!

zachmayer avatar May 22 '15 00:05 zachmayer

For adding a new customized model, In the example: "Just be sure to use the same re-sampling indexes in trControl as you use in the caretList models!"

What portion I need to change in modelInfo to use the same resampling indexes?

I use the modelInfo of xgboost provided by topepo.

ajing avatar Jul 24 '15 14:07 ajing

Lets say x is your other model (the one you're ensembling xgboost with):

new_train_control <- trainControl(index=x$control$index, indexOut=x$control$indexOut)

Use the new control to fit your xgboost model.

zachmayer avatar Jul 24 '15 15:07 zachmayer

Thanks! I was thinking about still using caretEnsemble framework. Can I do something like:

my_control <- trainControl(
  method='boot',
  number=25,
  classProbs=TRUE,
  index=createResample(training$Class, 25),
  summaryFunction=twoClassSummary
)

model_list <- caretList(
  Class~., data=training,
  trControl=my_control,
  tuneList=list(
    rf1=caretModelSpec(method='rf', tuneGrid=data.frame(.mtry=2)),
    xgb = caretModelSpec(method=modelInfo, tuneLength = 3)
  )

ajing avatar Jul 24 '15 15:07 ajing

As explained in the vignette: "you should note that caretList does not support custom caret models"

You have to fit the second model with a second call to train. E.g.:

my_control <- trainControl(
  method='boot',
  number=25,
  classProbs=TRUE,
  index=createResample(training$Class, 25),
  summaryFunction=twoClassSummary
)

model_list <- caretList(
  Class~., data=training,
  trControl=my_control,
  tuneList=list(
    rf1=caretModelSpec(method='rf', tuneGrid=data.frame(.mtry=2))
  )

new_model <- train(
  Class~., 
  data=training, 
  trControl=my_control, 
  method=modelInfo, 
  tuneLength = 3)

model_list$xgb <- new_model

zachmayer avatar Jul 24 '15 15:07 zachmayer

Thanks for the code! Yes, it works now.

ajing avatar Jul 24 '15 17:07 ajing

Good! Glad to help

zachmayer avatar Jul 24 '15 17:07 zachmayer

Sorry to bother again.. I got the following error, because some assumption about method variable..

model_list$xgb <- new_model greedy_ensemble <- caretEnsemble(model_list) Error in paste("Model '", method, "' is not in the ", "set of existing models", : object 'method' not found

ajing avatar Jul 24 '15 18:07 ajing

crap. that's a bug. If you can make a simple test case, I'll add it to the queue to fix.

zachmayer avatar Jul 24 '15 18:07 zachmayer

This could be a test case:

training = data.frame(matrix(runif(120),nrow=20,ncol=6), Class = factor(sample(c("Yes", "No"), 20, replace=T)))
my_control <- trainControl(
  method='boot',
  number=5,
  savePredictions=TRUE,
  classProbs=TRUE,
  index=createResample(training$Class, 5),
  summaryFunction=twoClassSummary
)
model_list <- caretList(
  Class~., data=training,
  trControl=my_control,
  methodList=c('glm', 'rpart')
)
new_model <- train(Class~., data=training, trControl=my_control, method=modelInfo, metric = "ROC", tuneGrid = expand.grid(eta = 0.01,max_depth = 5, nrounds=100))
model_list$xgb <- new_model

greedy_ensemble <- caretEnsemble(model_list)

modelInfo Is

modelInfo <- list(label = "eXtreme Gradient Boosting",
                  library = c("xgboost", "plyr"),
                  type = c("Regression", "Classification"),
                  parameters = data.frame(parameter = c('nrounds', 'max_depth', 'eta'),
                                          class = rep("numeric", 3),
                                          label = c('# Boosting Iterations', 'Max Tree Depth', 
                                                    'Shrinkage')),
                  grid = function(x, y, len = NULL) expand.grid(max_depth = seq(1, len),
                                                                nrounds = floor((1:len) * 50),
                                                                eta = .3),
                  loop = function(grid) {     
                    loop <- ddply(grid, c("eta", "max_depth"),
                                  function(x) c(nrounds = max(x$nrounds)))
                    submodels <- vector(mode = "list", length = nrow(loop))
                    for(i in seq(along = loop$nrounds)) {
                      index <- which(grid$max_depth == loop$max_depth[i] & 
                                       grid$eta == loop$eta[i])
                      trees <- grid[index, "nrounds"] 
                      submodels[[i]] <- data.frame(nrounds = trees[trees != loop$nrounds[i]])
                    }    
                    list(loop = loop, submodels = submodels)
                  },
                  fit = function(x, y, wts, param, lev, last, classProbs, ...) { 
                    if(is.factor(y)) {
                      if(length(lev) == 2) {
                        y <- ifelse(y == lev[1], 1, 0) 
                        dat <- xgb.DMatrix(as.matrix(x), label = y)
                        out <- xgb.train(list(eta = param$eta, 
                                              max_depth = param$max_depth), 
                                         data = dat,
                                         nrounds = param$nrounds,
                                         objective = "binary:logistic",
                                         ...)
                      } else {
                        y <- as.numeric(y) - 1
                        dat <- xgb.DMatrix(as.matrix(x), label = y)
                        out <- xgb.train(list(eta = param$eta, 
                                              max_depth = param$max_depth), 
                                         data = dat,
                                         num_class = length(lev),
                                         nrounds = param$nrounds,
                                         objective = "multi:softprob",
                                         ...)
                      }     
                    } else {
                      dat <- xgb.DMatrix(as.matrix(x), label = y)
                      out <- xgb.train(list(eta = param$eta, 
                                            max_depth = param$max_depth), 
                                       data = dat,
                                       nrounds = param$nrounds,
                                       objective = "reg:linear",
                                       ...)
                    }
                    out


                  },
                  predict = function(modelFit, newdata, submodels = NULL) {
                    newdata <- xgb.DMatrix(as.matrix(newdata))
                    out <- predict(modelFit, newdata)
                    if(modelFit$problemType == "Classification") {
                      if(length(modelFit$obsLevels) == 2) {
                        out <- ifelse(out >= .5, 
                                      modelFit$obsLevels[1], 
                                      modelFit$obsLevels[2])
                      } else {
                        out <- matrix(out, ncol = length(modelFit$obsLevels), byrow = TRUE)
                        out <- modelFit$obsLevels[apply(out, 1, which.max)]
                      }
                    }

                    if(!is.null(submodels)) {
                      tmp <- vector(mode = "list", length = nrow(submodels) + 1)
                      tmp[[1]] <- out
                      for(j in seq(along = submodels$nrounds)) {
                        tmp_pred <- predict(modelFit, newdata, ntreelimit = submodels$nrounds[j])
                        if(modelFit$problemType == "Classification") {
                          if(length(modelFit$obsLevels) == 2) {
                            tmp_pred <- ifelse(tmp_pred >= .5, 
                                               modelFit$obsLevels[1], 
                                               modelFit$obsLevels[2])
                          } else {
                            tmp_pred <- matrix(tmp_pred, ncol = length(modelFit$obsLevels), byrow = TRUE)
                            tmp_pred <- modelFit$obsLevels[apply(tmp_pred, 1, which.max)]
                          }
                        }
                        tmp[[j+1]]  <- tmp_pred
                      }
                      out <- tmp
                    }
                    out  
                  },
                  prob = function(modelFit, newdata, submodels = NULL) {
                    newdata <- xgb.DMatrix(as.matrix(newdata))
                    out <- predict(modelFit, newdata)
                    if(length(modelFit$obsLevels) == 2) {
                      out <- cbind(out, 1 - out)
                      colnames(out) <- modelFit$obsLevels
                    } else {
                      out <- matrix(out, ncol = length(modelFit$obsLevels), byrow = TRUE)
                      colnames(out) <- modelFit$obsLevels
                    }
                    out <- as.data.frame(out)

                    if(!is.null(submodels)) {
                      tmp <- vector(mode = "list", length = nrow(submodels) + 1)
                      tmp[[1]] <- out
                      for(j in seq(along = submodels$nrounds)) {
                        tmp_pred <- predict(modelFit, newdata, ntreelimit = submodels$nrounds[j])
                        if(length(modelFit$obsLevels) == 2) {
                          tmp_pred <- cbind(tmp_pred, 1 - tmp_pred)
                          colnames(tmp_pred) <- modelFit$obsLevels
                        } else {
                          tmp_pred <- matrix(tmp_pred, ncol = length(modelFit$obsLevels), byrow = TRUE)
                          colnames(tmp_pred) <- modelFit$obsLevels
                        }
                        tmp_pred <- as.data.frame(tmp_pred)
                        tmp[[j+1]]  <- tmp_pred
                      }
                      out <- tmp
                    }
                    out  
                  },
                  predictors = function(x, ...) {
                    imp <- xgb.importance(x$xNames, model = x)
                    x$xNames[x$xNames %in% imp$Feature]
                  },
                  varImp = function(object, numTrees = NULL, ...) {
                    imp <- xgb.importance(x$xNames, model = x)
                    imp <- as.data.frame(imp)[, 1:2]
                    rownames(imp) <- as.character(imp[,1])
                    imp <- imp[,2,drop = FALSE]
                    colnames(imp) <- "Overall"
                    imp   
                  },
                  levels = function(x) x$obsLevels,
                  tags = c("Tree-Based Model", "Boosting", "Ensemble Model", "Implicit Feature Selection"),
                  sort = function(x) {
                    # This is a toss-up, but the # trees probably adds
                    # complexity faster than number of splits
                    x[order(x$nrounds, x$max_depth, x$eta),] 
                  })

ajing avatar Jul 24 '15 18:07 ajing

Is there any recommended way to add a new caret model to existing model_list?

Like, I have a model_list, and I want to add a new model:

I can do

model9 <- train(X[train,], Y[train], method='gam', trControl=myControl)

Then,

model_list$gam = model9

Is this right?

ajing avatar Jul 27 '15 04:07 ajing

Yup!

zachmayer avatar Aug 01 '15 21:08 zachmayer

@ajing I had the same issue. I have a dirty workaround that works for me. The function 'makePredObsMatrix' in the caretEnsemble package is throwing the error. Below is the source code of the function with the initial two lines commented (these are just checks. I assume if you know what you're doing, uncommenting this is fine - perhaps @zachmayer can throw more light). I sourced the modified function and added the following lines to my script. It worked. Hope it helps.

source('/media/3TB/kag/caretEnsemble-master/R/helper_functions.R')
assignInNamespace("makePredObsMatrix",makePredObsMatrix, ns="caretEnsemble")

####### ######modified the function in helper_function.R file from caretEnsemble repo

makePredObsMatrix <- function(list_of_models){

  #caretList Checks
  #check_caretList_classes(list_of_models)
  #check_caretList_model_types(list_of_models)

  #Make a list of models
  modelLibrary <- extractBestPreds(list_of_models)
  model_names <- names(modelLibrary)

  #Model library checks
  check_bestpreds_resamples(modelLibrary) #Re-write with data.table?
  check_bestpreds_indexes(modelLibrary) #Re-write with data.table?
  check_bestpreds_obs(modelLibrary) #Re-write with data.table?
  check_bestpreds_preds(modelLibrary) #Re-write with data.table?

  #Extract model type (class or reg)
  type <- extractModelTypes(list_of_models)

  #Add names column
  for(i in seq_along(modelLibrary)){
    set(modelLibrary[[i]], j="modelname", value=names(modelLibrary)[[i]])
  }

  #Remove parameter columns
  keep <- Reduce(intersect, lapply(modelLibrary, names))
  for(i in seq_along(modelLibrary)){
    rem <- setdiff(names(modelLibrary[[i]]), keep)
    if(length(rem) > 0){
      for(r in rem){
        set(modelLibrary[[i]], j=r, value=NULL)
      }
    }
  }
  modelLibrary <- rbindlist(modelLibrary, fill=TRUE)

  #For classification models that produce probs, use the probs as preds
  #Otherwise, just use class predictions
  if (type=="Classification"){
    positive <- as.character(unique(modelLibrary$obs)[2]) #IMPROVE THIS!
    pos <- as.numeric(modelLibrary[[positive]])
    good_pos_values <- which(is.finite(pos))
    set(modelLibrary, j="pred", value=as.numeric(modelLibrary[["pred"]]))
    set(modelLibrary, i=good_pos_values, j="pred", value=modelLibrary[good_pos_values,positive,with=FALSE])
  }

  #Reshape wide for meta-modeling
  modelLibrary <- data.table::dcast.data.table(
    modelLibrary,
    obs + rowIndex + Resample ~ modelname,
    value.var = "pred"
  )

  #Return
  return(list(obs=modelLibrary$obs, preds=as.matrix(modelLibrary[,model_names,with=FALSE]), type=type))
}

tsuresh83 avatar Sep 17 '15 16:09 tsuresh83

@tsuresh83 Could you make a PR on github with the new version of your function so I can see what you changed? If the fix looks good to me and @jknowles I'll merge it into master.

zachmayer avatar Sep 17 '15 16:09 zachmayer

@zachmayer I am afraid this is not a fix. From what I've seen the actual fix is the addition of an entry to the list of models in 'models.RData' from the caret package. What I suggested is only a stop gap measure until the actual fix.

tsuresh83 avatar Sep 17 '15 16:09 tsuresh83

Has there been any progress or workarounds developed for using caretStack for multi-class classfication?

nithum avatar Oct 16 '15 12:10 nithum

Nope, but I welcome pull requests!

At the moment we're in the middle of some code cleanup and bug fixes, which will probably be followed by a CRAN release. Then I'll figure out how to get caretList and caretStack to work with multiclass.

Getting caretEnsemble to work with multiclass will be harder.

zachmayer avatar Oct 16 '15 12:10 zachmayer

Thanks for the quick response @zachmayer! By the way, what is your reference material for the implementation of Stacking? Are you using "Elements of Statistical Learning" or something?

nithum avatar Oct 16 '15 13:10 nithum