caret icon indicating copy to clipboard operation
caret copied to clipboard

GAM models in caret do not support Poisson family models

Open nickreich opened this issue 10 years ago • 6 comments
trafficstars

From what I can see in the model code for the GAM models (gam, gamSpline, gamLoess), there is no support for Poisson outcomes, only binomial and gaussian. It would be great to add support for these types of outcomes (I coded up a kludgy custom caret model, if I can spruce it up, I might submit a pull request). But at the very least it should be explicitly stated somewhere in the documentation and/or on the caret website that it is impossible to specify a distribution family for these types of models. It took me a lot of error tracing, and then digging into the codebase for caret to realize that the "family" argument was being hard coded based on the class of "y". Here is a simple illustration that shows how behavior that works for GLMs does not work for GAMs:

dat <- data.frame(label=round(rpois(100,20)),v1=rnorm(100),v2=rnorm(100))
tc <- trainControl("cv",10,savePred=T)
(fit <- train(label~.,data=dat,method="glm",trControl=tc,family=poisson(link = "log")))

(fit1 <- train(label~.,data=dat,method="gam",trControl=tc,family=poisson(link = "log")))

nickreich avatar Mar 17 '15 00:03 nickreich

Good point. I think GBM sometimes has similar problems when the user-specified distribution conflicts with the default chosen by GBM.

I think we need to clean up the code a little for both of these models.

If you want to submit a PR, update the code in this file: https://github.com/topepo/caret/blob/master/models/files/gam.R

zachmayer avatar Mar 17 '15 13:03 zachmayer

Basically, we need to intercept the family argument. Here is a prototype of the changes for method = "gam". I'm testing it and if it works fine will apply the same changes to the other GAM functions and GBM:

modelInfo <- list(label = "Generalized Additive Model using Splines",
                  library = "mgcv",
                  loop = NULL,
                  type = c('Regression', 'Classification'),
                  parameters = data.frame(parameter = c('select', 'method'),
                                          class = c('logical', 'character'),
                                          label = c('Feature Selection', 'Method')),
                  grid = function(x, y, len = NULL) 
                    expand.grid(select = c(TRUE, FALSE), method = "GCV.Cp"),
                  fit = function(x, y, wts, param, lev, last, classProbs, ...) { 
                    dat <- if(is.data.frame(x)) x else as.data.frame(x)
                    modForm <- caret:::smootherFormula(x)
                    if(is.factor(y)) {
                      dat$.outcome <- ifelse(y == lev[1], 1, 0)
                      dist <- binomial()
                    } else {
                      dat$.outcome <- y
                      dist <- gaussian()
                    }
                    modelArgs <- list(formula = modForm,
                                      data = dat,
                                      select = param$select, 
                                      method = as.character(param$method))
                    ## Intercept family if passed in
                    theDots <- list(...)
                    if(!any(names(theDots) == "family")) modelArgs$family <- dist
                    modelArgs <- c(modelArgs, theDots)

                    out <- do.call(getFromNamespace("gam", "mgcv"), modelArgs)
                    out

                  },
                  predict = function(modelFit, newdata, submodels = NULL) {
                    if(!is.data.frame(newdata)) newdata <- as.data.frame(newdata)
                    if(modelFit$problemType == "Classification") {
                      probs <-  predict(modelFit, newdata, type = "response")
                      out <- ifelse(probs < .5,
                                    modelFit$obsLevel[1],
                                    modelFit$obsLevel[2])
                    } else {
                      out <- predict(modelFit, newdata, type = "response")
                    }
                    out
                  },
                  prob = function(modelFit, newdata, submodels = NULL){
                    if(!is.data.frame(newdata)) newdata <- as.data.frame(newdata)
                    out <- predict(modelFit, newdata, type = "response")
                    out <- cbind(1-out, out)
                    ## glm models the second factor level, we treat the first as the
                    ## event of interest. See Details in ?glm
                    colnames(out) <-  modelFit$obsLevels
                    out
                  },
                  predictors = function(x, ...) {
                    predictors(x$terms)
                  },
                  varImp = function(object, ...) {
                    smoothed <- summary(object)$s.table[, "p-value", drop = FALSE]
                    linear <- summary(object)$p.table
                    linear <- linear[, grepl("^Pr", colnames(linear)), drop = FALSE] 
                    gams <- rbind(smoothed, linear)
                    gams <- gams[rownames(gams) != "(Intercept)",,drop = FALSE]
                    rownames(gams) <- gsub("^s\\(", "", rownames(gams))
                    rownames(gams) <- gsub("\\)$", "", rownames(gams))
                    colnames(gams)[1] <- "Overall"
                    gams <- as.data.frame(gams)
                    gams$Overall <- -log10(gams$Overall)
                    allPreds <- colnames(attr(object$terms,"factors"))
                    extras <- allPreds[!(allPreds %in% rownames(gams))]
                    if(any(extras)) {
                      tmp <- data.frame(Overall = rep(NA, length(extras)))
                      rownames(tmp) <- extras
                      gams <- rbind(gams, tmp)
                    }
                    gams
                  },
                  tags = c("Generalized Linear Model", "Generalized Additive Model"),
                  sort = function(x) x)

topepo avatar Apr 01 '15 19:04 topepo

I just checked in new versions of model code for gbm, gam, gamLoess and gamSpline along with new regression tests. Check them out when you get a chance. I'll close this for now.

topepo avatar Apr 02 '15 17:04 topepo

Thanks for the fix. Hoping to find time to try this out soon.

nickreich avatar Apr 04 '15 18:04 nickreich

I just tried you unpublished glmnet_h2o. It also needs similar distribution family support, otherwise it can only run binomial and gaussian.

wenjiec avatar Oct 21 '16 13:10 wenjiec

Got glmnet_h2o fixed with the distribution family support by following the threads above.

The other suggestion is that each iteration would convert the same data.frame to h20frame by using as.h2o function, which takes a long time. You may think about some way to convert the data.frame only once.

fit = function(x, y, wts, param, lev, last, classProbs, ...) {
  dat <- if(!is.data.frame(x)) as.data.frame(x) else x
  dat$.outcome <- y

  args <- list(x = colnames(x),
               alpha = param$alpha, 
               lambda = param$lambda)

  args$training_frame <- as.h2o(dat, destination_frame = paste0("tmp_glmnet_dat_",sample.int(100000, 1)))
  args$y <- ".outcome"

  theDots <- list(...)

  if(!any(names(theDots) == "family")) 
    args$family <- ifelse(is.factor(y), "binomial", "gaussian")

  if(length(theDots) > 0) args <- c(args, theDots)

  out <- do.call(getFromNamespace("h2o.glm", "h2o"), args)
  h2o.getModel(out@model_id) 
},

wenjiec avatar Oct 21 '16 20:10 wenjiec