
categorical variables mishandled by `varimp`, `variable.step`, `bart.step`

Open AMBarbosa opened this issue 2 years ago • 4 comments

When the predictors include categorical variables, `dbarts::bart` keeps them but embarcadero removes them. This appears to be because embarcadero decides which variables to remove based on the names in `unlist(attr(model$fit$data@x, "drop"))`, where each categorical variable is split and renamed to reflect its categories, so its original name no longer appears. As a result, `varimp` fails with an error for models that include categorical predictors, and `variable.step` and `bart.step` automatically exclude categorical variables a priori, with a message that unfairly blames dbarts. Here is some reproducible code:

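# load the packages used below:
library(dbarts)       # bart()
library(embarcadero)  # varimp(), variable.step(), bart.step()

# generate some data as in ?bart examples: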

f <- function(x) {
  10 * sin(pi * x[,1] * x[,2]) + 20 * (x[,3] - 0.5)^2 +
    10 * x[,4] + 5 * x[,5]
}

set.seed(99)
sigma <- 1.0
n     <- 100

x  <- matrix(runif(n * 10), n, 10)
Ey <- f(x)
y  <- rnorm(n, Ey, sigma)


# make 'y' binary:
y <- ifelse(y > mean(y), 1, 0)

# make one of the x variables categorical:
x <- data.frame(x)
x[,1] <- ifelse(x[,1] > mean(x[,1]), "high", "low")
head(x)


# fit a bart model:
set.seed(99)
bartFit <- bart(x, y, keeptrees = TRUE)

summary(bartFit)  # notice 10 variables (i.e. including the categorical one) in predictor list

bartFit$fit$data
unlist(attr(bartFit$fit$data@x, "drop"))  # notice X1 (categorical variable) named here as X11 and X12 (one for each category)
# X11 X12  X2  X3  X4  X5  X6  X7  X8  X9 X10 
#  52  48   0   0   0   0   0   0   0   0   0 

# attempt to compute variable importance with 'embarcadero':
varimp(bartFit)  # Error in data.frame(names, varimps) : arguments imply differing number of rows: 9, 10

# but the variable importance info is there, including for the categorical variable (though it's also renamed here):
rel_imp <- bartFit$varcount / rowSums(bartFit$varcount)
colnames(rel_imp)
# [1] "X1.low"   "X2"     "X3"     "X4"     "X5"     "X6"     "X7"     "X8"     "X9"     "X10"

# attempt to simplify the model with 'embarcadero':
variable.step(x, y)  # X1 (categorical variable) said to be dropped by 'dbarts', but it wasn't really -- it was dropped by 'embarcadero' when expecting unlist(attr(bartFit$fit$data@x, "drop")) to have the original variables' names
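
The renaming can be reproduced without fitting a model. The sketch below is only an illustration in base R, not the actual code of embarcadero or dbarts. `drop_attr` is a hypothetical stand-in for `attr(bartFit$fit$data@x, "drop")`, and it assumes (judging from the output above) that this attribute is a list with one element per original predictor, where the categorical one holds one value per category.

```r
# Hypothetical stand-in for attr(bartFit$fit$data@x, "drop"); the list
# structure is an assumption inferred from the output printed above.
drop_attr <- c(
  list(X1 = c(52L, 48L)),                           # categorical: one value per category
  setNames(rep(list(FALSE), 9), paste0("X", 2:10))  # numeric predictors
)

# unlist() appends a running index to the name of any element longer than 1,
# so "X1" turns into "X11" and "X12" and no longer matches the original name:
split_names <- names(unlist(drop_attr))
print(split_names)

# The names of the list itself are still the original predictor names:
orig_names <- names(drop_attr)
print(orig_names)

# Each split name can be mapped back to the predictor it came from:
lookup <- setNames(rep(orig_names, lengths(drop_attr)), split_names)
print(lookup[c("X11", "X12", "X10")])
```

Matching against the list names (or a lookup like this one) instead of the unlisted names would keep X1 among the predictors. Whether that is the right fix inside embarcadero is not tested here.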

AMBarbosa • Dec 05 '22 11:12

I can attempt to fix this and submit a pull request for your consideration once I've managed to.

AMBarbosa • Dec 14 '22 14:12

> I can attempt to fix this and submit a pull request for your consideration once I've managed to.

Wondering if you managed to fix this? I'd be interested, and it'd be very much appreciated, @AMBarbosa

charleygros • Jan 04 '23 04:01

I'm working on it. I forked 'embarcadero', and if you install my branch with install_github('AMBarbosa/embarcadero') (function 'install_github' from the 'devtools' or 'remotes' package) you can try it out already. I'd actually appreciate some feedback on how it's working. I still haven't finished testing it, nor adapting it to 'rbart' models.

AMBarbosa • Jan 04 '23 11:01

@AMBarbosa: many thanks for this. I installed your branch and tested it on my data: the results all made sense to me and the functions worked as expected. I haven't looked at the changes themselves, though. Great work!

charleygros • Jan 12 '23 00:01