
Cannot reproduce measure (delta L) values of featureImportance for individual observations outside of featureImportance

kransom14 opened this issue • 3 comments

I would like to get the local feature importance of the ith observation for individual variables, as described in Casalicchio et al. (section 4, lines 7-8), by integrating the ICI curve for the ith observation. However, I am unsure how to interpret the negative delta L values reported by the function, and I cannot reproduce them outside of the featureImportance() function. Could this be an issue with not using the mlr package, or am I not interpreting the delta L value correctly?
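
For reference, my understanding of the quantity from the paper (please correct me if I misread it): for the ith observation and a given feature, the change in loss when the feature value is replaced by a permuted value is

    delta L_i = L(yhat_perm_i, y_i) - L(yhat_i, y_i)

where yhat_i is the prediction with the original feature value and yhat_perm_i is the prediction with the permuted value. With a single-observation rmse as the measure (as below), this is just the change in absolute error.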

Below I set up a gbm model like in this issue: https://github.com/giuseppec/featureImportance/issues/5

library(DALEX)
library(featureImportance)
library(gbm)
library(pracma)
library(tidyverse)

mod <- gbm(m2.price~., data = apartments)
summary(mod)

pred.fun <- function(object, newdata) {
  # use the 'object' argument passed in by featureImportance (not the global 'mod') and return the predictions
  predict(object, newdata, n.trees = 100)
}

# you need to define the performance measure, I use the rmse here
measure <- function(truth, response)
  sqrt(mean((truth-response)^2))

# calculate the local permutation feature importance for all features
imp <- featureImportance(mod, data = apartments, target = "m2.price", n.feat.perm = 100, 
                         local = TRUE, predict.fun = pred.fun, measures = list(rmse=measure))

# save permutation importance to a data frame
imp_df <- imp$importance

# get the values for the surface feature only
surface <- filter(imp_df, features == "surface")
# get the values for the surface feature only for the 23rd observation, which has negative reported measures (change in rmse?)
surface_23 <- surface %>%
  filter(row.id == 23) %>%
  mutate(feature.value = as.numeric(as.character(feature.value))) %>% # need to convert feature.value to numeric
  arrange(feature.value)
# change in rmse for observation 23
ggplot(surface_23, aes(x = feature.value, y = rmse)) + geom_line() # I would like to integrate this curve to get a summary measure of the local feature importance for this observation

# try manually calculating the rmse and change in rmse for a prediction
apartments[23, ]
predict(mod, apartments[23, ], n.trees = 100)
measure(truth = 5170, response = 4776.9) # rmse of the original prediction, using the measure defined above
new <- apartments[23, ]
new$surface <- 50 # replace the surface variable value with the permuted value for permutation 66
predict(mod, new, n.trees = 100)
measure(truth = 5170, response = 5624.19) # rmse of the prediction with the permuted surface value
delL <- 393.1 - 454.19 # this should be the value described in Casalicchio et al. section 4 lines 1-6, correct?
delL # but the featureImportance function reports -44.26 for this case

kransom14 avatar Jul 11 '19 18:07 kransom14

Ha! It took me some time to figure out, but the issue is actually a (strange) behaviour of gbm's prediction function when predicting only single observations. Look, for example, at the result of this code:

set.seed(1)
mod = gbm(m2.price~., data = apartments)
# make predictions for all data points
p = predict(mod, newdata = apartments, n.trees = 100)
# extract prediction of 23rd observation
p[23]

# make predictions only for the 23rd observation; the result differs from the previous value
predict(mod, apartments[23, ], n.trees = 100)

Here we would expect the same prediction, right? But it's not the same. Other algorithms (e.g., random forest) do not suffer from this issue! Anyway, this is why you don't get the same results in your case: you manually predict single observations, while the featureImportance function internally does not. If you replace gbm with randomForest in your code, it should be fine.
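
If you go the randomForest route, here is a minimal sketch of the substitution (the randomForest model and the simplified prediction function are my additions, not part of your original code):

library(randomForest)

# same setup as above, but with a random forest; its predict method gives
# consistent results for single observations
mod_rf <- randomForest(m2.price ~ ., data = apartments)
pred.fun.rf <- function(object, newdata) predict(object, newdata)

imp_rf <- featureImportance(mod_rf, data = apartments, target = "m2.price",
  n.feat.perm = 100, local = TRUE, predict.fun = pred.fun.rf,
  measures = list(rmse = measure))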

Here is a modification of your code that will also work with gbm (I basically predict on all observations and extract the 23rd prediction afterwards, rather than making predictions only for the 23rd observation):

library(DALEX)
library(featureImportance)
library(gbm)
library(pracma)
library(tidyverse)

mod <- gbm(m2.price~., data = apartments)
summary(mod)

pred.fun <- function(object, newdata) {
  # use the 'object' argument passed in by featureImportance (not the global 'mod') and return the predictions
  predict(object, newdata, n.trees = 100)
}

# you need to define the performance measure, I use the rmse here
rmse_measure <- function(truth, response)
  sqrt(mean((truth-response)^2))

set.seed(1)
# calculate the local permutation feature importance for all features
imp <- featureImportance(mod, data = apartments, target = "m2.price", n.feat.perm = 100, 
  local = TRUE, predict.fun = pred.fun, measures = list(rmse=rmse_measure))

# save permutation importance to a data frame
imp_df <- imp$importance

# get the values for the surface feature only
surface <- filter(imp_df, features == "surface")
# get the values for the surface feature only for the 23rd observation, which has negative reported measures (change in rmse?)
surface_23 <- surface %>%
  filter(row.id == 23) %>%
  mutate(feature.value = as.numeric(as.character(feature.value))) %>% # need to convert feature.value to numeric
  arrange(feature.value)
# change in rmse for observation 23
ggplot(surface_23, aes(x = feature.value, y = rmse)) + geom_line() # I would like to integrate this curve to get a summary measure of the local feature importance for this observation (see the integration sketch after this code block)

# try manually calculating the rmse and change in rmse for a prediction 
i = 23 # choose 23rd observation
f = predict(mod, apartments, n.trees = 100)
f_i = f[23]
y_i = apartments$m2.price[i]
loss_i = rmse_measure(y_i, f_i)
loss_i

new <- apartments
new[23, "surface"] <- 50
f = predict(mod, new, n.trees = 100) # we need to make predictions for all data and extract the 23rd
f_replaced = f[23]
loss_replaced = rmse_measure(y_i, f_replaced)
loss_replaced

delL <- loss_replaced - loss_i # this should be the value described in Casalicchio et al. section 4 lines 1-6 correct?
delL # compare this value with the plot above, should be fine now
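
Regarding the comment above about integrating the ICI curve: here is a minimal sketch of one way to do it with pracma::trapz (which you already load), using the corrected surface_23 from the code above; dividing by the feature range is just one possible normalization:

# integrate the change in rmse over the permuted surface values (trapezoidal rule);
# surface_23 is already sorted by feature.value above
ici_integral <- trapz(surface_23$feature.value, surface_23$rmse)
# normalizing by the range of the permuted values gives an average change in rmse
ici_mean <- ici_integral / diff(range(surface_23$feature.value))
ici_integral
ici_mean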

Another thing you should keep in mind: currently, you are using the training data to calculate the feature importance (which I feel is as unusual as using the training data to measure the performance of a machine learning model, especially because the feature importance is itself based on measuring performance). So be careful when interpreting your results! I'd rather suggest using separate test data to measure the feature importance (e.g., via holdout or repeated holdout). If you use mlr, this will be much easier (see the previous GitHub issue).
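
If you want to try this without mlr, here is a minimal sketch of a simple holdout, reusing pred.fun and rmse_measure from above (the 70/30 split and the seed are arbitrary choices of mine):

set.seed(1)
# train on 70% of the data, compute the (local) importance on the held-out 30%
train_idx <- sample(nrow(apartments), size = floor(0.7 * nrow(apartments)))
train <- apartments[train_idx, ]
test <- apartments[-train_idx, ]

mod_holdout <- gbm(m2.price ~ ., data = train)

imp_holdout <- featureImportance(mod_holdout, data = test, target = "m2.price",
  n.feat.perm = 100, local = TRUE, predict.fun = pred.fun,
  measures = list(rmse = rmse_measure))
head(imp_holdout$importance)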

Let me know if this answer helps.

giuseppec avatar Jul 12 '19 21:07 giuseppec

Thanks! Yes, this is very helpful. I appreciate your spending time on this and your interesting paper and package.

When I predict on a single observation with gbm I get the same result as when I predict on all the observations. I wonder if it's a version issue? I am using gbm 2.1.5. It might be a bug in gbm.

Either way, I followed your example above and I can match the measure values (change in rmse) now.

So the testing data is better for permutation importance? I am used to the internal importance metrics specific to each method, which rely on the trained model and the training data. But I'd like to move to model-agnostic methods like permutation importance.

My goal here is to see if different features are more or less important for single observations. I planned to integrate the ICI curves for each observation and then see which variable has the largest value for each observation, but I am not sure the integrals are comparable because the features all have different ranges. Please let me know if you have suggestions on this approach.

kransom14 avatar Jul 12 '19 23:07 kransom14

Hi, using test data to compute the permutation importance just feels more natural. You can still get importance values for every individual observation in your available data. E.g., if you split your data using 3-fold cross-validation, you can train your model on the training folds and assess the importance of the individual observations in the test fold. If you do this for each fold, you end up with (unbiased) importance values for each single observation, computed on held-out test data. If you did this on the training data, the importance values would be biased (i.e., overly optimistic, for the same reason that the training error is an overly optimistic estimate of the generalization performance).
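
Here is a minimal sketch of this scheme without mlr, reusing pred.fun and rmse_measure from above (the fold assignment is an arbitrary choice, and the mapping back to the original rows assumes that row.id indexes the rows of the data passed to featureImportance, as in the calls above):

set.seed(1)
k <- 3
fold <- sample(rep(1:k, length.out = nrow(apartments)))

imp_cv <- lapply(1:k, function(f) {
  train <- apartments[fold != f, ]
  test <- apartments[fold == f, ]
  mod_f <- gbm(m2.price ~ ., data = train)
  imp_f <- featureImportance(mod_f, data = test, target = "m2.price",
    n.feat.perm = 100, local = TRUE, predict.fun = pred.fun,
    measures = list(rmse = rmse_measure))$importance
  # map the fold-local row.id back to the row number in the full apartments data
  imp_f$orig.row <- which(fold == f)[imp_f$row.id]
  imp_f
})
# every observation appears in exactly one test fold, so the combined table
# contains held-out importance values for all observations
imp_cv <- bind_rows(imp_cv)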

Although features have different scales, their corresponding local feature importance values (the changes in loss) are on the same scale. So you should be able to do what you described. I am working on something similar. Let me know if you need further information.
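
A minimal sketch of one way to do the per-observation comparison, using the importance table from above (imp_df, or the combined imp_cv from the cross-validation sketch) and simply averaging the change in rmse over the permutations for each observation/feature pair (the integral of the ICI curve, as sketched earlier, would be an alternative summary):

# for each observation and feature, average the change in rmse over all permutations
local_imp <- imp_df %>%
  group_by(row.id, features) %>%
  summarise(mean_delta_rmse = mean(rmse)) %>%
  ungroup()

# for each observation, pick the feature with the largest average change in rmse
top_feature <- local_imp %>%
  group_by(row.id) %>%
  arrange(desc(mean_delta_rmse)) %>%
  slice(1) %>%
  ungroup()
head(top_feature)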

giuseppec avatar Jul 25 '19 05:07 giuseppec