xspliner

Error in `[.data.frame`(data, , x) : undefined columns selected

Open pkdism opened this issue 5 years ago • 6 comments

Hi, I'm trying to run xspline() and get this error when passing a simple ranger model to it. I suspect the issue is in this call (it is the only place where data is subsetted): classes <- get_predictors_classes(data[, predictors])

Here is a reproducible example:

> temp_data <- data.frame(a = rnorm(n = 1000), b = rnorm(n = 1000), y = sample(x = c(TRUE, FALSE), size = 1000, replace = TRUE))
> model_rf <- ranger::ranger(formula = factor(y) ~ ., data = temp_data, min.node.size = 5, importance = "permutation", probability = TRUE)
> xs <- xspliner::xspline(model_rf, data = temp_data)
Error in `[.data.frame`(data, , x) : undefined columns selected
> explainer_rf <- DALEX::explain(model_rf, y = temp_data$y, data = temp_data)
> xs <- xspliner::xspline(explainer_rf, data = temp_data)
Error in `[.data.frame`(data, , x) : undefined columns selected

pkdism avatar Sep 03 '19 06:09 pkdism

@pkdism Thank you for noticing that issue. I'll take a look at it probably tomorrow.

krystian8207 avatar Sep 03 '19 13:09 krystian8207

@pkdism Please install the latest version of xspliner from GitHub; I fixed that some time ago. To build the model, please use:

temp_data <- data.frame(a = rnorm(n = 1000), b = rnorm(n = 1000), y = factor(sample(x = c(TRUE, FALSE), size = 1000, replace = TRUE)))
model_rf <- ranger::ranger(formula = y ~ ., data = temp_data, min.node.size = 5, importance = "permutation", probability = TRUE)
xs <- xspliner::xspline(model_rf, lhs = "y", response = "y", data = temp_data)

So two things have changed in the code:

  1. Providing lhs = "y", response = "y". Usually xspliner extracts these parameters from the model automatically. In the case of ranger they are stored differently, so you need to pass them manually (I'll extend this to ranger objects soon).
  2. Converting y to a factor in the original dataset. xspliner detects classification problems automatically, based on the type of the response variable in the dataset. In this case it wasn't a factor. Alternatively, you can pass lhs = "factor(y)", response = "y" and family = binomial() to define a classification problem (as you would with stats::glm).
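
To illustrate the difference between the two arguments, here is a minimal base R sketch. It does not call xspliner and is not a description of its internals; it only shows, under that assumption, why the left-hand-side term "factor(y)" and the response column "y" are different things, and how to check whether the response is already a factor. The names lhs_term, response_name, lhs_is_column and response_is_column are made up for this example.

```r
# Formula as used in the original report: the response is wrapped in factor().
model_formula <- factor(y) ~ a + b

# The left-hand-side term as text, and the variable it actually refers to.
lhs_term <- deparse(model_formula[[2]])        # the expression on the left of ~
response_name <- all.vars(model_formula[[2]])  # the underlying column name

# Small illustrative dataset with a logical response, as in the report.
temp_data <- data.frame(a = c(0.1, 0.2, 0.3), b = c(1, 2, 3), y = c(TRUE, FALSE, TRUE))

# Only the bare variable name is a column of the data; the wrapped term is not,
# so subsetting the data by the wrapped term would fail.
lhs_is_column <- lhs_term %in% names(temp_data)
response_is_column <- response_name %in% names(temp_data)

# Converting the response once in the dataset lets the plain formula y ~ . be used.
temp_data$y <- factor(temp_data$y)
```

With the response stored as a factor in the data, the same name can be passed as both lhs and response, as in the call above.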

I hope it helps. Please let me know if it fixed your problem.


krystian8207 avatar Sep 03 '19 13:09 krystian8207

@krystian8207 Thank you. It works on the above example, but not on my actual data. I am using a similar setup. The actual data is of class data.frame and has predictors of class factor, numeric, integer and logical. There are no NAs. The target variable is a factor with 2 levels. I'm getting the following error on calling xspliner::xspline(model_obj, lhs = "my_target_var", response = "my_target_var", data = my_data):

Error in .f(.x[[i]], .y[[i]], ...) : Wrong class passed.

I hope it helps.

pkdism avatar Sep 04 '19 10:09 pkdism

@pkdism Thank you for reporting this problem. Apparently xspliner did not allow logical variables in the data, although stats::glm does. I have now allowed logical variables as well but, by the model's logic, they cannot be transformed (neither with a spline nor by merging variable levels). Please install the latest version from GitHub; I hope it works well now.
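
For anyone on an older version, a quick way to see which columns are logical is sketched below in base R. The data frame and the names col_classes and logical_cols are made up for illustration, and converting logical columns to factors is only an assumed workaround, not something xspliner requires.

```r
# Illustrative data with the predictor classes mentioned in the report.
my_data <- data.frame(
  num  = c(1.5, 2.5, 3.5),
  int  = 1:3,
  flag = c(TRUE, FALSE, TRUE),
  grp  = factor(c("a", "b", "a"))
)

# First class of every column, as a named character vector.
col_classes <- vapply(my_data, function(col) class(col)[1], character(1))

# Names of the logical columns.
logical_cols <- names(col_classes)[col_classes == "logical"]

# Assumed workaround: store logical columns as two-level factors instead.
my_data[logical_cols] <- lapply(my_data[logical_cols], factor)
```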

Note: Please be careful with integer variables in the dataset. When such a variable has only a few distinct values, it might not be possible to approximate it with a spline. In that case, it's worth excluding the variable from transformation by specifying it as a bare variable (please check here).

krystian8207 avatar Sep 04 '19 20:09 krystian8207

@krystian8207 I'm getting the following error now. The setup remains the same.

Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) : 
  A term has fewer unique covariate combinations than specified maximum degrees of freedom

pkdism avatar Sep 10 '19 04:09 pkdism

Hi. It seems that some of the variables cannot be approximated with splines. In such cases, it's recommended not to transform them (using the bare parameter). One difficulty can be detecting which variable causes the error. To make this easier, please set the following option, which reports more about the steps taken while building the model:

options("xspliner.log" = TRUE)

Now you'll be able to see which variable caused the error.
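
As a complementary check before fitting, the base R sketch below counts distinct values per numeric predictor and flags those with only a few. The helper low_cardinality and the threshold of 10 are assumptions made for this example: 10 is the default basis dimension of a one-dimensional mgcv smooth, but the value xspliner actually uses may differ.

```r
# Hypothetical helper: returns the number of unique values for every numeric
# predictor that has fewer than `min_unique` distinct values.
low_cardinality <- function(data, predictors, min_unique = 10) {
  is_num <- vapply(data[predictors], is.numeric, logical(1))
  numeric_cols <- predictors[is_num]
  n_unique <- vapply(data[numeric_cols], function(col) length(unique(col)), integer(1))
  n_unique[n_unique < min_unique]
}

# Illustrative data: `x` is continuous, `rating` is an integer with 5 levels.
example_data <- data.frame(
  x      = seq(0, 1, length.out = 50),
  rating = rep(1:5, 10),
  y      = factor(rep(c("a", "b"), 25))
)

# Candidates to leave untransformed, for example via the bare parameter.
flagged <- low_cardinality(example_data, c("x", "rating"))
```

Variables returned by such a check are the likely candidates to exclude from transformation; the log option above remains the way to confirm which one actually fails.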

krystian8207 avatar Sep 16 '19 00:09 krystian8207