
Error when plotting a C5.0 tree with factors which have spaces in levels

Open rakeshnbabu opened this issue 7 years ago • 5 comments

Hello, please see the below code which reproduces a bug in handling the levels of factors with spaces in the name when plotting C5.0 trees. This is in R version 3.4.1 (2017-06-30) x86_64-w64-mingw32 on Windows 7:

library(C50)
data(mtcars)
#Let's add some factors
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$gear <- as.factor(mtcars$gear)
#Let's add some spaces to the factors
levels(mtcars$gear) <- c("3 speed", "4 speed", "5 speed")

myTree <- C5.0(cyl ~ gear, data=mtcars)
plot(myTree)

Error in partysplit(varid = as.integer(i), index = index, info = k, prob = NULL) : 
  minimum of ‘index’ is not equal to 1

The error itself is due to NA values being passed in the index vector. The root cause is probably that the factor levels are being split on spaces, but I'm unable to trace exactly where. On line 212 of as.party.C5.0.R, the for loop which generates the index value produces NAs because the factor levels stored in a1s do not match the factor levels in xlev.
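To make the suspected mechanism concrete, here is a minimal base R sketch. It is not the code from as.party.C5.0.R: the names parsed, index_bad and index_good are hypothetical, and the split on spaces is an assumption about what happens when the tree output is parsed. It only shows why a level broken at a space can no longer be matched against the original levels.

```r
# Hypothetical illustration (base R only); NOT the actual as.party.C5.0 code.
xlev <- c("3 speed", "4 speed", "5 speed")  # levels as stored in the data

# Assumption: a level read back from the tree output was split on spaces
parsed <- unlist(strsplit("3 speed", " ", fixed = TRUE))  # "3" and "speed"

index_bad  <- match(parsed, xlev)     # neither fragment is a known level
index_good <- match("3 speed", xlev)  # the intact level matches

print(index_bad)
print(index_good)
```

Any NA in an index built this way would then trip the check inside partysplit that the minimum of index equals 1.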

rakeshnbabu avatar Aug 04 '17 19:08 rakeshnbabu

There is a similar issue with the same error but a different root cause, which can be traced to the model.frame.C5.0 function. On line 29 of the file as.party.C5.0.R, drop.unused.levels is set to TRUE. In my production code, my decision tree winds up referring to levels which are dropped from the model frame. This causes the same issue, with NAs being passed to partysplit. I've not opened a separate report for this because I am unable to generate a trivial data set to reproduce it. Do you recall why that flag is not set to FALSE?
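I can't reproduce it with C5.0 itself, but the effect of the flag can be sketched with base R's model.frame. The data frame d and the level "z" below are made up for illustration, and this says nothing about how C5.0 ends up referring to a dropped level; it only shows that a lookup against the dropped levels returns NA.

```r
# Hypothetical data: factor f declares level "z", but no row uses it
d <- data.frame(
  y = factor(c("a", "b", "a", "b")),
  f = factor(c("x", "y", "x", "y"), levels = c("x", "y", "z"))
)

mf_drop <- model.frame(y ~ f, data = d, drop.unused.levels = TRUE)
mf_keep <- model.frame(y ~ f, data = d, drop.unused.levels = FALSE)

# Look up the unused level in each model frame
idx_drop <- match("z", levels(mf_drop$f))  # level was dropped
idx_keep <- match("z", levels(mf_keep$f))  # level was kept

print(levels(mf_drop$f))
print(levels(mf_keep$f))
print(c(idx_drop, idx_keep))
```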

rakeshnbabu avatar Aug 04 '17 20:08 rakeshnbabu

This should be fixed in the github version (0.1.1.9000) if you would like to test.

topepo avatar Feb 15 '18 18:02 topepo

I've also changed

mf$drop.unused.levels <- FALSE

for testing

topepo avatar Feb 15 '18 18:02 topepo

Hi, the plotting is good, except it reverses the order of my levels in the plot.

library(C50)
iris$Y <- factor(ifelse(iris$Species == "setosa", "Y", "N"))
levels(iris$Y)
model <- C5.0(Y ~ Sepal.Length, data = iris, rules = FALSE)
plot(model)

Stepping through the code, it seems that in partykit:::plot.party, when the function terminal_node is defined, an argument reverse is set to TRUE.

kohleth avatar Apr 23 '18 10:04 kohleth

That appears to be how partykit does things. Try:

library(partykit)
mod <- ctree(Y~Sepal.Length,data=iris)
plot(mod)

topepo avatar May 21 '18 13:05 topepo
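For anyone who wants the original level order in the plot, a possible workaround is sketched below. It assumes that the installed partykit version exposes a reverse argument on node_barplot and that plot forwards tp_args to that terminal panel; neither point is confirmed in this thread, so check ?node_barplot first. Whether the same argument passes through plot on a C5.0 object is also untested here.

```r
library(partykit)

# Same binary response as in the example above
iris2 <- iris
iris2$Y <- factor(ifelse(iris2$Species == "setosa", "Y", "N"))

mod <- ctree(Y ~ Sepal.Length, data = iris2)

# Assumption: node_barplot() accepts `reverse`; verify before relying on it
has_reverse <- "reverse" %in% names(formals(node_barplot))

if (has_reverse) {
  # Ask the terminal bar plots not to reverse the level order
  plot(mod, tp_args = list(reverse = FALSE))
} else {
  plot(mod)
}
```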