C5.0
Error when plotting a C5.0 tree with factors which have spaces in levels
Hello, the code below reproduces a bug in how factor levels containing spaces are handled when plotting C5.0 trees. This is with R version 3.4.1 (2017-06-30), x86_64-w64-mingw32, on Windows 7:
library(C50)
data(mtcars)
#Let's add some factors
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$gear <- as.factor(mtcars$gear)
#Let's add some spaces to the factors
levels(mtcars$gear) <- c("3 speed", "4 speed", "5 speed")
myTree <- C5.0(cyl ~ gear, data=mtcars)
plot(myTree)
Error in partysplit(varid = as.integer(i), index = index, info = k, prob = NULL) :
minimum of ‘index’ is not equal to 1
The error itself is caused by NA values in the index vector passed to partysplit. The root cause is probably that the factor levels are being split on spaces, but I have not been able to trace exactly where. On line 212 of as.party.C5.0.R, the for loop that generates the index value produces NAs because the factor levels stored in a1s do not match the factor levels in xlev.
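To make the suspected failure mode concrete, here is a minimal base R sketch. It is not the code from as.party.C5.0.R: the names split_text, naive_levels and safe_lev are hypothetical, and splitting on whitespace is only the assumed cause described above.

```r
# Hypothetical illustration in base R; not the package's actual code.
xlev <- c("3 speed", "4 speed", "5 speed")  # the real factor levels

# Assumption: the levels of a split arrive as one text line and are
# separated on spaces, which cuts each level into two fragments.
split_text <- "3 speed 4 speed"
naive_levels <- strsplit(split_text, " ", fixed = TRUE)[[1]]

# None of the fragments is a real level, so every lookup returns NA.
naive_index <- match(naive_levels, xlev)
print(naive_index)

# Possible workaround until a fix is available: replace the spaces in
# the levels before fitting the model.
safe_lev <- gsub(" ", "_", xlev, fixed = TRUE)
print(safe_lev)
```

In the reproduction above, the same gsub() call could be applied to levels(mtcars$gear) before calling C5.0(), although I have not verified that this avoids the error in every case.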
There is a similar issue with the same error but a different root cause, which can be traced to the model.frame.C5.0 function. On line 29 of the file as.party.C5.0.R, drop.unused.levels is set to TRUE. In my production code, my decision tree winds up referring to levels which are dropped from the model frame. This causes the same issue, with NAs being passed to partysplit. I've not opened a separate report for this because I am unable to generate a trivial data set to reproduce it. Do you recall why that flag is not set to FALSE?
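The effect of that flag can be seen with base R's model.frame() on a toy data frame. This is only a sketch of the mechanism, not a reproduction of the C50 problem: the data frame d and the level "c" are made up for illustration.

```r
# Toy data (hypothetical): level "c" is declared but never observed.
d <- data.frame(
  y = c(1, 2, 3, 4),
  f = factor(c("a", "b", "a", "b"), levels = c("a", "b", "c"))
)

# Build the model frame with and without dropping unused levels.
mf_drop <- model.frame(y ~ f, data = d, drop.unused.levels = TRUE)
mf_keep <- model.frame(y ~ f, data = d, drop.unused.levels = FALSE)

print(levels(mf_drop$f))
print(levels(mf_keep$f))

# Looking up a dropped level gives NA, the kind of value that
# partysplit() rejects in its index argument.
idx_drop <- match("c", levels(mf_drop$f))
idx_keep <- match("c", levels(mf_keep$f))
```

If the fitted tree mentions a level that exists in the original data but not in the model frame, a lookup like idx_drop would be NA, which matches the symptom described above.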
This should be fixed in the GitHub version (0.1.1.9000) if you would like to test.
I've also set mf$drop.unused.levels <- FALSE for testing.
Hi, the plotting is good, except it reverses the order of my levels in the plot.
library(C50)
iris$Y=factor(ifelse(iris$Species=='setosa','Y','N'))
levels(iris$Y)
model=C5.0(Y~Sepal.Length,data=iris,rules=FALSE)
plot(model)
Stepping through the code, it seems that in partykit:::plot.party, when the function terminal_node is defined, an argument reverse is set to TRUE.
That appears to be how partykit does things. Try:
library(partykit)
mod <- ctree(Y~Sepal.Length,data=iris)
plot(mod)