C5.0
C5.0 fails with commas in input variables
C5.0() now fails on factor variables that include commas, where it did not before.
I recently updated my version of C50, and tried to train a model on a data set I've trained C5.0 models on before. I now receive the error "c50 code called exit with value 1". I narrowed it down to one factor variable that had commas in the values. After removing the commas, the model trained fine. Below is a small example I created to replicate the problem.
Thank you very much!
> ## PURPOSE: Replicate an error in C5.0 model training with commas
>
> # define 2 different data frames, one with commas
>
> # df no commas
> v1 = c(2, 3, 5, 7, 2, 4, 5, 2)
> v2 = c("aa", "bb", "cc", "dd", "aa", "bb", "aa", "bb")
> v3 = factor(c(1, 0, 0, 0, 1, 0, 1, 1) )
> dfNoCommas = data.frame(v1, v2, v3)
>
> # df with commas
> v1 = c(2, 3, 5, 7, 2, 4, 5, 2)
> v2 = c("a,a", "b,b", "c,c", "d,d", "a,a", "b,b", "a,a", "b,b")
> v3 = factor(c(1, 0, 0, 0, 1, 0, 1, 1) )
> dfCommas = data.frame(v1, v2, v3)
>
> # load C5 library
> library(C50)
>
> # train a model with the no commas df
> trainNoCommas <- C5.0(formula = v3 ~ .
+ , data = dfNoCommas[,!colnames(dfNoCommas) %in% c("v3")]
+ , trials = 1
+ , rules = TRUE
+ , control = C5.0Control()
+ )
>
> # train a model with the commas df
> trainCommas <- C5.0(formula = v3 ~ .
+ , data = dfCommas[,!colnames(dfCommas) %in% c("v3")]
+ , trials = 1
+ , rules = TRUE
+ , control = C5.0Control()
+ )
c50 code called exit with value 1
>
> # see package versions
> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] grid parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] RODBC_1.3-15 C50_0.1.1 AUC_0.3.0 adabag_4.2 pROC_1.11.0 smbinning_0.6 Formula_1.2-2 partykit_1.2-0
[9] rpart_4.1-11 mvtnorm_1.0-7 libcoin_1.0-1 sqldf_0.4-11 RSQLite_2.1.0 gsubfn_0.7 proto_1.0.0 stringr_1.3.0
[17] caret_6.0-79 ggplot2_2.2.1 lattice_0.20-35 doParallel_1.0.11 iterators_1.0.9 foreach_1.4.4
loaded via a namespace (and not attached):
[1] nlme_3.1-131 lubridate_1.7.2 bit64_0.9-7 dimRed_0.1.0 tools_3.4.3 R6_2.2.2 DBI_0.8
[8] lazyeval_0.2.1 colorspace_1.3-2 nnet_7.3-12 withr_2.1.1 tidyselect_0.2.4 mnormt_1.5-5 bit_1.1-12
[15] compiler_3.4.3 chron_2.3-52 Cubist_0.2.1 scales_0.5.0 sfsmisc_1.1-2 DEoptimR_1.0-8 psych_1.7.8
[22] robustbase_0.92-8 digest_0.6.15 foreign_0.8-69 pkgconfig_2.0.1 rlang_0.2.0 ddalpha_1.3.2 bindr_0.1
[29] dplyr_0.7.4 ModelMetrics_1.1.0 magrittr_1.5 Matrix_1.2-12 Rcpp_0.12.15 munsell_0.4.3 abind_1.4-5
[36] stringi_1.1.6 inum_1.0-0 MASS_7.3-47 plyr_1.8.4 recipes_0.1.2 blob_1.1.1 splines_3.4.3
[43] pillar_1.2.1 tcltk_3.4.3 xgboost_0.6.4.1 reshape2_1.4.3 codetools_0.2-15 stats4_3.4.3 CVST_0.2-1
[50] magic_1.5-8 glue_1.2.0 data.table_1.10.4-3 gtable_0.2.0 purrr_0.2.4 tidyr_0.8.0 kernlab_0.9-25
[57] assertthat_0.2.0 DRR_0.0.3 gower_0.1.2 prodlim_1.6.1 broom_0.4.3 class_7.3-14 survival_2.41-3
[64] geometry_0.3-6 timeDate_3043.102 RcppRoll_0.2.2 tibble_1.4.2 memoise_1.1.0 bindrcpp_0.2 lava_1.6.1
[71] ipred_0.9-6
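For anyone trying to narrow this down on a wider data set, here is a rough sketch of how the offending columns might be located. `columns_with_commas` is a hypothetical helper name, not something from C50, and it assumes that only character and factor columns matter here.

```r
# Hypothetical helper (not part of C50): return the names of character or
# factor columns whose values contain at least one comma.
columns_with_commas <- function(df) {
  has_comma <- vapply(df, function(x) {
    # For factors, inspect the level labels rather than the integer codes.
    if (is.factor(x)) x <- levels(x)
    is.character(x) && any(grepl(",", x, fixed = TRUE))
  }, logical(1))
  names(df)[has_comma]
}

# Same shape as the dfCommas example above; v2 is built as a factor explicitly
# so the result does not depend on the stringsAsFactors default.
dfCommas <- data.frame(
  v1 = c(2, 3, 5, 7, 2, 4, 5, 2),
  v2 = factor(c("a,a", "b,b", "c,c", "d,d", "a,a", "b,b", "a,a", "b,b")),
  v3 = factor(c(1, 0, 0, 0, 1, 0, 1, 1))
)

flagged <- columns_with_commas(dfCommas)
print(flagged)
```

On the example data this should flag only v2, the column that triggered the error.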
This looks like a limitation in the C5.0 C code. Other characters can be escaped, but in my testing an escaped comma is still not accepted inside the data values.
You might dummy up some application files to verify. If it doesn't work, I'd email RuleQuest and see if Quinlan can make a change.
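Until that is resolved, one possible workaround is the one from the original report: strip the commas before training. A minimal sketch, assuming it is acceptable to change the values; `replace_commas` and its `replacement` argument are hypothetical names, not part of C50.

```r
# Hypothetical helper (not part of C50): remove or replace commas in the
# values of character and factor columns of a data frame.
replace_commas <- function(df, replacement = "") {
  for (col in names(df)) {
    x <- df[[col]]
    if (is.factor(x)) {
      # Rewrite the level labels; levels that become identical are merged.
      levels(x) <- gsub(",", replacement, levels(x), fixed = TRUE)
    } else if (is.character(x)) {
      x <- gsub(",", replacement, x, fixed = TRUE)
    }
    df[[col]] <- x
  }
  df
}

# Same shape as the dfCommas example from the report.
dfCommas <- data.frame(
  v1 = c(2, 3, 5, 7, 2, 4, 5, 2),
  v2 = factor(c("a,a", "b,b", "c,c", "d,d", "a,a", "b,b", "a,a", "b,b")),
  v3 = factor(c(1, 0, 0, 0, 1, 0, 1, 1))
)

dfClean <- replace_commas(dfCommas)
print(levels(dfClean$v2))
```

Note that removing commas can collapse values that were distinct (for example "a,b" and "ab"), so it may be worth checking the number of levels before and after.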
Same problem here: I had no problem before, but after upgrading, commas in variables break the training process :-(
Will check escaping the commas and report back...
(EDITED):
Sorry, don't have time... I've downgraded with install_version("C50", version = "0.1.0-24", repos = "http://cran.us.r-project.org")
to get the old comma-tolerant functionality...
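A quick way to check whether an installed version is newer than the one that worked here might look like the sketch below. The only versions known from this thread are 0.1.0-24 (tolerated commas) and 0.1.1 (fails); treating everything newer than 0.1.0-24 as possibly affected is an assumption, and `may_be_affected` is a hypothetical helper name.

```r
# Versions taken from this thread: 0.1.0-24 tolerated commas, 0.1.1 did not.
# Whether any other release is affected is NOT known; this is an assumption.
last_tolerant <- package_version("0.1.0-24")

# Hypothetical helper: TRUE if the given version string is newer than the
# last version reported to tolerate commas.
may_be_affected <- function(version) {
  package_version(version) > last_tolerant
}

print(may_be_affected("0.1.1"))
print(may_be_affected("0.1.0-24"))

# With C50 installed, the check could be run against the local version:
# may_be_affected(as.character(utils::packageVersion("C50")))
```

The install_version() call above is the function provided by the devtools and remotes packages, so one of them needs to be loaded first.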