
dummyVars ignores sep argument for character variables

Open jakobludewig opened this issue 6 years ago • 2 comments

Hey,

I spotted a potential issue with dummyVars: when a character variable (instead of a factor variable) is included in the RHS of the formula, the sep argument is ignored for that variable (i.e. the separator between the variable name and its levels is not inserted).

It's a minor issue, but it took me some time to figure out the reason. Is this expected behaviour? If not, I'm happy to invest some time into finding a fix.

Cheers

Minimal, reproducible example:


library(earth)
library(tidyverse)
library(magrittr)
library(caret)

data(etitanic)

# this works fine, giving the correct separators between variable name and level
dummies <- dummyVars(survived ~ ., data = etitanic, sep = ".")
head(predict(dummies, newdata = etitanic))
#>   pclass.1st pclass.2nd pclass.3rd sex.female sex.male     age sibsp parch
#> 1          1          0          0          1        0 29.0000     0     0
#> 2          1          0          0          0        1  0.9167     1     2
#> 3          1          0          0          1        0  2.0000     1     2
#> 4          1          0          0          0        1 30.0000     1     2
#> 5          1          0          0          1        0 25.0000     1     2
#> 6          1          0          0          0        1 48.0000     0     0

# after converting a variable to character, dummyVars fails to insert the separator character
etitanic %<>% mutate(pclass = as.character(pclass))
dummies <- dummyVars(survived ~ ., data = etitanic, sep = ".")
head(predict(dummies, newdata = etitanic))
#>   pclass1st pclass2nd pclass3rd sex.female sex.male     age sibsp parch
#> 1         1         0         0          1        0 29.0000     0     0
#> 2         1         0         0          0        1  0.9167     1     2
#> 3         1         0         0          1        0  2.0000     1     2
#> 4         1         0         0          0        1 30.0000     1     2
#> 5         1         0         0          1        0 25.0000     1     2
#> 6         1         0         0          0        1 48.0000     0     0
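
Possible workaround (a minimal sketch, not a fix to caret itself): since the example above shows sep being honoured when pclass is a factor, converting character columns to factors before calling dummyVars should bring the separator back. The helper name chr_to_factor and the toy data frame are hypothetical and only serve as illustration; whether the same conversion also helps with other arguments has not been checked here.

```r
# Hypothetical helper (not part of caret): convert every character column
# of a data frame to a factor, leaving all other columns untouched.
chr_to_factor <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  for (nm in names(df)[is_chr]) {
    df[[nm]] <- factor(df[[nm]])
  }
  df
}

# Toy data with one character column, standing in for the modified etitanic.
toy <- data.frame(
  survived = c(1, 0, 1),
  pclass   = c("1st", "2nd", "3rd"),
  age      = c(29, 30, 25),
  stringsAsFactors = FALSE
)

toy_fct <- chr_to_factor(toy)
str(toy_fct)

# Only run the caret part if caret is installed.
if (requireNamespace("caret", quietly = TRUE)) {
  dummies <- caret::dummyVars(survived ~ ., data = toy_fct, sep = ".")
  # Inspect the column names to check whether the separator is present.
  print(colnames(predict(dummies, newdata = toy_fct)))
}
```

Note that factor() orders levels alphabetically by default, so the dummy column order may differ from a factor that was created with an explicit level order.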

Session Info:

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.4

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2.2     caret_6.0-81       lattice_0.20-35    magrittr_1.5       forcats_0.3.0      stringr_1.3.1     
 [7] dplyr_0.7.6        purrr_0.2.5        readr_1.1.1        tidyr_0.8.1        tibble_1.4.2       ggplot2_3.0.0     
[13] tidyverse_1.2.1    earth_4.7.0        plotmo_3.5.2       TeachingDemos_2.10 plotrix_3.7-4     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.18       lubridate_1.7.4    class_7.3-14       assertthat_0.2.0   ipred_0.9-7        foreach_1.4.4     
 [7] R6_2.2.2           cellranger_1.1.0   plyr_1.8.4         backports_1.1.2    stats4_3.5.1       httr_1.3.1        
[13] pillar_1.3.0       rlang_0.3.1        lazyeval_0.2.1     readxl_1.1.0       rstudioapi_0.7     data.table_1.11.4 
[19] rpart_4.1-13       Matrix_1.2-14      splines_3.5.1      gower_0.1.2        munsell_0.5.0      broom_0.5.0       
[25] compiler_3.5.1     modelr_0.1.2       pkgconfig_2.0.1    nnet_7.3-12        tidyselect_0.2.4   prodlim_2018.04.18
[31] codetools_0.2-15   crayon_1.3.4       withr_2.1.2        MASS_7.3-50        recipes_0.1.4      ModelMetrics_1.2.0
[37] grid_3.5.1         nlme_3.1-137       jsonlite_1.6       gtable_0.2.0       scales_1.0.0       cli_1.0.0         
[43] stringi_1.2.4      reshape2_1.4.3     timeDate_3043.102  xml2_1.2.0         generics_0.0.2     lava_1.6.3        
[49] iterators_1.0.10   tools_3.5.1        glue_1.3.0         hms_0.4.2          survival_2.42-3    yaml_2.2.0        
[55] colorspace_1.3-2   rvest_0.3.2        bindr_0.1.1        haven_1.1.2 

jakobludewig • Jan 22 '19 09:01

I'll try to add an MRE for this as well, but I'm seeing the same behavior for the fullRank and levelsOnly arguments, i.e. dummyVars seems to ignore them if the RHS has character columns. Thanks to OP for calling this behavior out!

jimtheflash • Mar 12 '19 19:03

This is still broken.

bdrhoa • Dec 12 '23 05:12