createDummyFeatures gives wrong and inconsistent names for factors with two levels

Open Sade154 opened this issue 4 years ago • 2 comments

Description

createDummyFeatures gives wrong and inconsistent column names when a factor has only two levels.

Reproducible example

d <- structure(list(
  a = structure(c(2L, 1L, 1L, 1L, 3L, 2L), .Label = c("1", "2", "3"), class = "factor"),
  b = structure(c(2L, 1L, 1L, 1L, 1L, 2L), .Label = c("1", "2"), class = "factor"),
  target = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("zero", "one"), class = "factor")),
  row.names = c(NA, -6L),
  class = "data.frame")

# The variable a becomes a.2 and a.3 when dummified, but the variable b becomes X2 (instead of the expected b.2).

mlr::createDummyFeatures(d, target = "target", method = "reference")

Sade154, Jul 06 '21 18:07

Thanks for reporting. This looks like a bug at first glance. I'm not sure when I'll have time to look into it.

You might want to try dummy encoding in {mlr3pipelines} and use {mlr3} in general. I'd think/hope that it works as expected there.

pat-s, Jul 07 '21 07:07

Thanks, I will try mlr3pipelines then. Regarding mlr, a small change in the function createDummyFeatures.data.frame fixed the issue for me.

# initial code in mlr
# if (method == "reference" && length(work.cols) == length(dummies)) {
#   colnames(dummies) = Map(function(col, pre) {
#     stri_paste(pre, tail(levels(col), -1), sep = ".")
#   }, obj[work.cols], prefix)
# }

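# changed version: the length check is dropped and the Map() result is
# unlisted, so colnames() receives a flat character vector
if (method == "reference") {
  colnames(dummies) = unlist(Map(function(col, pre) {
    # drop the first (reference) level and prefix the rest with the column name
    stri_paste(pre, tail(levels(col), -1), sep = ".")
  }, obj[work.cols], prefix))
}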
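
The naming logic of the changed version can be checked in isolation. Below is a minimal sketch in base R, not the actual mlr code: `reference_names` is a hypothetical helper, `paste` stands in for `stri_paste`, and the column names are assumed to play the role of `prefix`.

```r
# Hypothetical helper (not part of mlr): builds the column names that the
# "reference" method is expected to produce for the given factor columns.
reference_names <- function(data, cols) {
  unlist(Map(function(col, pre) {
    # drop the first (reference) level and prefix the rest with the column name
    paste(pre, tail(levels(col), -1), sep = ".")
  }, data[cols], cols), use.names = FALSE)
}

# Same factor columns as in the reproducible example
d <- data.frame(
  a = factor(c(2, 1, 1, 1, 3, 2)),  # three levels: "1", "2", "3"
  b = factor(c(2, 1, 1, 1, 1, 2))   # two levels: "1", "2"
)

reference_names(d, c("a", "b"))
# "a.2" "a.3" "b.2"
```

Without `unlist`, `Map` returns one list element per factor column, not one name per dummy column. Flattening the result gives one name for each dummy column.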

Sade154, Jul 07 '21 09:07