mlr
createDummyFeatures gives wrong and inconsistent names for factors with two levels
Description
createDummyFeatures gives wrong and inconsistent column names when a factor has exactly two levels.
Reproducible example
d <- structure(list(
a = structure(c(2L, 1L, 1L, 1L, 3L, 2L), .Label = c("1", "2", "3"), class = "factor"),
b = structure(c(2L, 1L, 1L, 1L, 1L, 2L), .Label = c("1", "2"), class = "factor"),
target = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("zero", "one"), class = "factor")),
row.names = c(NA, -6L),
class = "data.frame")
# the variable a becomes a.2 and a.3 when dummified but the variable b becomes X2 (instead of an expected b.2)
mlr::createDummyFeatures(d, target = "target", method = "reference")
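A possible explanation for the `X2` name, not confirmed against the mlr source: if the per-factor dummy matrices are combined through `data.frame()`, a multi-column matrix gets the argument name as a prefix, while a single named column keeps its own name and is then sanitized by `make.names()`. The sketch below uses only base R; the matrices `m.a` and `m.b` are hypothetical stand-ins for the dummy columns of `a` and `b`.

```r
# Hypothetical stand-ins for the per-factor dummy matrices
# (reference level already dropped, so a has 2 columns and b has 1).
m.a <- matrix(0, nrow = 6, ncol = 2, dimnames = list(NULL, c("2", "3")))
m.b <- matrix(0, nrow = 6, ncol = 1, dimnames = list(NULL, "2"))

# data.frame() prefixes the columns of a multi-column matrix with the
# argument name; a single named column keeps its own name, which
# check.names = TRUE then turns into a syntactically valid name.
res <- data.frame(a = m.a, b = m.b)
print(names(res))

# The sanitizing step on its own: "2" is not a valid R name.
print(make.names("2"))
```

If this is indeed the mechanism, the inconsistency would affect any factor that yields exactly one dummy column under method = "reference".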
Thanks for reporting. At first glance this looks like a bug. I'm not sure when I'll have time to look into it.
You might want to try dummy encoding in {mlr3pipelines} and use {mlr3} in general; I'd expect it to work as intended there.
Thanks, I will try mlr3pipelines then.
Regarding mlr, a small change in the function createDummyFeatures.data.frame fixed the issue for me.
# initial code in mlr
# if (method == "reference" && length(work.cols) == length(dummies)) {
#   colnames(dummies) = Map(function(col, pre) {
#     stri_paste(pre, tail(levels(col), -1), sep = ".")
#   }, obj[work.cols], prefix)
# }

# changed version
if (method == "reference") {
  colnames(dummies) = unlist(Map(function(col, pre) {
    stri_paste(pre, tail(levels(col), -1), sep = ".")
  }, obj[work.cols], prefix))
}
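The naming logic of the changed version can be tried in isolation. This is a minimal sketch, not the mlr implementation: it swaps `stri_paste` for base `paste` so it runs without {stringi}, and it assumes `prefix` is simply the vector of factor column names (the variables `d2`, `work.cols`, `prefix` and `new.names` are defined here for illustration only).

```r
# Same factor columns as in the reproducible example, built directly.
d2 <- data.frame(
  a = factor(c("2", "1", "1", "1", "3", "2")),
  b = factor(c("2", "1", "1", "1", "1", "2"))
)

work.cols <- c("a", "b")
prefix <- work.cols  # assumption: the prefix is the column name itself

# For each factor, drop the first (reference) level and prepend the
# prefix; unlist() flattens the per-factor results into one vector.
new.names <- unlist(Map(function(col, pre) {
  paste(pre, tail(levels(col), -1), sep = ".")
}, d2[work.cols], prefix), use.names = FALSE)

print(new.names)
# [1] "a.2" "a.3" "b.2"

# The original guard compared the number of factor columns with the
# number of dummy columns; here those are 2 and 3, so they differ.
print(length(work.cols) == length(new.names))
```

The last line hints at why the original condition may have skipped the renaming: the two lengths only match when every factor contributes exactly one dummy column.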