mlr3pipelines
mlr3pipelines copied to clipboard
Create custom PipeOp - winsorization
Hi,
I am trying to develop my first custom PipeOp and kindly ask for feedback on my try.
I want to implement a winsorization process on train / test data set.
Here is my try:
library(mlr3pipelines)
library(mlr3verse)
library(mlr3misc)
library(R6)
PipeOpWinsorize = R6::R6Class(
"PipeOpWinsorize",
inherit = mlr3pipelines::PipeOpTaskPreproc,
public = list(
initialize = function(id = "scale.always.simple", param_vals = list()) {
ps = ParamSet$new(list(
ParamDbl$new("probs_low", lower = 0, upper = 1, tags = c("winsorize_tag")),
ParamDbl$new("probs_high", lower = 0, upper = 1, tags = c("winsorize_tag")),
ParamDbl$new("minval", special_vals = list(NULL), tags = c("winsorize_tag")),
ParamDbl$new("maxval", special_vals = list(NULL), tags = c("winsorize_tag")),
ParamLgl$new("na.rm", tags = c("winsorize_tag")),
ParamDbl$new("type", lower = 1, upper = 9, default = 7, tags = c("winsorize_tag"))
))
ps$values = list(probs_low = 0.05, probs_high = 0.95, minval = NULL, maxval = NULL, na.rm = FALSE, type = 7)
super$initialize(id, param_set = ps, param_vals = param_vals)
}
),
private = list(
.train_dt = function(dt, levels, target) {
self$state <- list()
pv = self$param_set$get_values(tags = "winsorize_tag")
wnsr = lapply(dt, winsorize)
wnsr
},
.predict_dt = function(dt, levels) {
dt
}
)
)
# This function is minor modification of Winsorize function from DescTools package
winsorize <- function (x, minval = NULL, maxval = NULL, probs_low = 0.05, probs_high = 0.95, na.rm = FALSE, type = 7) {
if (is.null(minval) || is.null(maxval)) {
xq <- quantile(x = x, probs = c(probs_low, probs_high), na.rm = na.rm, type = type)
if (is.null(minval))
minval <- xq[1L]
if (is.null(maxval))
maxval <- xq[2L]
}
x[x < minval] <- minval
x[x > maxval] <- maxval
return(x)
}
# try pipe
task = tsk("iris")
task$head()
task$data()[, max(Petal.Length)] # 6.9
gr = Graph$new()
gr$add_pipeop(PipeOpWinsorize$new())
gr$plot()
result = gr$train(task)
result[[1]]$data()
result[[1]]$data()[, max(Petal.Length)] # 6.1
I would like to know:
- Is this a good approach to develop this kind of PipeOps?
- Does this approach apply winsorize on whole data at one or it applies winsorize on train and test set separately?
- Should I make PR for this function? If yes, where do you have guidelines for PR's?
Best,
Mislav
Hi @MislavSag
Sorry for the silence, a few comments and perhaps answers:
- In general yes!
- The
$train
function decides what happens with the training data. In order to apply to test data, you would need to also applywinsorize
in$predict
. But in order to do this, you would need to store the quantiles from training data and apply those during test. Integratingwinsorize
as a member function, therefore, seems sensible. - In general yes, in practice it depends how often we think the feature will be used. We are currently focussing a little more on stability and speedups of the core package and less on new features. I think we need to think about new
PipeOps
like the ones proposed and how we can maintain them. https://github.com/mlr-org/mlr3#contributing-to-mlr3
After long time I have started to develope in mlr3 again and I neew to use winsorization again :)
Here is my current try:
library(mlr3pipelines)
library(mlr3verse)
library(mlr3misc)
library(R6)
library(paradox)
PipeOpWinsorize = R6::R6Class(
"PipeOpWinsorize",
inherit = mlr3pipelines::PipeOpTaskPreproc,
public = list(
groups = NULL,
initialize = function(id = "winsorization", param_vals = list()) {
ps = ParamSet$new(list(
ParamDbl$new("probs_low", lower = 0, upper = 1, tags = c("winsorize_tag")),
ParamDbl$new("probs_high", lower = 0, upper = 1, tags = c("winsorize_tag")),
ParamLgl$new("na.rm", tags = c("winsorize_tag")),
ParamDbl$new("type", lower = 1, upper = 9, default = 7, tags = c("winsorize_tag"))
))
ps$values = list(probs_low = 0.05, probs_high = 0.95, na.rm = FALSE, type = 7)
super$initialize(id, param_set = ps, param_vals = param_vals, feature_types = c("numeric", "integer"))
}
),
private = list(
.select_cols = function(task) {
self$groups = task$groups
task$feature_names
},
.train_dt = function(dt, levels, target) {
# params
pv = self$param_set$get_values(tags = "winsorize_tag")
# state variables
if (!(is.null(self$groups))) {
row_ids = self$groups[group == self$groups[nrow(self$groups), group], row_id]
q = dt[row_ids, lapply(.SD,
quantile,
probs = c(pv$probs_low, pv$probs_high),
na.rm = pv$na.rm,
type = pv$type)]
} else {
q = dt[, lapply(.SD,
quantile,
probs = c(pv$probs_low, pv$probs_high),
na.rm = pv$na.rm,
type = pv$type)]
}
self$state = list(
minvals = q[1],
maxvals = q[2]
)
# dt object train
if (!(is.null(self$groups))) {
dt = dt[, lapply(.SD, function(x){
q = quantile(x,
probs = c(pv$probs_low, pv$probs_high),
na.rm = pv$na.rm,
type = pv$type)
minval = q[1L]
maxval = q[2L]
x[x < minval] <- minval
x[x > maxval] <- maxval
x
}), by = self$groups[, group]]
dt = dt[, -1]
} else {
dt = dt[, lapply(.SD, function(x){
q = quantile(x,
probs = c(pv$probs_low, pv$probs_high),
na.rm = pv$na.rm,
type = pv$type)
minval = q[1L]
maxval = q[2L]
x[x < minval] <- minval
x[x > maxval] <- maxval
x
})]
}
dt
},
.predict_dt = function(dt, levels) {
dt = dt[, Map(function(a, b) ifelse(a < b, b, a), .SD, self$state$minvals)]
dt = dt[, Map(function(a, b) ifelse(a > b, b, a), .SD, self$state$maxvals)]
dt
}
)
)
library("mlr3")
# no group variable
task = tsk("iris")
gr = Graph$new()
gr$add_pipeop(PipeOpWinsorize$new(param_vals = list(probs_low = 0.1, probs_high = 0.9, na.rm = TRUE)))
result = gr$train(task)
result[[1]]$data()
predres = gr$predict(task)
# group variable
dt = tsk("iris")$data()
dt[, monthid := c(rep(1, 50), rep(2, 50), rep(3, 50))]
task = as_task_classif(dt, target = "Species")
task$set_col_roles("monthid", "group")
gr = Graph$new()
gr$add_pipeop(PipeOpWinsorize$new(param_vals = list(probs_low = 0.1, probs_high = 0.9, na.rm = TRUE)))
result = gr$train(task)
result[[1]]$data()
I kindly as k for your feedback. Would you use similar apprach?
I am trying to apply winsorization on tables with a time dimensions. In this, TS setting, we ussually want to apply winsorization on cross section basis. In my case across months. I couldn't figure out a better approach than using the groups filed. I am not sure if that would mess integration with other pipes.