mlr3pipelines Create custom PipeOp

Hi,

I am trying to develop my first custom PipeOp and kindly ask for feedback on my try.

I want to implement a winsorization process on train / test data set.

Here is my try:

library(mlr3pipelines)
library(mlr3verse)
library(mlr3misc)
library(R6)

PipeOpWinsorize = R6::R6Class(
  "PipeOpWinsorize",
  inherit = mlr3pipelines::PipeOpTaskPreproc,
  public = list(
    initialize = function(id = "scale.always.simple", param_vals = list()) {
      ps = ParamSet$new(list(
        ParamDbl$new("probs_low", lower = 0, upper = 1, tags = c("winsorize_tag")),
        ParamDbl$new("probs_high", lower = 0, upper = 1, tags = c("winsorize_tag")),
        ParamDbl$new("minval", special_vals = list(NULL), tags = c("winsorize_tag")),
        ParamDbl$new("maxval", special_vals = list(NULL), tags = c("winsorize_tag")),
        ParamLgl$new("na.rm", tags = c("winsorize_tag")),
        ParamDbl$new("type", lower = 1, upper = 9, default = 7, tags = c("winsorize_tag"))
      ))
      ps$values = list(probs_low = 0.05, probs_high = 0.95, minval = NULL, maxval = NULL, na.rm = FALSE, type = 7)
      super$initialize(id, param_set = ps, param_vals = param_vals)
    }
  ),
  
  private = list(
    
    .train_dt = function(dt, levels, target) {
      self$state <- list()
      pv = self$param_set$get_values(tags = "winsorize_tag")
      wnsr = lapply(dt, winsorize)
      wnsr
    },
    
    .predict_dt = function(dt, levels) {
      dt
      }
  )
)


# This function is minor modification of Winsorize function from DescTools package
winsorize <- function (x, minval = NULL, maxval = NULL, probs_low = 0.05, probs_high = 0.95, na.rm = FALSE, type = 7) {
  if (is.null(minval) || is.null(maxval)) {
    xq <- quantile(x = x, probs = c(probs_low, probs_high), na.rm = na.rm, type = type)
    if (is.null(minval)) 
      minval <- xq[1L]
    if (is.null(maxval)) 
      maxval <- xq[2L]
  }
  x[x < minval] <- minval
  x[x > maxval] <- maxval
  return(x)
}


# try pipe
task = tsk("iris")
task$head()
task$data()[, max(Petal.Length)] # 6.9
gr = Graph$new()
gr$add_pipeop(PipeOpWinsorize$new())
gr$plot()
result = gr$train(task)
result[[1]]$data()
result[[1]]$data()[, max(Petal.Length)] # 6.1

I would like to know:

Is this a good approach to develop this kind of PipeOps?
Does this approach apply winsorize on whole data at one or it applies winsorize on train and test set separately?
Should I make PR for this function? If yes, where do you have guidelines for PR's?

Best,

Mislav

May 31 '21 12:05 MislavSag

Hi @MislavSag

Sorry for the silence, a few comments and perhaps answers:

In general yes!
The $train function decides what happens with the training data. In order to apply to test data, you would need to also apply winsorize in $predict. But in order to do this, you would need to store the quantiles from training data and apply those during test. Integrating winsorize as a member function, therefore, seems sensible.
In general yes, in practice it depends how often we think the feature will be used. We are currently focussing a little more on stability and speedups of the core package and less on new features. I think we need to think about new PipeOps like the ones proposed and how we can maintain them. https://github.com/mlr-org/mlr3#contributing-to-mlr3

Jul 14 '21 10:07 pfistfl

After long time I have started to develope in mlr3 again and I neew to use winsorization again :)

Here is my current try:

library(mlr3pipelines)
library(mlr3verse)
library(mlr3misc)
library(R6)
library(paradox)

PipeOpWinsorize = R6::R6Class(
  "PipeOpWinsorize",
  inherit = mlr3pipelines::PipeOpTaskPreproc,
  public = list(
    groups = NULL,
    initialize = function(id = "winsorization", param_vals = list()) {
      ps = ParamSet$new(list(
        ParamDbl$new("probs_low", lower = 0, upper = 1, tags = c("winsorize_tag")),
        ParamDbl$new("probs_high", lower = 0, upper = 1, tags = c("winsorize_tag")),
        ParamLgl$new("na.rm", tags = c("winsorize_tag")),
        ParamDbl$new("type", lower = 1, upper = 9, default = 7, tags = c("winsorize_tag"))
      ))
      ps$values = list(probs_low = 0.05, probs_high = 0.95, na.rm = FALSE, type = 7)
      super$initialize(id, param_set = ps, param_vals = param_vals, feature_types = c("numeric", "integer"))
    }
  ),

  private = list(

    .select_cols = function(task) {
      self$groups = task$groups
      task$feature_names
    },

    .train_dt = function(dt, levels, target) {
      # params
      pv = self$param_set$get_values(tags = "winsorize_tag")

      # state variables
      if (!(is.null(self$groups))) {
        row_ids  = self$groups[group == self$groups[nrow(self$groups), group], row_id]
        q = dt[row_ids, lapply(.SD,
                               quantile,
                               probs = c(pv$probs_low, pv$probs_high),
                               na.rm = pv$na.rm,
                               type = pv$type)]
      } else {
        q = dt[, lapply(.SD,
                        quantile,
                        probs = c(pv$probs_low, pv$probs_high),
                        na.rm = pv$na.rm,
                        type = pv$type)]
      }
      self$state = list(
        minvals = q[1],
        maxvals = q[2]
      )

      # dt object train
      if (!(is.null(self$groups))) {
        dt = dt[, lapply(.SD, function(x){
          q = quantile(x,
                       probs = c(pv$probs_low, pv$probs_high),
                       na.rm = pv$na.rm,
                       type = pv$type)
          minval = q[1L]
          maxval = q[2L]
          x[x < minval] <- minval
          x[x > maxval] <- maxval
          x
        }), by = self$groups[, group]]
        dt = dt[, -1]
      } else {
        dt = dt[, lapply(.SD, function(x){
          q = quantile(x,
                       probs = c(pv$probs_low, pv$probs_high),
                       na.rm = pv$na.rm,
                       type = pv$type)
          minval = q[1L]
          maxval = q[2L]
          x[x < minval] <- minval
          x[x > maxval] <- maxval
          x
        })]
      }
      dt
    },

    .predict_dt = function(dt, levels) {
      dt = dt[, Map(function(a, b) ifelse(a < b, b, a), .SD, self$state$minvals)]
      dt = dt[, Map(function(a, b) ifelse(a > b, b, a), .SD, self$state$maxvals)]
      dt
    }
  )
)


library("mlr3")

# no group variable
task = tsk("iris")
gr = Graph$new()
gr$add_pipeop(PipeOpWinsorize$new(param_vals = list(probs_low = 0.1, probs_high = 0.9, na.rm = TRUE)))
result = gr$train(task)
result[[1]]$data()
predres = gr$predict(task)

# group variable
dt = tsk("iris")$data()
dt[, monthid := c(rep(1, 50), rep(2, 50), rep(3, 50))]
task = as_task_classif(dt, target = "Species")
task$set_col_roles("monthid", "group")
gr = Graph$new()
gr$add_pipeop(PipeOpWinsorize$new(param_vals = list(probs_low = 0.1, probs_high = 0.9, na.rm = TRUE)))
result = gr$train(task)
result[[1]]$data()

I kindly as k for your feedback. Would you use similar apprach?

I am trying to apply winsorization on tables with a time dimensions. In this, TS setting, we ussually want to apply winsorization on cross section basis. In my case across months. I couldn't figure out a better approach than using the groups filed. I am not sure if that would mess integration with other pipes.

Feb 02 '23 14:02 MislavSag

mlr3pipelines
mlr3pipelines copied to clipboard

Create custom PipeOp - winsorization

mlr3pipelines mlr3pipelines copied to clipboard

Create custom PipeOp - winsorization

mlr3pipelines
mlr3pipelines copied to clipboard