mlr3pipelines icon indicating copy to clipboard operation
mlr3pipelines copied to clipboard

Create custom PipeOp - winsorization

Open MislavSag opened this issue 3 years ago • 2 comments

Hi,

I am trying to develop my first custom PipeOp and kindly ask for feedback on my try.

I want to implement a winsorization process on train / test data set.

Here is my try:

library(mlr3pipelines)
library(mlr3verse)
library(mlr3misc)
library(R6)

PipeOpWinsorize = R6::R6Class(
  "PipeOpWinsorize",
  inherit = mlr3pipelines::PipeOpTaskPreproc,
  public = list(
    initialize = function(id = "scale.always.simple", param_vals = list()) {
      ps = ParamSet$new(list(
        ParamDbl$new("probs_low", lower = 0, upper = 1, tags = c("winsorize_tag")),
        ParamDbl$new("probs_high", lower = 0, upper = 1, tags = c("winsorize_tag")),
        ParamDbl$new("minval", special_vals = list(NULL), tags = c("winsorize_tag")),
        ParamDbl$new("maxval", special_vals = list(NULL), tags = c("winsorize_tag")),
        ParamLgl$new("na.rm", tags = c("winsorize_tag")),
        ParamDbl$new("type", lower = 1, upper = 9, default = 7, tags = c("winsorize_tag"))
      ))
      ps$values = list(probs_low = 0.05, probs_high = 0.95, minval = NULL, maxval = NULL, na.rm = FALSE, type = 7)
      super$initialize(id, param_set = ps, param_vals = param_vals)
    }
  ),
  
  private = list(
    
    .train_dt = function(dt, levels, target) {
      self$state <- list()
      pv = self$param_set$get_values(tags = "winsorize_tag")
      wnsr = lapply(dt, winsorize)
      wnsr
    },
    
    .predict_dt = function(dt, levels) {
      dt
      }
  )
)


# This function is minor modification of Winsorize function from DescTools package
winsorize <- function (x, minval = NULL, maxval = NULL, probs_low = 0.05, probs_high = 0.95, na.rm = FALSE, type = 7) {
  if (is.null(minval) || is.null(maxval)) {
    xq <- quantile(x = x, probs = c(probs_low, probs_high), na.rm = na.rm, type = type)
    if (is.null(minval)) 
      minval <- xq[1L]
    if (is.null(maxval)) 
      maxval <- xq[2L]
  }
  x[x < minval] <- minval
  x[x > maxval] <- maxval
  return(x)
}


# try pipe
task = tsk("iris")
task$head()
task$data()[, max(Petal.Length)] # 6.9
gr = Graph$new()
gr$add_pipeop(PipeOpWinsorize$new())
gr$plot()
result = gr$train(task)
result[[1]]$data()
result[[1]]$data()[, max(Petal.Length)] # 6.1

I would like to know:

  1. Is this a good approach to develop this kind of PipeOps?
  2. Does this approach apply winsorize on whole data at one or it applies winsorize on train and test set separately?
  3. Should I make PR for this function? If yes, where do you have guidelines for PR's?

Best,

Mislav

MislavSag avatar May 31 '21 12:05 MislavSag

Hi @MislavSag

Sorry for the silence, a few comments and perhaps answers:

  1. In general yes!
  2. The $train function decides what happens with the training data. In order to apply to test data, you would need to also apply winsorize in $predict. But in order to do this, you would need to store the quantiles from training data and apply those during test. Integrating winsorize as a member function, therefore, seems sensible.
  3. In general yes, in practice it depends how often we think the feature will be used. We are currently focussing a little more on stability and speedups of the core package and less on new features. I think we need to think about new PipeOps like the ones proposed and how we can maintain them. https://github.com/mlr-org/mlr3#contributing-to-mlr3

pfistfl avatar Jul 14 '21 10:07 pfistfl

After long time I have started to develope in mlr3 again and I neew to use winsorization again :)

Here is my current try:

library(mlr3pipelines)
library(mlr3verse)
library(mlr3misc)
library(R6)
library(paradox)

PipeOpWinsorize = R6::R6Class(
  "PipeOpWinsorize",
  inherit = mlr3pipelines::PipeOpTaskPreproc,
  public = list(
    groups = NULL,
    initialize = function(id = "winsorization", param_vals = list()) {
      ps = ParamSet$new(list(
        ParamDbl$new("probs_low", lower = 0, upper = 1, tags = c("winsorize_tag")),
        ParamDbl$new("probs_high", lower = 0, upper = 1, tags = c("winsorize_tag")),
        ParamLgl$new("na.rm", tags = c("winsorize_tag")),
        ParamDbl$new("type", lower = 1, upper = 9, default = 7, tags = c("winsorize_tag"))
      ))
      ps$values = list(probs_low = 0.05, probs_high = 0.95, na.rm = FALSE, type = 7)
      super$initialize(id, param_set = ps, param_vals = param_vals, feature_types = c("numeric", "integer"))
    }
  ),

  private = list(

    .select_cols = function(task) {
      self$groups = task$groups
      task$feature_names
    },

    .train_dt = function(dt, levels, target) {
      # params
      pv = self$param_set$get_values(tags = "winsorize_tag")

      # state variables
      if (!(is.null(self$groups))) {
        row_ids  = self$groups[group == self$groups[nrow(self$groups), group], row_id]
        q = dt[row_ids, lapply(.SD,
                               quantile,
                               probs = c(pv$probs_low, pv$probs_high),
                               na.rm = pv$na.rm,
                               type = pv$type)]
      } else {
        q = dt[, lapply(.SD,
                        quantile,
                        probs = c(pv$probs_low, pv$probs_high),
                        na.rm = pv$na.rm,
                        type = pv$type)]
      }
      self$state = list(
        minvals = q[1],
        maxvals = q[2]
      )

      # dt object train
      if (!(is.null(self$groups))) {
        dt = dt[, lapply(.SD, function(x){
          q = quantile(x,
                       probs = c(pv$probs_low, pv$probs_high),
                       na.rm = pv$na.rm,
                       type = pv$type)
          minval = q[1L]
          maxval = q[2L]
          x[x < minval] <- minval
          x[x > maxval] <- maxval
          x
        }), by = self$groups[, group]]
        dt = dt[, -1]
      } else {
        dt = dt[, lapply(.SD, function(x){
          q = quantile(x,
                       probs = c(pv$probs_low, pv$probs_high),
                       na.rm = pv$na.rm,
                       type = pv$type)
          minval = q[1L]
          maxval = q[2L]
          x[x < minval] <- minval
          x[x > maxval] <- maxval
          x
        })]
      }
      dt
    },

    .predict_dt = function(dt, levels) {
      dt = dt[, Map(function(a, b) ifelse(a < b, b, a), .SD, self$state$minvals)]
      dt = dt[, Map(function(a, b) ifelse(a > b, b, a), .SD, self$state$maxvals)]
      dt
    }
  )
)


library("mlr3")

# no group variable
task = tsk("iris")
gr = Graph$new()
gr$add_pipeop(PipeOpWinsorize$new(param_vals = list(probs_low = 0.1, probs_high = 0.9, na.rm = TRUE)))
result = gr$train(task)
result[[1]]$data()
predres = gr$predict(task)

# group variable
dt = tsk("iris")$data()
dt[, monthid := c(rep(1, 50), rep(2, 50), rep(3, 50))]
task = as_task_classif(dt, target = "Species")
task$set_col_roles("monthid", "group")
gr = Graph$new()
gr$add_pipeop(PipeOpWinsorize$new(param_vals = list(probs_low = 0.1, probs_high = 0.9, na.rm = TRUE)))
result = gr$train(task)
result[[1]]$data()


I kindly as k for your feedback. Would you use similar apprach?

I am trying to apply winsorization on tables with a time dimensions. In this, TS setting, we ussually want to apply winsorization on cross section basis. In my case across months. I couldn't figure out a better approach than using the groups filed. I am not sure if that would mess integration with other pipes.

MislavSag avatar Feb 02 '23 14:02 MislavSag