mlr3pipelines

Can we create a PipeOpDropCollinear?

Open mb706 opened this issue 3 years ago • 2 comments

Somehow automatically recognize when a column is close to collinear with another column and drop it; this could be useful for linear models.

mb706 avatar Mar 17 '21 00:03 mb706

If we want to base this on the variance inflation factor (VIF), we could probably integrate it as a filter, i.e., score each feature by its negative VIF:

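library(mlr3)
library(mlr3pipelines)
library(mlr3filters)

task = tsk("mtcars")

# "vic" is the filter key used in this example; it is assumed to refer to a
# prototype VIF-based filter that scores features by their negative VIF
filter = flt("vic")
# keep only features whose score clears the cutoff of -10 (roughly VIF below 10)
g = po("filter", filter, filter.cutoff = -10) %>>% lrn("regr.lm")
l = lrn("regr.lm")

# compare the filtered pipeline against plain lm using cross-validation
bg = benchmark_grid(task, list(g, l), rsmp("cv"))
b = benchmark(bg)
b$aggregate()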
   nr      resample_result task_id  learner_id resampling_id iters  regr.mse
1:  1 <ResampleResult[21]>  mtcars vic.regr.lm            cv    10  9.091071
2:  2 <ResampleResult[21]>  mtcars     regr.lm            cv    10 13.299961
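
For reference, here is a minimal base-R sketch of how a negative-VIF score could be computed. It assumes the textbook definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all other features. The helper name `compute_vif` is hypothetical and this is not necessarily how the filter used above is implemented.

```r
# Hypothetical helper: variance inflation factor of every feature.
# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing feature j on the others.
compute_vif = function(features) {
  features = as.data.frame(features)
  sapply(names(features), function(name) {
    others = setdiff(names(features), name)
    fit = lm(reformulate(others, response = name), data = features)
    1 / (1 - summary(fit)$r.squared)
  })
}

# mtcars features without the regression target "mpg"
features = mtcars[, setdiff(names(mtcars), "mpg")]
vif = compute_vif(features)

# Filter scores: higher is better, so use the negative VIF
score = -vif
# Features that would pass a cutoff of -10
selected = names(score)[score >= -10]
print(sort(score, decreasing = TRUE))
print(selected)
```

In this sketch only the feature columns enter the computation; the target column is removed before the VIFs are calculated.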

sumny avatar Mar 17 '21 15:03 sumny

Making this available as a filter probably makes sense. I don't know VIF, but it looks like it depends on the task target, while there should be something useful even without taking the target into consideration. An example use case is a PipeOpLearnerCV that outputs probabilities, where one probability column is often just 1 - sum(other probabilities); this leads to warnings if it is the input to a simple linear model. The filter would probably go from left to right through the task features and measure how collinear each one is with the features already seen. The filter value would then be similar to a tolerance, so that slightly non-collinear features are also excluded (these could still lead to instability in some models).
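
The left-to-right idea could be sketched in base R roughly as follows. This is only an illustration under assumptions: the helper name `drop_collinear`, the `tolerance` argument and the use of the unexplained variance share (1 - R^2) as the filter value are all made up for this sketch and are not part of mlr3pipelines.

```r
# Hypothetical helper: walk through the columns from left to right and keep a
# column only if it is not (nearly) a linear combination of the intercept and
# the columns kept so far. The target is never used.
drop_collinear = function(features, tolerance = 1e-8) {
  features = as.data.frame(features)
  n = nrow(features)
  kept = character(0)
  for (name in names(features)) {
    x = features[[name]]
    total = sum((x - mean(x))^2)
    if (total == 0) next  # constant column: collinear with the intercept
    # Residuals after projecting x onto the intercept and the kept columns
    design = cbind(intercept = rep(1, n), as.matrix(features[kept]))
    residual = lm.fit(design, x)$residuals
    # Share of variance not explained by the kept columns, i.e., 1 - R^2
    unexplained = sum(residual^2) / total
    if (unexplained > tolerance) kept = c(kept, name)
  }
  kept
}

# Probability-like columns where the last one is 1 - sum(other probabilities)
set.seed(1)
p1 = runif(50, 0, 0.5)
p2 = runif(50, 0, 0.5)
probs = data.frame(a = p1, b = p2, c = 1 - p1 - p2)
kept = drop_collinear(probs)
print(kept)
```

In this example the column `c` is dropped because it is an exact linear combination of `a`, `b` and the intercept. Raising `tolerance` would also drop columns that are only approximately collinear.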

mb706 avatar Mar 17 '21 16:03 mb706