mlr3pipelines icon indicating copy to clipboard operation
mlr3pipelines copied to clipboard

PipeOpFilterRows

Open sumny opened this issue 5 years ago • 16 comments

PipeOpFilterRows allows to filter the rows of the data of a task according to a filter parameter and adds an easy way for removing rows with missing values, related to #301 ~~PipeOpPredictionUnion allows to combine predictions of the same task_type andpredict_type into a new Prediction. This maybe needs some discussion as combining predictions is imo not that straightforward.~~ dropped this from the PR

sumny avatar Apr 21 '20 21:04 sumny

Codecov Report

Merging #410 into master will increase coverage by 0.22%. The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #410      +/-   ##
==========================================
+ Coverage   93.82%   94.05%   +0.22%     
==========================================
  Files          68       70       +2     
  Lines        2025     2103      +78     
==========================================
+ Hits         1900     1978      +78     
  Misses        125      125              
Impacted Files Coverage Δ
R/PipeOpFilterRows.R 100.00% <100.00%> (ø)
R/PipeOpPredictionUnion.R 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 5579e15...92c9fa7. Read the comment docs.

codecov-io avatar Apr 21 '20 21:04 codecov-io

Codecov Report

Merging #410 into master will increase coverage by 0.23%. The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #410      +/-   ##
==========================================
+ Coverage   93.30%   93.54%   +0.23%     
==========================================
  Files          70       72       +2     
  Lines        2167     2245      +78     
==========================================
+ Hits         2022     2100      +78     
  Misses        145      145              
Impacted Files Coverage Δ
R/PipeOpFilterRows.R 100.00% <100.00%> (ø)
R/PipeOpPredictionUnion.R 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 141f60e...f7798ad. Read the comment docs.

codecov-commenter avatar Jun 07 '20 21:06 codecov-commenter

I'm not sure how mlr3 currently handles / should handle Prediction with missing rows. We may also want to use classifavg / regavg for unioning because they would support multiple predictions of the same ID.

mb706 avatar Jun 21 '20 23:06 mb706

Also the need to filter rows during training (e.g. based on NA or some other property that a learner cannot handle) may be very different from this splitting up predictions between learners thing?

mb706 avatar Jun 21 '20 23:06 mb706

Cleaned this up. Currently has two parameters:

  • filter_formula (formula with only a rhs evaluating to TRUE or FALSE within the data of the Task indexing rows to keep)
  • na_selector (Selector function selecting features that should be checked for missing values results in rows having missing values w.r.t to these features being dropped)

I agree with @mb706 that this is somewhat redundant, because:

po1 = po("filterrows", param_vals = list(filter_formula = ~ !is.na(insulin)))
po2 = po("filterrows", param_vals = list(na_selector = selector_name("insulin")))

results in virtually the same operation being done. However, what if I have a huge number of features and want to e.g. remove missing values w.r.t to factorial features? The approach via filter_formula = ~ !is.na(feature1) & !is.na(feature2) & ... seems quite cumbersome compared to na_selector = selector_type("factor")? I guess you could somewhat auto-generate the formula in this case but I am not sure if this is really nicer..

Maybe we should just split this up into PipeOpFilterRows and PipeOpRemoveMissings to not confuse users and state in PipeOpFilterRows that PipeOpRemoveMissings provides a cleaner interface for missing value removal/complete case analysis.

sumny avatar Oct 10 '20 15:10 sumny

That argument for 'nacols" would also apply to other conditions to filter by, e.g. trying to filter if any of a number of features is 0 or negative. We could go the data.table row and

  1. Introduce the .SD variable that is just a data.table of the dataset
  2. introduce an SDcols hyperparameter that selects the features that should be present in .SD

So then dropping rows where any feature is NA would go down to

SDcols = selector_grep("^nonNA\\.")
filter_formula = ~ !apply(is.na(.SD), 1, any)

mb706 avatar Oct 12 '20 15:10 mb706

So this is supposed to filter out rows during predict time, right? Is there a way to keep track of what sample is being predicted, once a learner sees the incomplete prediction dataset? So if my input task has two rows, one is dropped, my Graph returns one prediction, which sample does the prediction belong to?

Suppose that is not a problem, what is the current state of mlr3 handling missing predictions? ignores them? uses a fallback learner? just crashes? In the latter case we whould modify GraphLearner to do something sensible with missing predictions.

Also, maybe this PipeOp have the option to select on what phase samples are filtered? E.g. "train", "predict", "always".

mb706 avatar Oct 12 '20 16:10 mb706

That argument for 'nacols" would also apply to other conditions to filter by, e.g. trying to filter if any of a number of features is 0 or negative. We could go the data.table row and

1. Introduce the `.SD` variable that is just a `data.table` of the dataset

2. introduce an `SDcols` hyperparameter that selects the features that should be present in `.SD`

So then dropping rows where any feature is NA would go down to

SDcols = selector_grep("^nonNA\\.")
filter_formula = ~ !apply(is.na(.SD), 1, any)

I like the idea of using SDcols. I included this. So filtering is now only specified using the filter_formula which can rely on the .SD variable because it is evaluated within the frame of the data.table backend of the Task.

sumny avatar Oct 20 '20 15:10 sumny

So this is supposed to filter out rows during predict time, right? Is there a way to keep track of what sample is being predicted, once a learner sees the incomplete prediction dataset? So if my input task has two rows, one is dropped, my Graph returns one prediction, which sample does the prediction belong to?

Suppose that is not a problem, what is the current state of mlr3 handling missing predictions? ignores them? uses a fallback learner? just crashes? In the latter case we whould modify GraphLearner to do something sensible with missing predictions.

Also, maybe this PipeOp have the option to select on what phase samples are filtered? E.g. "train", "predict", "always".

I think we are probably mixing up different use cases. I believe what @pfistfl wanted to be possible in #301 is to simple split your task based on some logic both during training and prediction (with the row ids of the splits being disjunct and their union being equal to the row ids of the complete task). In this case using PipeOpFilterRows would look somewhat like this:

task = tsk("pima")
gr1 = po("filterrows", param_vals = list(filter_formula = ~ age < 31)) %>>% lrn("classif.rpart")
gr2 = po("filterrows", param_vals = list(filter_formula = ~ age >= 31)) %>>% lrn("classif.rpart")
gr1$train(task)
gr2$train(task)
p1 = gr1$predict(task)[[1L]]  # correct row_ids are pertained 
p2 = gr2$predict(task)[[1L]]
setequal(union(p1$row_ids, p2$row_ids), task$row_ids)

Then we would only need another PipeOp that would do nothing during training and combine the predictions during prediction, i.e., simply wrapping c so we could have something like copy %>>% list(filterrows1, filterrows2) %>>% unite and the prediction output would be:

c(p1, p2)
<PredictionClassif> for 768 observations:
    row_id truth response
         4   neg      neg
         6   neg      neg
         7   pos      neg
---                      
       763   neg      neg
       764   neg      neg
       767   pos      pos

sumny avatar Oct 20 '20 15:10 sumny

Codecov Report

Merging #410 (f7798ad) into master (fc1e85a) will decrease coverage by 0.92%. The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #410      +/-   ##
==========================================
- Coverage   94.46%   93.54%   -0.93%     
==========================================
  Files          77       72       -5     
  Lines        2528     2245     -283     
==========================================
- Hits         2388     2100     -288     
- Misses        140      145       +5     
Impacted Files Coverage Δ
R/PipeOpFilterRows.R 100.00% <100.00%> (ø)
R/PipeOpPredictionUnion.R 100.00% <100.00%> (ø)
R/greplicate.R 0.00% <0.00%> (-100.00%) :arrow_down:
R/PipeOpImputeSample.R 60.00% <0.00%> (-30.00%) :arrow_down:
R/PipeOpImpute.R 76.19% <0.00%> (-18.75%) :arrow_down:
R/PipeOpImputeMean.R 90.00% <0.00%> (-10.00%) :arrow_down:
R/PipeOpImputeMedian.R 90.00% <0.00%> (-10.00%) :arrow_down:
R/mlr_graphs_elements.R 82.79% <0.00%> (-9.29%) :arrow_down:
R/GraphLearner.R 94.00% <0.00%> (-2.39%) :arrow_down:
R/Graph.R 87.40% <0.00%> (-1.28%) :arrow_down:
... and 40 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 36ca770...84d7341. Read the comment docs.

codecov-io avatar Dec 13 '20 00:12 codecov-io

Hello,

I thought this new feature might help my issue #566. I gave it a try using this commit and then adding the line po("filterrows", param_vals = list(filter_formula = ~ !apply(is.na(.SD), 1, any))) before my svm learner. But I still got the error about mismatching truth and response lengths so either I'm doing something wrong or this won't solve #566. Please let me know if I'm being really thick!

py9mrg avatar Feb 22 '21 11:02 py9mrg

Hi @py9mrg,

your approach above does not work out of the box because the SDcols hyperparameter of PipeOpFilterRows is set to selector_all() by default, and selector_all() called on a Task will only return the feature names of the task. As you deal with missing values in your target variable you can provide your own selector, e.g.:

selector_all_inc_target = function() {
  make_selector(function(task) {
    c(task$feature_names, task$target_names)
  }, description = "selector_all_inc_target")
}

then use this in your pipeop:

po = po("filterrows",
  param_vals = list(
    filter_formula = ~ !apply(is.na(.SD), 1, any),
    SDcols = selector_all_inc_target()
  )
)

this would result in the 3rd row also being dropped:

po$train(list(task_w_missing))[[1]]$data()
    target variable1 variable2
 1:     32         4         4
 2:     50         5         5
 3:     72         6         6
 4:     98         7         7
...

All in all this should work for you. However, there is some inconsistency now that the state of the po would only list your features "variable1" and "variable2" as being affected (although your target is also). We may have to discuss what to do with PipeOps affecting targets in general. Also I will try to finish this PR up soon and maybe refactor it. Hope this helps.

sumny avatar Feb 23 '21 21:02 sumny

Hello @sumny ,

Thanks so much for coming back to me. That works! Just one thing in case anyone is reading with a similar use case, because make_selector is not exported, I had to use mlr3pipelines:::make_selector. Took me longer than I care to admit to work that out.

Thanks again for your help.

py9mrg avatar Feb 24 '21 13:02 py9mrg

Hello @sumny, do you know when this PR might be merged? I'm using it all the time and it would be nice, when not using renv, to not have to keep rolling mlr3pipelines back to this commit every time I forget to untick the update box in RStudio for this particular package!

Edit: oh I forgot to say, it's probably me, but I can only get the 84d7341 commit to recognise the new filterrows po. I can see it getting the html documentation when installing the other commits as mlr_pipeops_filterrows html appears, but po("filterrows", etc) only works with the older commit.

py9mrg avatar Apr 16 '21 12:04 py9mrg

@py9mrg, @mb706 and I have to go over it and check whether the complete functionality is still within the scope of mlr3pipelines. Also, selecting targets via selectors is still an open discussion #493 I will ping you once it is merged - sorry for keeping you waiting so long!

sumny avatar Apr 22 '21 17:04 sumny

@sumny thanks and don't worry - I'm sure I am completely underestimating how significant the change is.

py9mrg avatar Apr 22 '21 17:04 py9mrg