PipeOpFilterRows
PipeOpFilterRows allows filtering the rows of a task's data according to a filter parameter and adds an easy way to remove rows with missing values; related to #301.
~~PipeOpPredictionUnion allows combining predictions of the same task_type and predict_type into a new Prediction. This may need some discussion, as combining predictions is imo not that straightforward.~~ Dropped this from the PR.
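A minimal usage sketch (assuming this PR branch is installed; hyperparameters as discussed below):
library(mlr3)
library(mlr3pipelines)
po_fr = po("filterrows", param_vals = list(filter_formula = ~ !is.na(insulin)))
filtered = po_fr$train(list(tsk("pima")))[[1L]]
filtered$nrow  # only rows with non-missing insulin remain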
Codecov Report

Merging #410 into master will increase coverage by 0.22%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #410      +/-   ##
==========================================
+ Coverage   93.82%   94.05%   +0.22%
==========================================
  Files          68       70       +2
  Lines        2025     2103      +78
==========================================
+ Hits         1900     1978      +78
  Misses        125      125

| Impacted Files | Coverage Δ |
|---|---|
| R/PipeOpFilterRows.R | 100.00% <100.00%> (ø) |
| R/PipeOpPredictionUnion.R | 100.00% <100.00%> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 5579e15...92c9fa7.
Codecov Report

Merging #410 into master will increase coverage by 0.23%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #410      +/-   ##
==========================================
+ Coverage   93.30%   93.54%   +0.23%
==========================================
  Files          70       72       +2
  Lines        2167     2245      +78
==========================================
+ Hits         2022     2100      +78
  Misses        145      145

| Impacted Files | Coverage Δ |
|---|---|
| R/PipeOpFilterRows.R | 100.00% <100.00%> (ø) |
| R/PipeOpPredictionUnion.R | 100.00% <100.00%> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 141f60e...f7798ad.
I'm not sure how mlr3 currently handles / should handle a Prediction with missing rows. We may also want to use classifavg / regavg for unioning, because they would support multiple predictions for the same ID.
Also, the need to filter rows during training (e.g. based on NAs or some other property that a learner cannot handle) may be very different from this splitting-predictions-between-learners use case?
Cleaned this up. Currently has two parameters:
- filter_formula (a formula with only a rhs, evaluating to TRUE or FALSE within the data of the Task, indexing the rows to keep)
- na_selector (a Selector function selecting the features that should be checked for missing values; rows with missing values w.r.t. these features are dropped)
I agree with @mb706 that this is somewhat redundant, because:
po1 = po("filterrows", param_vals = list(filter_formula = ~ !is.na(insulin)))
po2 = po("filterrows", param_vals = list(na_selector = selector_name("insulin")))
results in virtually the same operation being done.
However, what if I have a huge number of features and want to e.g. remove missing values w.r.t. factor features? The approach via filter_formula = ~ !is.na(feature1) & !is.na(feature2) & ... seems quite cumbersome compared to na_selector = selector_type("factor"). I guess you could somewhat auto-generate the formula in this case, but I am not sure if this is really nicer.
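For reference, auto-generating such a formula would be straightforward (a sketch; the feature names here are assumed):
feats = c("feature1", "feature2", "feature3")  # e.g. the names of the factor features
rhs = paste(sprintf("!is.na(%s)", feats), collapse = " & ")
filter_formula = as.formula(paste("~", rhs))
# ~ !is.na(feature1) & !is.na(feature2) & !is.na(feature3)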
Maybe we should just split this up into PipeOpFilterRows and PipeOpRemoveMissings to not confuse users and state in PipeOpFilterRows that PipeOpRemoveMissings provides a cleaner interface for missing value removal/complete case analysis.
That argument for 'nacols' would also apply to other conditions to filter by, e.g. trying to filter if any of a number of features is 0 or negative. We could go the `data.table` route and
1. introduce the `.SD` variable that is just a `data.table` of the dataset
2. introduce an `SDcols` hyperparameter that selects the features that should be present in `.SD`

So then dropping rows where any feature is NA would go down to
SDcols = selector_grep("^nonNA\\.")
filter_formula = ~ !apply(is.na(.SD), 1, any)
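To make the proposal concrete, here is a plain data.table sketch of the evaluation such a PipeOp could perform internally (hypothetical, not the actual implementation):
library(data.table)
dt = data.table(a = c(1, NA, 3), b = c(4, 5, NA))
f = ~ !apply(is.na(.SD), 1, any)           # the proposed filter_formula
sd_cols = c("a", "b")                      # columns an SDcols selector would pick
vars = c(as.list(dt), list(.SD = dt[, sd_cols, with = FALSE]))
keep = eval(f[[2L]], envir = vars, enclos = globalenv())  # evaluate the rhs of the formula
dt[keep]                                   # only the complete first row remains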
So this is supposed to filter out rows during predict time, right? Is there a way to keep track of what sample is being predicted, once a learner sees the incomplete prediction dataset? So if my input task has two rows, one is dropped, my Graph returns one prediction, which sample does the prediction belong to?
Suppose that is not a problem, what is the current state of mlr3 handling missing predictions? Does it ignore them? Use a fallback learner? Just crash? In the latter case we should modify GraphLearner to do something sensible with missing predictions.
Also, maybe this PipeOp should have the option to select in what phase samples are filtered? E.g. "train", "predict", "always".
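Such an option could look like this (purely hypothetical; a phase hyperparameter does not exist in this PR):
po("filterrows", param_vals = list(
  filter_formula = ~ !is.na(insulin),
  phase = "train"  # hypothetical hyperparameter: only filter during training
))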
> That argument for 'nacols' would also apply to other conditions to filter by, e.g. trying to filter if any of a number of features is 0 or negative. We could go the `data.table` route and
> 1. introduce the `.SD` variable that is just a `data.table` of the dataset
> 2. introduce an `SDcols` hyperparameter that selects the features that should be present in `.SD`
>
> So then dropping rows where any feature is NA would go down to
> SDcols = selector_grep("^nonNA\\.")
> filter_formula = ~ !apply(is.na(.SD), 1, any)
I like the idea of using SDcols. I included this. So filtering is now only specified via filter_formula, which can rely on the .SD variable because the formula is evaluated within the frame of the Task's data.table backend.
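For illustration, dropping rows with a missing value in any factor feature could now be written as follows (a sketch assuming the hyperparameters described above):
po("filterrows", param_vals = list(
  SDcols = selector_type("factor"),
  filter_formula = ~ !apply(is.na(.SD), 1, any)
))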
> So this is supposed to filter out rows during predict time, right? Is there a way to keep track of what sample is being predicted, once a learner sees the incomplete prediction dataset? So if my input task has two rows, one is dropped, my Graph returns one prediction, which sample does the prediction belong to?
>
> Suppose that is not a problem, what is the current state of mlr3 handling missing predictions? Does it ignore them? Use a fallback learner? Just crash? In the latter case we should modify GraphLearner to do something sensible with missing predictions.
>
> Also, maybe this PipeOp should have the option to select in what phase samples are filtered? E.g. "train", "predict", "always".
I think we are probably mixing up different use cases. I believe what @pfistfl wanted to make possible in #301 is to simply split your task based on some logic, both during training and prediction (with the row ids of the splits being disjoint and their union being equal to the row ids of the complete task). In this case, using PipeOpFilterRows would look somewhat like this:
task = tsk("pima")
gr1 = po("filterrows", param_vals = list(filter_formula = ~ age < 31)) %>>% lrn("classif.rpart")
gr2 = po("filterrows", param_vals = list(filter_formula = ~ age >= 31)) %>>% lrn("classif.rpart")
gr1$train(task)
gr2$train(task)
p1 = gr1$predict(task)[[1L]] # correct row_ids are retained
p2 = gr2$predict(task)[[1L]]
setequal(union(p1$row_ids, p2$row_ids), task$row_ids)
Then we would only need another PipeOp that does nothing during training and combines the predictions during prediction, i.e., simply wrapping c(), so we could have something like copy %>>% list(filterrows1, filterrows2) %>>% unite, and the prediction output would be:
c(p1, p2)
<PredictionClassif> for 768 observations:
 row_id truth response
      4   neg      neg
      6   neg      neg
      7   pos      neg
    ---
    763   neg      neg
    764   neg      neg
    767   pos      pos
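Spelled out as a graph, this wiring could look as follows (a sketch; po("predictionunion") stands for the unioning PipeOp discussed above, which was dropped from this PR and is hypothetical here):
library(mlr3)
library(mlr3pipelines)
graph = po("copy", outnum = 2) %>>%
  gunion(list(
    po("filterrows", id = "fr1", param_vals = list(filter_formula = ~ age < 31)) %>>%
      po("learner", learner = lrn("classif.rpart"), id = "rp1"),
    po("filterrows", id = "fr2", param_vals = list(filter_formula = ~ age >= 31)) %>>%
      po("learner", learner = lrn("classif.rpart"), id = "rp2")
  )) %>>%
  po("predictionunion")  # hypothetical: does nothing at train time, wraps c() at predict time
graph$train(tsk("pima"))
p = graph$predict(tsk("pima"))[[1L]]  # a single Prediction covering all rows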
Codecov Report

Merging #410 (f7798ad) into master (fc1e85a) will decrease coverage by 0.92%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #410      +/-   ##
==========================================
- Coverage   94.46%   93.54%   -0.93%
==========================================
  Files          77       72       -5
  Lines        2528     2245     -283
==========================================
- Hits         2388     2100     -288
- Misses        140      145       +5

| Impacted Files | Coverage Δ |
|---|---|
| R/PipeOpFilterRows.R | 100.00% <100.00%> (ø) |
| R/PipeOpPredictionUnion.R | 100.00% <100.00%> (ø) |
| R/greplicate.R | 0.00% <0.00%> (-100.00%) :arrow_down: |
| R/PipeOpImputeSample.R | 60.00% <0.00%> (-30.00%) :arrow_down: |
| R/PipeOpImpute.R | 76.19% <0.00%> (-18.75%) :arrow_down: |
| R/PipeOpImputeMean.R | 90.00% <0.00%> (-10.00%) :arrow_down: |
| R/PipeOpImputeMedian.R | 90.00% <0.00%> (-10.00%) :arrow_down: |
| R/mlr_graphs_elements.R | 82.79% <0.00%> (-9.29%) :arrow_down: |
| R/GraphLearner.R | 94.00% <0.00%> (-2.39%) :arrow_down: |
| R/Graph.R | 87.40% <0.00%> (-1.28%) :arrow_down: |
| ... and 40 more | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 36ca770...84d7341.
Hello,
I thought this new feature might help with my issue #566. I gave it a try using this commit, adding the line po("filterrows", param_vals = list(filter_formula = ~ !apply(is.na(.SD), 1, any))) before my SVM learner. But I still got the error about mismatching truth and response lengths, so either I'm doing something wrong or this won't solve #566. Please let me know if I'm being really thick!
Hi @py9mrg,
your approach above does not work out of the box because the SDcols hyperparameter of PipeOpFilterRows is set to selector_all() by default, and selector_all() called on a Task will only return the task's feature names. Since you are dealing with missing values in your target variable, you can provide your own selector, e.g.:
selector_all_inc_target = function() {
make_selector(function(task) {
c(task$feature_names, task$target_names)
}, description = "selector_all_inc_target")
}
Then use this in your PipeOp:
po = po("filterrows",
param_vals = list(
filter_formula = ~ !apply(is.na(.SD), 1, any),
SDcols = selector_all_inc_target()
)
)
This would result in the 3rd row also being dropped:
po$train(list(task_w_missing))[[1]]$data()
   target variable1 variable2
1:     32         4         4
2:     50         5         5
3:     72         6         6
4:     98         7         7
...
All in all this should work for you.
However, there is some inconsistency now: the state of the po would only list your features "variable1" and "variable2" as being affected (although your target is affected as well). We may have to discuss what to do with PipeOps affecting targets in general. Also, I will try to finish this PR up soon and maybe refactor it. Hope this helps.
Hello @sumny,
Thanks so much for coming back to me. That works! Just one thing in case anyone is reading with a similar use case: because make_selector is not exported, I had to use mlr3pipelines:::make_selector. It took me longer than I care to admit to work that out.
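For reference, the adjusted definition from above then reads:
selector_all_inc_target = function() {
  mlr3pipelines:::make_selector(function(task) {
    c(task$feature_names, task$target_names)
  }, description = "selector_all_inc_target")
}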
Thanks again for your help.
Hello @sumny, do you know when this PR might be merged? I'm using it all the time, and it would be nice, when not using renv, to not have to keep rolling mlr3pipelines back to this commit every time I forget to untick the update box in RStudio for this particular package!
Edit: oh, I forgot to say, it's probably me, but I can only get the 84d7341 commit to recognise the new filterrows po. I can see the HTML documentation being installed with the other commits, as the mlr_pipeops_filterrows help page appears, but po("filterrows", etc) only works with the older commit.
@py9mrg, @mb706 and I have to go over it and check whether the complete functionality is still within the scope of mlr3pipelines. Also, selecting targets via selectors is still an open discussion (#493). I will ping you once it is merged - sorry for keeping you waiting so long!
@sumny thanks and don't worry - I'm sure I am completely underestimating how significant the change is.