
Caching Prototype

Open pfistfl opened this issue 4 years ago • 9 comments

See caching.md

pfistfl avatar Apr 01 '20 23:04 pfistfl

My thoughts about things to consider, in random order:

  • with some operations it may make more sense to save just the $state and not the result. Then during $train() the caching mechanism can set the state from cache and call $.predict() (see the sketch after this list).
  • PipeOps should contain metadata about whether they are deterministic or not, and whether their .train() and .predict() results are the same whenever the input to both is the same (use common vs. separate cache)
  • caching in mlrCPO was a wrapper-PipeOp, we could also have that here. Pro: For multiple operations only the last output needs to be saved; makes the configuration of different caching mechanisms easier. Cons: We get the drawbacks of wrapping: the graph structure gets obscured. Also when wrapping multiple operations and just one of them is nondeterministic everything falls apart. We may want a ppl() function that wraps a graph optimally so that linear deterministic segments are cached together and only the output of the last PipeOp is kept. (Also works for arbitrary Graphs).
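
A minimal sketch of how the first two points could combine, assuming a hypothetical `deterministic` flag on the PipeOp and using `digest` to build cache keys; `$hash` is the PipeOp's existing hash, everything else here is made up for illustration and is not the actual implementation:

```r
library(digest)  # hashing for cache keys

# Sketch: cache $state and output of a deterministic PipeOp's $train().
# `cache` is a plain environment used as a key-value store;
# `po$deterministic` is an assumed metadata flag, not an existing field;
# `input` is the list of inputs that $train() expects.
cached_train = function(po, input, cache) {
  if (!isTRUE(po$deterministic)) {
    return(po$train(input))               # nondeterministic: never cache
  }
  key = digest(list(po$hash, lapply(input, digest)))
  if (exists(key, envir = cache, inherits = FALSE)) {
    hit = get(key, envir = cache)
    po$state = hit$state                  # restore trained state from cache
    return(hit$output)
  }
  output = po$train(input)
  assign(key, list(state = po$state, output = output), envir = cache)
  output
}
```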

mb706 avatar Apr 02 '20 07:04 mb706

I added your comments and some responses.

pfistfl avatar Apr 02 '20 07:04 pfistfl

Have we concluded to do this on a per-package basis now rather than upstream in mlr3?

pat-s avatar Apr 19 '20 09:04 pat-s

Have we concluded to do this on a per-package basis now rather than upstream in mlr3?

This would actually be broader than doing it in mlr3. As every step in the pipeline is potentially cached, this includes

  • learners
  • filters
  • data transform pipeops

The only drawback would then be that, in order to benefit from caching, those would need to be part of a pipeline, which they should be anyway in most cases.

benchmark() would then be cached by wrapping each learner inside a GraphLearner, i.e. lrns = map(lrns, function(lrn) GraphLearner$new(po(lrn)))
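
Spelled out, that could look like the following (a sketch; whether benchmark() then actually reuses cached results depends on the prototype, and lapply() stands in for map() to keep it base R):

```r
library(mlr3)
library(mlr3pipelines)

# Wrap each learner in a GraphLearner so its training step would go
# through the (prospective) pipeline caching mechanism.
lrns = list(lrn("classif.rpart"), lrn("classif.featureless"))
lrns = lapply(lrns, function(l) GraphLearner$new(as_graph(po(l))))

design = benchmark_grid(tsk("iris"), lrns, rsmp("cv", folds = 3))
bmr = benchmark(design)
```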

pfistfl avatar Apr 19 '20 09:04 pfistfl

OK. I again want to raise awareness that people who want to use mlr3 without pipelines should also profit from caching. For example, when filtering, users should be able to profit from caching, but this would require adding a per-package caching implementation (and there might be more extension packages besides filters). Such an implementation could potentially conflict with the pipelines caching.

Also, did you look at how drake does this? In the end it is also a workflow package that aims to cache steps within the "Graph" and to detect which steps in the cache do not need to be rerun. Just trying to save you time on tasks that might reinvent the wheel - even though mlr3pipelines might do this completely differently from how drake does it.

pat-s avatar Apr 19 '20 10:04 pat-s

Such an implementation could potentially conflict with the pipelines caching.

In general, different levels of caching should not interfere; the worst case I can imagine is caching things twice, i.e. both the PipeOp and the Filter cache their results. This would just mean that PipeOps that operate on something that itself knows how to cache would be adjusted to deactivate the lower-level caching.

people who want to use mlr3 without pipelines should also profit from caching.

I cannot really judge this, but I am not sure I agree. I do agree that we should provide enough convenience functions to enable people to work without learning all the ins and outs of `mlr3pipelines`.

Currently, writing a filtered learner is as simple as

flt("variance") %>>% lrn("classif.rpart")
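
In full, with the filter wrapped explicitly as a PipeOp (the filter.frac value here is just an illustrative choice):

```r
library(mlr3)
library(mlr3filters)
library(mlr3pipelines)

# Keep the top 50% of features by variance, then fit a decision tree.
graph = po("filter", filter = flt("variance"), filter.frac = 0.5) %>>%
  lrn("classif.rpart")
glrn = GraphLearner$new(graph)

glrn$train(tsk("sonar"))
```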

So correct me if I am wrong, but when would you do filtering without using mlr3pipelines? Only when you do the train/test split and the filtering manually?

I have not looked at drake too deeply, but the current caching implementation that covers everything mlr3pipelines needs has < 40 lines.
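
For a sense of scale, a file-backed cache in that spirit fits comfortably in that budget (a sketch, not the actual prototype; packages like memoise or R.cache cover the same ground):

```r
library(digest)

# Minimal file-backed cache: the hash of the key names an RDS file on disk.
cache_dir = file.path(tempdir(), "mlr3cache")
dir.create(cache_dir, showWarnings = FALSE)

cached = function(key, expr) {
  path = file.path(cache_dir, paste0(digest(key), ".rds"))
  if (file.exists(path)) return(readRDS(path))  # cache hit
  value = force(expr)                           # cache miss: evaluate lazily
  saveRDS(value, path)
  value
}

# Recomputes only on the first call with a given key.
res = cached(list(op = "pca", input_hash = "abc123"), {
  Sys.sleep(1)  # stand-in for an expensive $train() step
  "trained state"
})
```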

pfistfl avatar Apr 19 '20 15:04 pfistfl

This would just mean that PipeOps that operate on something that itself knows how to cache things would be adjusted to deactivate lower-level caching.

Yeah, or maybe to rely on the lower-level caching implementation if it exists.

So correct me if I am wrong, but when would you do filtering without using mlr3pipelines? That would only be the case when you do train/test split and filtering manually?

Yeah, maybe it does not make sense and mlr3pipelines is a "must-use" in the whole game. For now I have only done toy benchmarks without any wrapped learners - all of these are still written in the old mlr. And I can probably also combine drake and mlr3pipelines, maybe even both caching approaches.

You might find mlr3-learnerdrake interesting. We/I should extend it with a pipelines example.

pat-s avatar Apr 19 '20 21:04 pat-s

Review Bernd:

  • Should PipeOp IDs be cached?
  • Release this in a separate release
  • Add a blog post showcasing results
  • Thoroughly test-drive

pfistfl avatar Apr 30 '20 12:04 pfistfl