
Caching Prototype

Open pfistfl opened this issue 4 years ago • 9 comments

See caching.md

pfistfl avatar Apr 01 '20 23:04 pfistfl

My thoughts about things to consider, in random order:

  • with some operations it may make more sense to save just the $state and not the result. Then during $train() the caching mechanism can set the state from cache and call $.predict() (see the sketch after this list).
  • PipeOps should contain metadata about whether they are deterministic or not, and whether their .train() and .predict() results are the same whenever the input to both is the same (use common vs. separate cache)
  • caching in mlrCPO was a wrapper-PipeOp, we could also have that here. Pro: For multiple operations only the last output needs to be saved; makes the configuration of different caching mechanisms easier. Cons: We get the drawbacks of wrapping: the graph structure gets obscured. Also when wrapping multiple operations and just one of them is nondeterministic everything falls apart. We may want a ppl() function that wraps a graph optimally so that linear deterministic segments are cached together and only the output of the last PipeOp is kept. (Also works for arbitrary Graphs).
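
A minimal sketch of how the first two points could combine, assuming a hypothetical `deterministic` flag on the PipeOp and using `digest` to build cache keys; `$hash` is the PipeOp's existing hash, everything else here is made up for illustration and is not the actual implementation:

```r
library(digest)  # hashing for cache keys

# Sketch: cache $state and output of a deterministic PipeOp's $train().
# `cache` is a plain environment used as a key-value store;
# `po$deterministic` is an assumed metadata flag, not an existing field;
# `input` is the list of inputs that $train() expects.
cached_train = function(po, input, cache) {
  if (!isTRUE(po$deterministic)) {
    return(po$train(input))               # nondeterministic: never cache
  }
  key = digest(list(po$hash, lapply(input, digest)))
  if (exists(key, envir = cache, inherits = FALSE)) {
    hit = get(key, envir = cache)
    po$state = hit$state                  # restore trained state from cache
    return(hit$output)
  }
  output = po$train(input)
  assign(key, list(state = po$state, output = output), envir = cache)
  output
}
```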

mb706 avatar Apr 02 '20 07:04 mb706

I added your comments and some responses.

pfistfl avatar Apr 02 '20 07:04 pfistfl

Have we concluded to do this on a per-package basis now rather than upstream in mlr3?

pat-s avatar Apr 19 '20 09:04 pat-s

Have we concluded to do this on a per-package basis now rather than upstream in mlr3?

This would actually be broader than doing it in mlr3. As every step in the pipeline is potentially cached, this includes

  • learners
  • filters
  • data transform pipeops

The only drawback would then be that, in order to benefit from caching, those would need to be part of a pipeline, which they should be anyway in most cases.

benchmark() would then be cached by wrapping each learner inside a GraphLearner, i.e. lrns = map(lrns, function(lrn) GraphLearner$new(po(lrn)))
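
Spelled out, that could look like the following (a sketch; whether benchmark() then actually reuses cached results depends on the prototype, and lapply() stands in for map() to keep it base R):

```r
library(mlr3)
library(mlr3pipelines)

# Wrap each learner in a GraphLearner so its training step would go
# through the (prospective) pipeline caching mechanism.
lrns = list(lrn("classif.rpart"), lrn("classif.featureless"))
lrns = lapply(lrns, function(l) GraphLearner$new(as_graph(po(l))))

design = benchmark_grid(tsk("iris"), lrns, rsmp("cv", folds = 3))
bmr = benchmark(design)
```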

pfistfl avatar Apr 19 '20 09:04 pfistfl

OK. I again want to raise awareness that people who want to use mlr3 without pipelines should also profit from caching. For example, when filtering, users should be able to profit from caching, but this would require adding a per-package caching implementation (and there might be more extension packages besides filters). Such an implementation could potentially conflict with the pipelines caching.

Also, did you look at how drake does this? In the end it is also a workflow package that aims to cache steps within the "Graph" and to detect which steps in the cache do not need to be rerun. Just trying to save you time on tasks that might reinvent the wheel - even though mlr3pipelines might do this completely differently from how drake does it.

pat-s avatar Apr 19 '20 10:04 pat-s

Such an implementation could potentially conflict with the pipelines caching.

In general, different levels of caching should not interfere; the worst case I can imagine is caching things twice, i.e. both the PipeOp and the Filter cache their results. This would just mean that PipeOps that operate on something that itself knows how to cache would be adjusted to deactivate the lower-level caching.

people who want to use mlr3 without pipelines should also profit from caching.

I cannot really judge this, but I am not sure I agree. I do agree that we should provide enough convenience functions to enable people to work without learning all the ins and outs of `mlr3pipelines`.

Currently, writing a filtered learner is as simple as

flt("variance") %>>% lrn("classif.rpart")
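
In full, with the filter wrapped explicitly as a PipeOp (the filter.frac value here is just an illustrative choice):

```r
library(mlr3)
library(mlr3filters)
library(mlr3pipelines)

# Keep the top 50% of features by variance, then fit a decision tree.
graph = po("filter", filter = flt("variance"), filter.frac = 0.5) %>>%
  lrn("classif.rpart")
glrn = GraphLearner$new(graph)

glrn$train(tsk("sonar"))
```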

So correct me if I am wrong, but when would you do filtering without using mlr3pipelines? Only when you do the train/test split and the filtering manually?

I have not looked at drake too deeply, but the current caching implementation that covers everything mlr3pipelines needs has < 40 lines.
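
For a sense of scale, a file-backed cache in that spirit fits comfortably in that budget (a sketch, not the actual prototype; packages like memoise or R.cache cover the same ground):

```r
library(digest)

# Minimal file-backed cache: the hash of the key names an RDS file on disk.
cache_dir = file.path(tempdir(), "mlr3cache")
dir.create(cache_dir, showWarnings = FALSE)

cached = function(key, expr) {
  path = file.path(cache_dir, paste0(digest(key), ".rds"))
  if (file.exists(path)) return(readRDS(path))  # cache hit
  value = force(expr)                           # cache miss: evaluate lazily
  saveRDS(value, path)
  value
}

# Recomputes only on the first call with a given key.
res = cached(list(op = "pca", input_hash = "abc123"), {
  Sys.sleep(1)  # stand-in for an expensive $train() step
  "trained state"
})
```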

pfistfl avatar Apr 19 '20 15:04 pfistfl

This would just mean that PipeOps that operate on something that itself knows how to cache things would be adjusted to deactivate lower-level caching.

Yeah, or maybe to rely on the lower-level caching implementation if it exists.

So correct me if I am wrong, but when would you do filtering without using mlr3pipelines? That would only be the case when you do train/test split and filtering manually?

Yeah, maybe it does not make sense and mlr3pipelines is a "must-use" in the whole game. For now I have only done toy benchmarks without any wrapped learners - all of these are still written in the old mlr. And I can probably also combine drake and mlr3pipelines, maybe even both caching approaches.

You might find mlr3-learnerdrake interesting. We/I should extend it with a pipelines example.

pat-s avatar Apr 19 '20 21:04 pat-s

Review Bernd:

  • Should PipeOp IDs be cached?
  • Release this in a separate release
  • Add a blog post showcasing results
  • Thoroughly test-drive

pfistfl avatar Apr 30 '20 12:04 pfistfl