
Redoing pipelines chapter

mb706 opened this issue 2 years ago

Current state: mlr3-book.pdf

mb706 avatar Jul 04 '22 20:07 mb706

Do we have to be super careful with merge conflicts here or are you basically just restructuring everything?

mllg avatar Jul 20 '22 12:07 mllg

I am reordering the whole chapter, so no problem with merge conflicts

mb706 avatar Jul 25 '22 15:07 mb706

NB: the framework comparison part is moved into the attic folder. I think it needs to be revised.

mllg avatar Aug 08 '22 15:08 mllg

@mb706 what's the status with this chapter? I've made a lot of comments about the pipelines chapter on the main branch but am unsure how many are relevant given this PR

RaphaelS1 avatar Oct 30 '22 18:10 RaphaelS1

This one will replace the chapter in the master. I am currently a bit overwhelmed with other responsibilities, however, so will probably only be able to finish it in two weeks or so. I've added a note at the top of https://mlr3book.mlr-org.com/pipelines.html . Sorry if this was too late for your read-through; apparently I was only told about it after the fact.

mb706 avatar Oct 31 '22 04:10 mb706

No worries! I'll just do another review based on this branch, just wanted to confirm this was replacing master

RaphaelS1 avatar Oct 31 '22 19:10 RaphaelS1

  • [x] examples of what can be implemented: would clarify for the data manipulation example that mlr3pipelines doesn't make implementing the actual methods easier, but rather combining them.
  • [x] might be worth mentioning at the beginning that pipelines are DAGs and not general graphs (I assume that loops don't make sense?).
  • [x] 6.1. first code example: why is there a dot in scale. = TRUE?
  • [x] 6.1. at the end I would mention that you can also turn graphs that have a prediction aggregation step as the last operation into learners.
  • [x] 6.3.4 I think it would be more intuitive to show the intermediate results after mutate, as the "intermediate" result shown currently is also the final result.
  • [x] 6.3 expected a subsection on non-sequential graphs there. I would change the title of 6.3.1 to not set up that expectation.
  • [x] 6.4 say explicitly what the difference between a graph and a learner is -- graphs have $train() and $predict() methods, so from the explanations so far it looks like they should be the same.
  • [x] 6.4.2 "information about the parameters" -> "information about the hyperparameters".
  • [x] 6.4.2 the example that shows the param names are prefixed with the IDs when you get the learner out from the graph is a bit long -- I would put the text into a "warning" box or something like that and omit the code example.
  • [x] 6.5.1 would make sense to select the triceps and insulin columns to be shown here so that the reader can see the imputed values.
  • [x] 6.5.2 examples with affect_columns and select give different results for the PC* values -- maybe need to set the seed again for the second example?
  • [x] 6.5.3 not sure if the quote makes it clearer -- "bootstrap replicates" and "learning set" aren't explained. I would remove the quote.
  • [x] 6.5.3 downsampling -> subsampling
  • [x] Figure 6.5 figure says "classif.avg" and caption "majority vote". I would point out that the pipeop was named for consistency, but does majority vote.
  • [x] 6.5.3 last example: what does collect_multiplicity = TRUE do?
  • [x] 6.5.3 last paragraph: rpart -> ranger
  • [x] 6.6 might be better to tune one of the examples given previously instead of introducing something different with a different dataset.
  • [x] 6.7 add an example on how to define a custom ppl().
  • [x] Add conclusion, exercises.

larskotthoff avatar May 06 '23 20:05 larskotthoff

Ping @mb706.

larskotthoff avatar May 12 '23 16:05 larskotthoff

Good review, thanks. Most of these fixes are now in 41c2bd020b21b96ca2bbba1d72fc2ec8f2b2665d

See my comments about the boxes I did not tick for potential things to discuss.

examples of what can be implemented: would clarify for the data manipulation example that mlr3pipelines doesn't make implementing the actual methods easier, but rather combining them.

Changing the sentence to "operations that can be performed", to be more equivocal about whether things are implemented or just used. E.g. I would argue we are actually "implementing" CASH here.

might be worth mentioning at the beginning that pipelines are DAGs and not general graphs (I assume that loops don't make sense?).

added a {.callout-tip}

6.1. first code example: why is there a dot in scale. = TRUE?

This is the name of the argument of the prcomp function underlying the po("pca") and therefore the name of the hyperparameter. I could add a comment here, but then I could also comment all other hyperparameter names. Do you think this is necessary?
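For readers of the thread, a minimal sketch of the point being discussed (standard mlr3pipelines usage; the chapter's actual code may differ):

```r
library(mlr3pipelines)

# po("pca") wraps stats::prcomp(), whose scaling argument is named
# `scale.` -- with a trailing dot, to avoid clashing with base::scale().
# The PipeOp exposes it as a hyperparameter under exactly that name:
op = po("pca", scale. = TRUE)
op$param_set$values$scale.  # TRUE
```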

6.1. at the end I would mention that you can also turn graphs that have a prediction aggregation step as the last operation into learners.

6.1 was supposed to be relatively short to give a very quick overview (it was much longer before but I was already told to shorten it). If others think this particular point should also be part of the intro example I can put it in, but I will leave this up for discussion for now.

6.3.4 I think it would be more intuitive to show the intermediate results after mutate, as the "intermediate" result shown currently is also the final result.

good catch, the graph used to be more complicated.

6.3 expected a subsection on non-sequential graphs there. I would change the title of 6.3.1 to not set up that expectation.

Changed the title to "Basics: Building a Graph"

6.4 say explicitly what the difference between a graph and a learner is -- graphs have $train() and $predict() methods, so from the explanations so far it looks like they should be the same.

expanded the explanation before graph_learner = as_learner(gr).
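As a minimal sketch of that distinction (assuming a standard mlr3 setup; not the chapter's exact code):

```r
library(mlr3)
library(mlr3pipelines)

# A Graph already has $train() and $predict(), but it is not a Learner:
# resample(), benchmark() and tuning expect the Learner interface.
gr = po("pca") %>>% lrn("classif.rpart")
class(gr)  # "Graph" "R6"

# as_learner() wraps the Graph into a GraphLearner, which then plugs
# into the rest of mlr3 like any other learner:
graph_learner = as_learner(gr)
rr = resample(tsk("iris"), graph_learner, rsmp("cv", folds = 3))
```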

6.4.2 "information about the parameters" -> "information about the hyperparameters".

done

6.4.2 the example that shows the param names are prefixed with the IDs when you get the learner out from the graph is a bit long -- I would put the text into a "warning" box or something like that and omit the code example.

done, and added one more line of explanation.

6.5.1 would make sense to select the triceps and insulin columns to be shown here so that the reader can see the imputed values

done

6.5.2 examples with affect_columns and select give different results for the PC* values -- maybe need to set the seed again for the second example?

Well spotted -- I thought the problem here was that PCs are only defined up to a sign, but that was not the case; instead, you actually found a bug in the code. I added a warning for users to watch out for this particular kind of bug.

6.5.3 not sure if the quote makes it clearer -- "bootstrap replicates" and "learning set" aren't explained. I would remove the quote.

done

6.5.3 downsampling -> subsampling

done

Figure 6.5 figure says "classif.avg" and caption "majority vote". I would point out that the pipeop was named for consistency, but does majority vote.

added a sentence to the caption

last example: what does collect_multiplicity = TRUE do?

added a short explanation
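For context in this thread, a sketch of what that explanation covers, using the bagging-via-multiplicities pattern (the chapter's final wording and code may differ):

```r
library(mlr3)
library(mlr3pipelines)

# po("replicate") emits a Multiplicity, so each downstream PipeOp runs
# once per copy without the graph being physically duplicated.
# Constructing the ensembling PipeOp with collect_multiplicity = TRUE
# tells it to gather all copies arriving on its single input channel
# and aggregate them, instead of expecting `innum` separate channels:
gr = po("replicate", reps = 3) %>>%
  po("subsample") %>>%
  lrn("classif.rpart") %>>%
  po("classif.avg", collect_multiplicity = TRUE)
bagging = as_learner(gr)
```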

6.5.3 last paragraph: rpart -> ranger

good find

6.6 might be better to tune one of the examples given previously instead of introducing something different with a different dataset.

Hard veto. I went through a lot of iterations to find an example that is illustrative to some degree by being a pipeline that is not overly complicated, while still showing some result from tuning. OTOH the mnist task also does not really work well in the earlier sections, since there is nothing else interesting about it (no missings to remove, no subsets of features one might want to operate on) and it is also difficult to print in a sensible way.

6.7 add an example on how to define a custom ppl()

Do we really want this? I thought we removed most of the other "extending" things from the book.

Add conclusion, exercises.

done

mb706 avatar May 14 '23 20:05 mb706

Regarding the scale. HP -- I would explain this particular one, as the dot looks odd and might be overlooked, which would then lead to unexpected results.

Regarding mentioning that graphs with an aggregation step as the last operation can be converted into learners, I would definitely mention this, in particular as it's only a sentence.
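A short illustration of that sentence (a sketch of the standard pattern, not the chapter's exact example):

```r
library(mlr3)
library(mlr3pipelines)

# A graph whose last PipeOp aggregates predictions is still a valid
# Learner after as_learner(): here three subsampled rpart models are
# combined by po("classif.avg"), i.e. a majority vote for classification.
gr = ppl("greplicate", po("subsample") %>>% lrn("classif.rpart"), n = 3) %>>%
  po("classif.avg", innum = 3)
bagged = as_learner(gr)
```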

It would be good to show somewhere how to make a custom ppl(), but maybe in the extending chapter? Other opinions?
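For reference, defining a custom `ppl()` boils down to registering a constructor function in the `mlr_graphs` dictionary. A sketch, assuming the current dictionary API; the key "scale_then" is made up for illustration:

```r
library(mlr3)
library(mlr3pipelines)

# ppl("key", ...) looks up "key" in the mlr_graphs dictionary and calls
# the stored function with the remaining arguments. A custom pipeline
# is therefore just a function returning a Graph, registered under a
# new key:
mlr_graphs$add("scale_then", function(learner) {
  po("scale") %>>% as_graph(learner)
})
gr = ppl("scale_then", lrn("classif.rpart"))
```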

larskotthoff avatar May 25 '23 18:05 larskotthoff

I am explaining the scale. thing now, will discuss about the other points.

mb706 avatar May 25 '23 22:05 mb706

Had a call with Bernd, who said the remaining points are fine as they are.

mb706 avatar Jun 05 '23 20:06 mb706