DISCUSSION: Updating and extending the user guide

Open rcap107 opened this issue 5 months ago • 6 comments

Below is the outline of the user guide section on DataOps. I have highlighted in bold the concepts I think should be in the user guide before the release, while the others (mostly more advanced material) will be added in later PRs.

The reason for this issue is to gather feedback on what people think is important and should be in the user guide, and what instead can be de-prioritized for the time being.

Relevant PRs:

  • #1529

DataOps

Intro: DataOps make learners

  • [x] What are DataOps for?
  • [x] What are DataOps? What is a learner?
    • [x] What is the environment?
    • [x] How does a learner differ from the scikit-learn estimators?
  • [x] What are some possible use cases?
  • [x] How do expressions differ from...
    • [x] scikit-learn pipelines
    • [x] orchestrators
    • [x] other skrub objects like tabular_pipeline, TableVectorizer
  • [x] Can I use function x from library y?
  • [x] Previews and sampling for easier development
  • [x] Expressions without any data

Expressions control flow

  • [x] What does .skb.eval() do?
  • [x] Marking X and y
  • [ ] .skb.if_else and .skb.match, difference with choose_from(...).if_else and choose_from(...).match
  • [ ] as_expr: turn any object into an expression – make its methods lazy; give it the capabilities of expressions; a way to give a name to an arbitrary value and replace it at fit or predict time
  • [x] Overriding the value of an expression

Reporting

  • [x] Full report
  • [x] Documenting the expressions with .set_name and .set_description
  • [x] Preview and table report
  • [ ] Partial graph (mention vars)
  • [ ] Parallel coordinate plot

From DataOps to prediction: building and using learners

  • [x] How to build a learner
    • [x] .skb.make_learner()
  • [x] Serialization
  • [ ] Applying expressions to column subsets
    • [ ] select + transformations + concat
  • [ ] Concatenating transformers
  • [ ] Finding a fitted estimator to inspect its attributes
  • [x] Truncating a learner at a given step
  • [ ] skb.applied_estimator

Arbitrary code in pipelines: deferred, apply_func, and as_expr

  • [x] Usage on complex operations with dataframe libraries
  • [x] .skb.apply_func() and @deferred
  • [ ] Arguments and default arguments
  • [ ] Global variables and free variables in closures are evaluated before calling the function
  • [ ] Benefits of putting deferred functions in a module (get pickled by name rather than by value in cloudpickle)
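
A minimal sketch of the "by name rather than by value" point in the last bullet, using only the standard library `pickle` module (not skrub or cloudpickle, whose exact behavior is not shown here). It illustrates that a function living in an importable module is serialized as a reference to its module and name, while a function without an importable name cannot be pickled by reference at all. `json.dumps` is used purely as a stand-in for a deferred function defined in a module.

```python
import json
import pickle

# A function that lives in an importable module is pickled by reference:
# the payload stores the module and function names, not the function body.
payload = pickle.dumps(json.dumps)
by_name = b"json" in payload and b"dumps" in payload

# Loading the payload imports the module and looks the name up again,
# so we get back the very same function object.
restored = pickle.loads(payload)
same_object = restored is json.dumps

# A lambda has no importable name, so the standard pickle module refuses it.
try:
    pickle.dumps(lambda x: x + 1)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False
```

One consequence of pickling by name is that the payload stays small and the function body is read from the installed module at load time, so the module must be importable wherever the learner is loaded.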

Speed up development via subsampling

  • Usage
  • [x] When subsampling is not active
  • [x] Subsampling separate datasets
  • [x] Subsampling strategies

Evaluating a learner

  • [x] Splitting in train and test splits (.skb.train_test_split)
  • [ ] Custom splitter functions
  • [x] Cross-validating a learner with .skb.cross_validate
  • [x] Parallel coordinate plot
  • [ ] The unsupervised parameter of apply: when y is needed for score but not for fit
  • [ ] freeze_after_fit

Tuning choices in a DataOps plan

  • [x] choose_* functions
    • [x] Defaults
    • [x] Matching
    • [x] Nested choices
  • [x] .skb.get_grid_search and .skb.get_randomized_search
  • [x] skrub.cross_validate (explain environment)
  • [ ] choosing between completely separate pipelines using choose_from as last step

Serializing learners

  • [x] How to persist on disk
  • [x] How to use a saved learner

rcap107, Jul 21 '25 14:07

Update post 0.6.0 release: most of the "basic" content has been added; what is missing now is the more advanced material, which will be added over time.

rcap107, Jul 25 '25 08:07

Post #1574:

  • The sections on the data ops contain too much code; we should try to move that code to examples for better maintainability
  • doc/data_ops.rst should include the big toc snippet
  • Various pages have multiple h1 titles, while each page should only have one and then subtitles.

Comment from @GaelVaroquaux: https://github.com/skrub-data/skrub/pull/1574#pullrequestreview-3250142939

Many small suggestions mostly for formatting reasons.

My main comment is that the sections on DataOps go into too much detail, and people will not read through them and thus will not discover the features.

It should be made more high level, with mostly text and pointers, and the corresponding pages with a lot of code should be, IMHO, moved to examples.

I will apply all the suggested changes, and merge, unless I've broken something.

I have an issue with this: where are we going to put the advanced features? Or in general, "standalone features" that may be only part of an example? I am concerned that putting the detail in the examples is just going to move the problem elsewhere.

It's also hard to come up with simple examples for a lot of the more advanced features.

This is something to keep in mind for future examples.

rcap107, Sep 22 '25 12:09

It should be made more high level, with mostly text and pointers, and the corresponding pages with a lot of code should be, IMHO, moved to examples.

I have an issue with this: where are we going to put the advanced features? Or in general, "standalone features" that may be only part of an example? I am concerned that putting the detail in the examples is just going to move the problem elsewhere.

The problem is that right now, it's really hard to get a big picture of what actually DataOps do. This problem needs to be solved. Advanced features come only after having people doing less advanced ones.

One solution may be to have advanced pages for each set of features, but a main page for DataOps giving the big picture.

GaelVaroquaux, Sep 22 '25 20:09

The problem is that right now, it's really hard to get a big picture of what actually DataOps do. This problem needs to be solved. Advanced features come only after having people doing less advanced ones.

One solution may be to have advanced pages for each set of features, but a main page for DataOps giving the big picture.

I get the problem now.

What should the "main page" look like, however? Like, what parts of the current outline do you think should be put together in the same page?

rcap107, Sep 22 '25 20:09

What should the "main page" look like, however? Like, what parts of the current outline do you think should be put together in the same page?

Summaries of what we have in the separate pages, IMHO. You can even try asking an LLM to summarize, to get a first draft.

GaelVaroquaux, Sep 22 '25 20:09

What should the "main page" look like, however? Like, what parts of the current outline do you think should be put together in the same page?

Summaries of what we have in the separate pages, IMHO. You can even try asking an LLM to summarize, to get a first draft.

I made a draft PR about this in #1632

rcap107, Sep 24 '25 13:09