META - Improvements for the Data Ops

Open rcap107 opened this issue 3 months ago • 1 comment

This issue is intended to track improvements and fixes for the Data Ops that should be implemented in the short term.

Needed fixes

Needed features

  • [ ] Rework joins/move joins to the .skb namespace
  • [ ] Add set_data(): given a Data Op with empty values, this would let the user go back to interactive mode
  • [ ] Improve performance when adding a data op (at the moment, the operation has quadratic complexity)
  • [ ] Allow passing kwargs to fit, predict, and cross_validate
  • [ ] Add a way to reweigh choose_from so that certain branches are not chosen more often than others
  • [ ] Add a "verbose" parameter to .skb.cross_validate to show the search space used to train the model, maybe with a message saying "this was trained on default parameters"
  • [ ] Add a way to track the predict results in skrub.cross_validate and .skb.cross_validate. At the moment, it's very hard to export all the results of the cross-validation folds, but this is useful info for evaluating the performance of the models.
  • [ ] Add truncated_after to Data Ops. At the moment, it's only available for learners.
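
For the choose_from reweighting item, here is a minimal sketch of one possible reading: attach a weight to each outcome and sample proportionally. It uses only the Python standard library. `weighted_choose`, the branch names, and the weights are hypothetical illustrations, not part of skrub's API, and this is not a proposal for the final design.

```python
import random
from collections import Counter


def weighted_choose(outcomes, weights, rng):
    """Pick one outcome with probability proportional to its weight.

    Hypothetical helper for illustration only; skrub has no such function.
    """
    if len(outcomes) != len(weights):
        raise ValueError("outcomes and weights must have the same length")
    # Random.choices samples with replacement, proportionally to `weights`.
    return rng.choices(outcomes, weights=weights, k=1)[0]


# Seeded generator so the draw sequence is reproducible.
rng = random.Random(0)

# Hypothetical branches of a choice; "passthrough" gets twice the weight.
branches = ["ridge", "gradient_boosting", "passthrough"]
weights = [1, 1, 2]

# Draw many times and count how often each branch is selected.
counts = Counter(weighted_choose(branches, weights, rng) for _ in range(10_000))
print(counts)
```

With these weights, "passthrough" should be drawn roughly half of the time and the other two branches roughly a quarter each. Equal weights would reduce to uniform sampling.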

TODO examples

  • [ ] Example with optuna (needs #1661)
  • [ ] Example with XGBoost/CatBoost (needs #1642)
  • [ ] Example with MLFlow

Open Issues

  • #1604
  • #1494
  • #1487
  • #1295

PRs

  • #1623
  • #1511
  • #1654
  • #1653
  • #1646
  • #1642

rcap107 Sep 22 '25 07:09

I added the section on examples with external libraries

rcap107 Oct 07 '25 09:10