skrub
META - Improvements for the Data Ops
This issue is intended to track improvements and fixes for the Data Ops that should be implemented in the short term.
Needed fixes
Needed features
- [ ] Rework joins/move joins to the `.skb` namespace
- [ ] Add `set_data()`: if I have a Data Op with empty values, this lets me go back to interactive mode
- [ ] Improve performance when adding a data op (at the moment, the operation has quadratic complexity)
- [ ] Allow passing kwargs to `fit`, `predict`, `cross_validate`
- [ ] Add a way to reweigh `choose_from` so that certain branches are not chosen more often than others
- [ ] Add a "verbose" parameter to `.skb.cross_validate` to show the search space used to train the model, maybe with a message saying "this was trained on default parameters"
- [ ] Add a way to track the predict results in `skrub.cross_validate` and `.skb.cross_validate`. At the moment, it's very hard to export all the results of the cross-validation fold, but this is useful info for evaluating the performance of the models.
- [ ] Add `truncated_after` to Data Ops. At the moment, it's only available for learners.
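For context on the prediction-tracking item above, here is a rough sketch of the kind of per-fold output that would be useful to have. It uses plain scikit-learn rather than Data Ops, so it only illustrates the desired result, not how it would be implemented in skrub. `collect_fold_predictions` is a hypothetical helper (not an existing skrub or scikit-learn function), and the toy dataset and `LogisticRegression` are placeholders.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold


def collect_fold_predictions(estimator, X, y, cv):
    """Hypothetical helper: fit a fresh clone on each training split
    and keep the predictions made on the matching test split."""
    records = []
    for fold, (train_idx, test_idx) in enumerate(cv.split(X, y)):
        model = clone(estimator).fit(X[train_idx], y[train_idx])
        records.append(
            pd.DataFrame(
                {
                    "fold": fold,        # index of the cross-validation fold
                    "row": test_idx,     # position of the sample in X
                    "y_true": y[test_idx],
                    "y_pred": model.predict(X[test_idx]),
                }
            )
        )
    # One long table: one row per (fold, test sample)
    return pd.concat(records, ignore_index=True)


# Toy data, used only to make the sketch runnable
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
predictions = collect_fold_predictions(LogisticRegression(max_iter=1000), X, y, cv)
```

With a table like this, per-fold metrics or error analysis can be computed after the fact with a plain `groupby("fold")`, which is the information that is currently hard to get out of the cross-validation results.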
TODO examples
- [ ] Example with optuna (needs #1661)
- [ ] Example with XGBoost/CatBoost (needs #1642)
- [ ] Example with MLFlow
Open Issues
- #1604
- #1494
- #1487
- #1295
PRs
- #1623
- #1511
- #1654
- #1653
- #1646
- #1642
I added the section on examples with external libraries