FEAT Develop the `AggJoiner` and `AggTarget`
This meta-issue is a roadmap for the developments of the AggJoiner and AggTarget estimators, currently implemented in #600.
We need to merge the following PRs before tackling any of the developments below.
- [x] #733
- [x] #600
This is not a personal roadmap. Anyone is welcome to contribute! 🙂
- [x] Add a `key` argument
  As discussed in https://github.com/skrub-data/skrub/discussions/751
- [x] Add `MultiAggJoiner` and `MultiJoiner`
  `Joiner` and `AggJoiner` currently accept multiple aux tables, but this creates complexity because some parameters can be iterables of iterables of str. Instead, we need to create `Multi{AggJoiner, Joiner}` to handle multiple tables, and only accept a single table otherwise.
- [ ] Separate `count` from the other operations
  The count operation is currently applied to all categorical columns, which creates duplicates. Also, we want to use it even if we only have numerical columns. So, we need to treat this operation separately.
- [ ] Add column-specific operations
  Selected operations are applied over all categorical or numerical columns. We could let the user select the columns on which to execute specific operations. We could achieve this either with a column-operation mapping keyword argument or with a list of tuples, inspired by `TableVectorizer`'s `column_specific_transformers`.
- [ ] Support `polars.LazyFrame`
  LazyFrames don't have the `__dataframe__` attribute. Duck-typing like `lazy_df.__dataframe__ = None` is enough to get it working, but we might want a less hacky solution.
- [ ] Support `value_counts` and `hist` operations for Polars
- [ ] Add more operations for both Polars and Pandas dataframes/lazyframes:
  - [ ] topk(k)
  - [ ] unique-count
  - [ ] quantile(q)
  - [ ] first
  - [ ] last
- [ ] Support fuzzy join
  This could be implemented using the `fuzzy_join` function or by composing the `Joiner` class.
- [ ] Enable screening
  When running multiple aggregation operations over features, screening would enable us to select the columns correlating with the target before joining them.
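To make the "separate `count`" and "column-specific operations" items more concrete, here is a minimal pandas-only sketch. It is not skrub code: the `operations` mapping, the table, and the column names are all hypothetical, and the eventual `AggJoiner` API may look quite different.

```python
import pandas as pd

# Hypothetical auxiliary table with several rows per key.
aux = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2, 2],
        "amount": [10.0, 30.0, 5.0, 7.0, 9.0],
        "category": ["a", "b", "a", "a", "c"],
    }
)

# Hypothetical column -> operations mapping (not an existing skrub parameter).
operations = {"amount": ["mean", "max"], "category": ["nunique"]}

grouped = aux.groupby("user_id")

# Apply only the requested operations to each listed column.
agg = grouped.agg(operations)
# Flatten the (column, operation) MultiIndex into "column_operation" names.
agg.columns = [f"{col}_{op}" for col, op in agg.columns]

# Compute the group size once, independently of any column, so it is not
# duplicated per categorical column and also works for all-numeric tables.
agg["count"] = grouped.size()

print(agg.reset_index())
```

Treating `count` as a property of the group rather than of a column is what removes the duplicates mentioned above.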
Any other ideas?
- [ ] cross-fitting in aggtarget
- [ ] shrinking towards global mean
I'm working on adding the key argument, and the MultiAggJoiner :)
Hey! Would having Bayesian means be useful? It seems to me the spirit of skrub is to provide good defaults to users. One thing that can happen with AggTarget and AggJoiner out of the box is that aggregates get computed on small groups, which can lead to overfitting.
Shrinking towards the overall aggregate across groups sounds like a useful option to add.
I guess it only applies for some of the aggregation operations
> I guess it only applies for some of the aggregation operations

My initial understanding of skrub is that it should provide good defaults to users. It's nice to have AggJoiner and AggTarget be able to compute a variety of statistics. But in practice (e.g. Kaggle) it's more or less sufficient to compute the mean. So if these classes are going to be used as part of TableVectorizer, maybe the default could be a mean that shrinks towards the overall mean. I think this would be a good pit of success for most users.
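For reference, one common way to write such a shrunk mean is to add pseudo-observations at the global mean. The sketch below is only an illustration in plain pandas; the function name and the `prior_weight` parameter are made up for the example and are not part of skrub.

```python
import pandas as pd


def shrunk_group_mean(df, key, value, prior_weight=1.0):
    """Per-group mean of `value`, shrunk towards the global mean.

    `prior_weight` (hypothetical name) behaves like a number of
    pseudo-observations located at the global mean: small groups are pulled
    strongly towards it, large groups barely move.
    """
    global_mean = df[value].mean()
    stats = df.groupby(key)[value].agg(["sum", "count"])
    return (stats["sum"] + prior_weight * global_mean) / (
        stats["count"] + prior_weight
    )


# Toy data: group "a" has a single row, group "b" has four.
df = pd.DataFrame(
    {"key": ["a", "b", "b", "b", "b"], "y": [10.0, 0.0, 0.0, 0.0, 0.0]}
)

raw = df.groupby("key")["y"].mean()
shrunk = shrunk_group_mean(df, "key", "y", prior_weight=1.0)
print(pd.DataFrame({"raw": raw, "shrunk": shrunk}))
```

Here the single-row group moves much further from its raw mean than the four-row group does, which is the behaviour one would want on small groups.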
Hi @MaxHalford, that sounds like a good idea, and it is something that scikit-learn's TargetEncoder already does. We should create a small benchmark to explore this idea in AggJoiner and AggTarget.
I'd be interested in working on screening when I'm done with the multi joiners. I think it's a feature that might also be useful for estimators other than the Joiners. If you think that's the case, we can open a new issue on this topic.
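To give an idea of what I mean by screening, here is a rough sketch that scores aggregated columns by their absolute Pearson correlation with the target and keeps only those above a threshold. Everything here is an assumption for illustration: the `screen_aggregates` helper, the `min_abs_corr` threshold, and the choice of Pearson correlation as the score. The key lookup is used for scoring only, not as the final join.

```python
import pandas as pd


def screen_aggregates(agg, main_keys, y, min_abs_corr=0.1):
    """Return the aggregated columns whose |Pearson r| with y passes the threshold.

    agg: aggregated features, indexed by the join key (unique index).
    main_keys: join-key values of the main table, one per row.
    y: target of the main table, aligned row by row with main_keys.
    """
    y = pd.Series(y).reset_index(drop=True)
    # Look up each main-table row's aggregates by key; used for scoring only.
    aligned = agg.reindex(list(main_keys)).reset_index(drop=True)
    corr = aligned.corrwith(y).abs()
    # Constant columns yield NaN correlations and are dropped by the comparison.
    return corr[corr >= min_abs_corr].index.tolist()


# Toy aggregated table: one informative column, one uninformative column.
agg = pd.DataFrame(
    {"amount_mean": [1.0, 2.0, 3.0], "noise": [0.0, 1.0, 0.0]},
    index=pd.Index([1, 2, 3], name="user_id"),
)
main_keys = [1, 2, 3, 1, 2, 3]
y = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]

kept = screen_aggregates(agg, main_keys, y)
print(kept)
```

Only the columns in `kept` would then be joined to the main table.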
As discussed with @TheooJ, the AggTarget does not implement cross-fitting (see e.g. the target encoder doc), which can cause serious overfitting of the downstream estimator. Moreover, shrinking/smoothing is probably important when some values in the joining column have few matches.
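A minimal sketch of what cross-fitting could look like for a per-key target mean, using only NumPy and pandas. The function name, the random fold assignment, and the global-mean fallback are assumptions made for the example; this is not how AggTarget or scikit-learn's TargetEncoder is implemented.

```python
import numpy as np
import pandas as pd


def cross_fit_target_mean(keys, y, n_splits=5, seed=0):
    """Out-of-fold per-key target mean.

    Each row's feature is computed from the folds that do not contain it,
    so a row never sees its own target value.
    """
    keys = pd.Series(keys).reset_index(drop=True)
    y = pd.Series(y, dtype=float).reset_index(drop=True)
    rng = np.random.default_rng(seed)
    # Randomly assign each row to one of `n_splits` folds of near-equal size.
    fold = rng.permutation(len(y)) % n_splits
    encoded = pd.Series(np.nan, index=y.index)
    for k in range(n_splits):
        held_out = fold == k
        # Statistics are learned on the other folds only.
        means = y[~held_out].groupby(keys[~held_out]).mean()
        global_mean = y[~held_out].mean()
        # Keys unseen in the training folds fall back to their global mean.
        encoded[held_out] = (
            keys[held_out].map(means).fillna(global_mean).to_numpy()
        )
    return encoded


# Key "u" appears once: a naive per-key mean would leak its target (100).
keys = ["u", "v", "v", "v"]
y = [100.0, 0.0, 0.0, 0.0]
encoded = cross_fit_target_mean(keys, y, n_splits=2)
print(encoded.tolist())
```

With the naive in-sample mean, the row with key `"u"` would be encoded with its own target value; with cross-fitting it only gets information from the other rows. At predict time one would typically use statistics fitted on the full training set.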