skrub FEAT Develop the `AggJoiner` and `AggTarget`

This meta-issue is a roadmap for the development of the AggJoiner and AggTarget estimators, currently implemented in #600.

We need to merge the following PRs before tackling any of the developments below.

[x] #733
[x] #600

This is not a personal roadmap. Anyone is welcome to contribute! 🙂

[x] Add key argument: As discussed in https://github.com/skrub-data/skrub/discussions/751
[x] Add MultiAggJoiner and MultiJoiner: Joiner and AggJoiner currently accept multiple aux tables, but this creates complexity because some parameters can be an iterable of iterables of str. Instead, we need to create Multi{AggJoiner, Joiner} to handle multiple tables, and only accept a single table otherwise.
[ ] Separate count from the other operations: The count operation is currently applied on all categorical columns, which creates duplicates. Also, we want to use it even if we only have numerical columns. So, we need to treat this operation separately.
[ ] Add column specific operations: Selected operations are applied over all categorical or numerical columns. We could let the user select columns on which to execute specific operations. We could achieve this either with a column-operation mapping keyword argument, or with a list of tuples, inspired by TableVectorizer's column_specific_transformers.
[ ] Support polars.LazyFrame: LazyFrames don't have the __dataframe__ attribute. A duck-typing hack like lazy_df.__dataframe__ = None is enough to get it working, but we might want a less hacky solution.
[ ] Support value_counts and hist operations for Polars
[ ] Add more operations for both Polars and Pandas dataframes/lazyframes:
- [ ] topk(k)
- [ ] unique-count
- [ ] quantile(q)
- [ ] first
- [ ] last
[ ] Support fuzzy join: This could be implemented using the fuzzy_join function or by composing the Joiner class.
[ ] Enable screening: When running multiple aggregation operations over features, screening would enable us to select the columns correlating with the target before joining them.
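
To make the "column specific operations" and "separate count" items more concrete, here is a minimal pandas sketch. The table, its column names, and the `column_operations` mapping are all hypothetical; this is not an existing AggJoiner parameter, only one possible shape for the mapping-based option.

```python
import pandas as pd

# Hypothetical auxiliary table; column names are made up for illustration.
aux = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2, 2],
        "price": [10.0, 20.0, 5.0, 7.0, 9.0],
        "category": ["a", "b", "a", "a", "c"],
    }
)

# Hypothetical column -> operations mapping (not an existing skrub argument).
column_operations = {"price": ["mean", "max"], "category": ["nunique"]}

grouped = aux.groupby("user_id")
agg = grouped.agg(column_operations)
# Flatten the (column, operation) MultiIndex into single-level names.
agg.columns = [f"{col}_{op}" for col, op in agg.columns]

# Count is computed once per group, independently of the column types,
# so it produces no duplicates and works with numerical-only tables.
agg["count"] = grouped.size()

# Left-join the aggregates onto a hypothetical main table.
main = pd.DataFrame({"user_id": [1, 2, 3]})
joined = main.merge(agg.reset_index(), on="user_id", how="left")
print(joined)
```

A list of `(operation, columns)` tuples would carry the same information; the mapping form is used here only because `groupby().agg()` accepts it directly.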

Any other ideas?

[ ] Cross-fitting in AggTarget
[ ] Shrinking towards the global mean

Sep 05 '23 12:09 Vincent-Maladiere

I'm working on adding the key argument, and the MultiAggJoiner :)

Dec 18 '23 21:12 TheooJ

Hey! Would having Bayesian means be useful? It seems to me the spirit of skrub is to provide good defaults to users. One thing that can happen with AggTarget and AggJoiner out of the box is that aggregates get computed on small groups, which can lead to overfitting.

Dec 22 '23 19:12 MaxHalford

shrinking towards the overall aggregate across groups sounds like a useful option to add

Dec 23 '23 11:12 jeromedockes

I guess it only applies for some of the aggregation operations

Dec 23 '23 11:12 jeromedockes

> I guess it only applies for some of the aggregation operations

My initial understanding of skrub is that it should provide good defaults to users. It's nice to have AggJoiner and AggTarget be able to compute a variety of statistics. But in practice (e.g. Kaggle) it's more or less sufficient to compute the mean. So if these classes are going to be used as part of TableVectorizer, maybe the default could be a mean that shrinks towards the overall mean. I think this would be a good pit of success for most users.
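
As a rough illustration of a mean that shrinks towards the overall mean, here is a minimal pandas sketch. The function name and the `prior_weight` parameter are hypothetical; the formula is the usual additive smoothing, (sum + m * global_mean) / (count + m), and nothing here reflects an existing skrub API.

```python
import pandas as pd


def shrunken_group_mean(df, key, value, prior_weight=10.0):
    """Group means of `value`, shrunk towards the global mean.

    `prior_weight` is a hypothetical smoothing parameter: it behaves like
    that many pseudo-observations located at the global mean.
    """
    global_mean = df[value].mean()
    stats = df.groupby(key)[value].agg(["sum", "count"])
    return (stats["sum"] + prior_weight * global_mean) / (
        stats["count"] + prior_weight
    )


# Toy data: shop "b" has a single, extreme observation.
df = pd.DataFrame(
    {"shop": ["a", "a", "a", "a", "b"], "y": [1.0, 1.0, 1.0, 1.0, 6.0]}
)
shrunk = shrunken_group_mean(df, "shop", "y", prior_weight=1.0)
print(shrunk)
```

With a global mean of 2.0, the four-row group "a" barely moves (1.0 to 1.2), while the single-row group "b" is pulled much harder (6.0 to 4.0), which is the behaviour one would want on small groups.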

Dec 23 '23 11:12 MaxHalford

Hi @MaxHalford, that sounds like a good idea, and it is something that scikit-learn's TargetEncoder already does. We should create a small benchmark to explore this idea in AggJoiner and AggTarget.

Dec 24 '23 12:12 Vincent-Maladiere

I'd be interested in working on screening when I'm done with the multi joiners. I think it's a feature that might also be useful for estimators other than the Joiners. If you think that's the case, we can open a new issue on this topic.

Jan 17 '24 12:01 TheooJ

As discussed with @TheooJ, the AggTarget does not implement cross-fitting (see e.g. the target encoder doc), which can cause serious overfitting of the downstream estimator. Moreover, shrinking/smoothing is probably important when some values in the joining column have few matches.
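
For reference, a minimal sketch of what cross-fitting means here, written with plain NumPy and pandas. The function name, its parameters, and the fold assignment are assumptions made for illustration; this is not how AggTarget or scikit-learn implement it.

```python
import numpy as np
import pandas as pd


def cross_fitted_group_mean(keys, y, n_splits=5, seed=0):
    """Out-of-fold group means: each row is encoded without its own fold."""
    keys = pd.Series(keys).reset_index(drop=True)
    y = pd.Series(y, dtype=float).reset_index(drop=True)
    rng = np.random.default_rng(seed)
    # Random, balanced fold assignment (hypothetical choice for this sketch).
    folds = rng.permutation(len(y)) % n_splits
    encoded = pd.Series(np.nan, index=y.index)
    for fold in range(n_splits):
        held_out = folds == fold
        # Statistics are computed on the other folds only.
        train_means = y[~held_out].groupby(keys[~held_out]).mean()
        fallback = y[~held_out].mean()  # used for keys unseen in training folds
        encoded[held_out] = (
            keys[held_out].map(train_means).fillna(fallback).to_numpy()
        )
    return encoded


# Toy data: one outlier target (100) in group "a".
keys = ["a"] * 4 + ["b"] * 4
y = [0, 0, 0, 100, 1, 1, 1, 1]
encoded = cross_fitted_group_mean(keys, y, n_splits=4)
naive = pd.Series(y, dtype=float).groupby(pd.Series(keys)).transform("mean")
print(pd.DataFrame({"key": keys, "y": y, "naive": naive, "cross_fitted": encoded}))
```

The naive in-sample mean leaks the outlier's own target into its feature (25.0 for every "a" row), whereas the cross-fitted value for the outlier row is computed only from the other rows. Shrinking could be combined with this by smoothing `train_means` inside the loop.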

Mar 13 '24 13:03 jeromedockes