Results 397 comments of Jérôme Dockès

as discussed with @TheooJ, the AggTarget does not implement cross-fitting (see e.g. the [target encoder doc](https://scikit-learn.org/stable/modules/preprocessing.html#target-encoder)), which can cause serious overfitting of the downstream estimator. moreover shrinking/smoothing is probably...
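To illustrate what cross-fitting means here, below is a minimal pure-Python sketch. It is not the skrub or scikit-learn implementation: the function name `cross_fit_target_encode`, its parameters, and the modulo-based fold assignment are all assumptions made for the example. Each row is encoded with target statistics computed only on the other folds, and category means are shrunk toward the mean of the fitting rows.

```python
from collections import defaultdict


def cross_fit_target_encode(categories, targets, n_folds=3, smoothing=1.0):
    """Hypothetical helper: encode each row from the *other* folds only."""
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        # Rows with i % n_folds == fold are held out; all others are used to fit.
        sums = defaultdict(float)
        counts = defaultdict(int)
        total, n_fit = 0.0, 0
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                total += targets[i]
                n_fit += 1
        # Mean target over the fitting rows, used as the shrinkage prior.
        prior = total / n_fit if n_fit else 0.0
        for i in range(fold, n, n_folds):
            c = categories[i]
            denom = counts[c] + smoothing
            # Shrink the category mean toward the prior; a category unseen in
            # the fitting rows falls back to the prior.
            encoded[i] = (sums[c] + smoothing * prior) / denom if denom else prior
    return encoded


categories = ["a", "a", "b", "b", "a", "b"]
targets = [1, 0, 1, 1, 0, 1]
encoded = cross_fit_target_encode(categories, targets, n_folds=2, smoothing=0.0)
```

In this toy run the first row has target 1 but is encoded from the other fold only, so its own target does not leak into the feature the downstream estimator sees.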

TODO
- [ ] reduce number of CI builds

thanks for reviving this @parkervg! I guess we haven't really made a decision on this one as we were focusing on other things; I'll update the PR and we'll make...

ok opening a new PR seemed easier than fixing the conflicts so this is superseded by https://github.com/skrub-data/skrub/pull/939

@parkervg the next release of skrub will support Python 3.9 (not 3.8, which is not supported by scikit-learn and will be EOL in October)

that also applies (maybe even more) to encoders, for example MinHash outputs float64

still, the goal of the TableVectorizer is to prepare a table so that the rest of the pipeline will work on it without problems, quite a few estimators lack support...

But I agree that at least the default should probably be to output NaNs where there are missing values, as is currently the case

so IIUC:

- AggJoiner: aux_table, key, main_key, aux_key, suffix, cols, operations
- MultiAggJoiner: aux_table**s**, key**s**, main_key**s**, aux_key**s**, suffix**es**, cols, operations

is that correct? sounds ok to me

following up on the skrub meeting discussion and relevant for #821:

## Parametrizing the fuzzy join threshold

In fuzzy join,
- we vectorize the join attributes
- we pair each...