Jérôme Dockès comments

Results: 397 comments of Jérôme Dockès

FEAT Develop the `AggJoiner` and `AggTarget`

As discussed with @TheooJ, the `AggTarget` does not implement cross-fitting (see e.g. the [target encoder doc](https://scikit-learn.org/stable/modules/preprocessing.html#target-encoder)), which can cause serious overfitting of the downstream estimator. Moreover, shrinking/smoothing is probably...
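
To illustrate the cross-fitting idea mentioned in this comment, here is a minimal pure-Python sketch. It is not the `AggTarget` or scikit-learn implementation: the function name `cross_fit_target_encode`, the interleaved fold assignment, and the fallback to the training-fold mean for unseen categories are all assumptions made for the example. Each row is encoded with target means computed only on the other folds, so a row's own target never leaks into its encoding.

```python
from collections import defaultdict


def cross_fit_target_encode(categories, targets, n_folds=3):
    """Hypothetical helper: encode each row with the mean target of its
    category, computed on the other folds only (cross-fitting)."""
    if n_folds < 2:
        raise ValueError("cross-fitting needs at least 2 folds")
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        # Assumption: simple interleaved folds (row i belongs to fold i % n_folds).
        held_out = [i for i in range(n) if i % n_folds == fold]
        train = [i for i in range(n) if i % n_folds != fold]
        sums = defaultdict(float)
        counts = defaultdict(int)
        for i in train:
            sums[categories[i]] += targets[i]
            counts[categories[i]] += 1
        global_mean = sum(targets[i] for i in train) / len(train)
        for i in held_out:
            cat = categories[i]
            # Categories unseen in the training folds fall back to the
            # training-fold global mean (an assumption of this sketch).
            encoded[i] = sums[cat] / counts[cat] if counts[cat] else global_mean
    return encoded


encoded = cross_fit_target_encode(["a", "a", "a", "a"], [1.0, 2.0, 3.0, 4.0], n_folds=2)
print(encoded)
```

A naive encoder would map every row of category `"a"` to the overall mean 2.5, which uses each row's own target; here the rows in one fold only see the mean of the other fold.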

MAINT Support Python 3.8+

TODO

- [ ] reduce number of CI builds

MAINT Support Python 3.8+

thanks for reviving this @parkervg! I guess we haven't really made a decision on this one as we were focusing on other things; I'll update the PR and we'll make...

MAINT Support Python 3.8+

ok opening a new PR seemed easier than fixing the conflicts so this is superseded by https://github.com/skrub-data/skrub/pull/939

MAINT Support Python 3.8+

@parkervg the next release of skrub will support Python 3.9 (not 3.8, which is not supported by scikit-learn and will be EOL in October)

Consider casting to float32 by default in TableVectorizer

that also applies (maybe even more) to encoders; for example, MinHash outputs float64

Handle numerical missing values in TableVectorizer

still, the goal of the `TableVectorizer` is to prepare a table so that the rest of the pipeline will work on it without problems; quite a few estimators lack support...

Handle numerical missing values in TableVectorizer

But I agree that at least the default should probably be to output NaNs where there are missing values, as is currently the case

[FEAT] Add MultiAggJoiner, refactor AggJoiner

so IIUC:

- AggJoiner: aux_table, key, main_key, aux_key, suffix, cols, operations
- MultiAggJoiner: aux_table**s**, key**s**, main_key**s**, aux_key**s**, suffix**es**, cols, operations

is that correct? sounds ok to me

Better threshold metric for fuzzy_join

following up on the skrub meeting discussion and relevant for #821:

## Parametrizing the fuzzy join threshold

In fuzzy join,

- we vectorize the join attributes
- we pair each...