LeoGrin

17 comments by LeoGrin

I don't understand why the coverage has changed. It seems that the function called by joblib.Parallel (`compute_hash`) is not counted in the coverage, but I may be reading codecov wrong.
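
For reference, the pattern in question looks roughly like this (a minimal sketch; `compute_hash` here stands in for the real worker function):

```python
from joblib import Parallel, delayed

def compute_hash(value):
    # stand-in for the actual hashing done by the encoder
    return hash(value)

# With a process-based joblib backend the workers run in separate
# processes, so a coverage tool only records these calls if it is
# configured to measure subprocesses as well.
hashes = Parallel(n_jobs=2)(delayed(compute_hash)(v) for v in ["a", "b", "a"])
```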

We use `np.unique`, so within a single transform we don't recompute repeated entries. Using `self.hash_dict` would indeed speed things up if we transform several inputs with common entries, using the same...
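
To illustrate the deduplication within a single call (a minimal sketch, not the encoder's actual code; `hash_func` is a placeholder):

```python
import numpy as np

def transform_once(values, hash_func):
    # Hash each distinct entry only once, then map the hashes back
    # to the original positions via the inverse indices.
    uniques, inverse = np.unique(values, return_inverse=True)
    unique_hashes = np.array([hash_func(u) for u in uniques])
    return unique_hashes[inverse]

# Example: repeated entries are hashed a single time.
print(transform_once(["paris", "london", "paris"], hash))
```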

@GaelVaroquaux But are people often using the same encoder to transform several Xs?

@GaelVaroquaux I just want to make sure I understood what you were saying before putting the `hash_dict` back in the code.

Following a discussion with @GaelVaroquaux: using the same encoder to transform several Xs may happen in online learning settings, for instance with a big X. I've put the `hash_dict` back...
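
Roughly, the cached variant could look like this (a sketch with illustrative names, not the actual implementation):

```python
class CachedHasher:
    """Reuse hashes of entries seen in earlier transform calls."""

    def __init__(self, hash_func):
        self.hash_func = hash_func
        self.hash_dict = {}

    def transform(self, values):
        # Entries already seen in a previous call (e.g. an earlier
        # mini-batch in an online setting) are looked up, not recomputed.
        out = []
        for v in values:
            if v not in self.hash_dict:
                self.hash_dict[v] = self.hash_func(v)
            out.append(self.hash_dict[v])
        return out
```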

Awesome! Some additional things which could be useful:
- benchmark on datasets with typos and on abbreviations (@alexis-cvetkov was saying that we could use the `is_abbrevation` tag in YAGO for this), ...

Great! Maybe it would be cool to add one line about the `SuperVectorizer` in "What can dirty-cat do?"? It doesn't really fit in the rest of the description and...

- I've run the benchmark (thanks Lilian) on Margaret, and it confirms that the batched version always seems to be better. I've also added a `batch_per_job` parameter to the benchmark,...
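
A rough sketch of what such a parameter controls (the helper below is hypothetical, not the benchmark code itself):

```python
import numpy as np
from joblib import Parallel, delayed

def _hash_batch(batch, hash_func):
    return [hash_func(v) for v in batch]

def hash_in_batches(values, hash_func, n_jobs=2, batch_per_job=1):
    # Split the input into n_jobs * batch_per_job chunks so each task
    # processes a whole batch instead of a single entry, which reduces
    # scheduling overhead.
    batches = np.array_split(np.asarray(values, dtype=object), n_jobs * batch_per_job)
    results = Parallel(n_jobs=n_jobs)(
        delayed(_hash_batch)(batch, hash_func) for batch in batches
    )
    return [h for batch in results for h in batch]
```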

See more discussion in this skrub PR: https://github.com/skrub-data/skrub/pull/592 An idea: maybe add an option to the `ColumnTransformer` to either 1) parallelize over the transformers without parallelizing within each transformer...
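
A sketch of the two configurations, using a toy encoder with its own `n_jobs` (everything here is illustrative; the option to switch between the two setups is the idea being proposed, not an existing scikit-learn parameter):

```python
import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer


def _hash_column(col):
    return [hash(v) % 1000 for v in col]


class ToyHashEncoder(BaseEstimator, TransformerMixin):
    """Toy transformer that can parallelize over its own columns."""

    def __init__(self, n_jobs=1):
        self.n_jobs = n_jobs

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = np.asarray(X)
        cols = Parallel(n_jobs=self.n_jobs)(
            delayed(_hash_column)(X[:, j]) for j in range(X.shape[1])
        )
        return np.column_stack(cols)


X = pd.DataFrame({"a": ["x", "y"], "b": ["u", "v"]})

# Option 1: parallelize across the transformers, keep each one sequential.
ct_outer = ColumnTransformer(
    [("h1", ToyHashEncoder(n_jobs=1), ["a"]),
     ("h2", ToyHashEncoder(n_jobs=1), ["b"])],
    n_jobs=-1,
)

# Option 2: run the transformers one after the other, each parallelizing inside.
ct_inner = ColumnTransformer(
    [("h1", ToyHashEncoder(n_jobs=-1), ["a"]),
     ("h2", ToyHashEncoder(n_jobs=-1), ["b"])],
    n_jobs=1,
)

print(ct_outer.fit_transform(X))
print(ct_inner.fit_transform(X))
```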