LeoGrin
I have the same issue on Wandb 13.1
I don't understand why the coverage has changed. It seems that the function called by joblib.Parallel (`compute_hash`) is not counted in the coverage, but I may be reading codecov wrong.
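For context, a minimal runnable sketch (with a made-up `compute_hash_like` stand-in, not the actual code) of the pattern in question: when `n_jobs != 1`, the function dispatched through `joblib.Parallel` executes in worker processes, which a coverage tool tracing only the main process may not record.

```python
from joblib import Parallel, delayed

def compute_hash_like(value):
    # Stand-in for the actual `compute_hash` helper.
    return hash(value)

# With n_jobs=2, each call runs in a separate worker process, so a coverage
# run that only traces the parent process can report these lines as uncovered.
results = Parallel(n_jobs=2)(
    delayed(compute_hash_like)(v) for v in ["a", "b", "c"]
)
```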
We use np.unique, so within a single transform call we don't recompute repeated entries. Using self.hash_dict would indeed speed things up if we transform several inputs with common entries, using the same...
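To make the trade-off concrete, here is a rough sketch (helper names are made up, not the actual implementation) of the np.unique approach, which only deduplicates within a single transform call:

```python
import numpy as np

def _hash_one(value):
    # Placeholder for the real hashing function.
    return hash(value)

def transform_with_unique(X):
    # Hash each distinct entry once, then map the results back to every position.
    uniques, inverse = np.unique(X, return_inverse=True)
    hashed = np.array([_hash_one(u) for u in uniques])
    return hashed[inverse]

transform_with_unique(np.array(["paris", "london", "paris"]))
```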
@GaelVaroquaux But are people often using the same encoder to transform several Xs?
@GaelVaroquaux just want to make sure I understood what you were saying before putting the hash_dict back in the code.
Following discussion with @GaelVaroquaux: using the same encoder to transform several Xs may happen in online learning settings, for instance with a big X. I've put the hash_dict back...
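As a rough sketch of the hash_dict idea (illustrative names only, not the actual dirty-cat code): a cache kept on the encoder lets later transform calls, e.g. successive mini-batches in an online-learning loop, reuse hashes computed earlier.

```python
import numpy as np

class CachedHasher:
    """Toy encoder keeping a persistent hash_dict across transform calls."""

    def __init__(self):
        self.hash_dict = {}

    def transform(self, X):
        out = np.empty(len(X), dtype=np.int64)
        for i, value in enumerate(X):
            # Only hash entries that were never seen in any previous call.
            if value not in self.hash_dict:
                self.hash_dict[value] = hash(value)
            out[i] = self.hash_dict[value]
        return out

# Successive mini-batches share entries, so the second call mostly hits the cache.
enc = CachedHasher()
enc.transform(np.array(["paris", "london", "paris"]))
enc.transform(np.array(["paris", "berlin"]))
```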
Awesome! Some additional things that could be useful:
- benchmark on datasets with typos and on abbreviations (@alexis-cvetkov was saying that we could use the is_abbrevation tag in YAGO for this), ...
Great! Maybe it would be cool to add a line about the SuperVectorizer in the "What can dirty-cat do?" section? It doesn't really fit in the rest of the description and...
- I've run the benchmark (thanks Lilian) on Margaret, and it confirms that the batched version always seems to be better. I've also added a `batch_per_job` parameter to the benchmark, ...
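For illustration only (this is not the benchmark code itself), here is a sketch of what a `batch_per_job`-style parameter can control: how many batches of work each joblib job receives, trading scheduling overhead against load balancing.

```python
import numpy as np
from joblib import Parallel, delayed

def process_batch(batch):
    # Stand-in for the real per-batch work (e.g. hashing or encoding rows).
    return [hash(v) for v in batch]

def batched_apply(values, n_jobs=2, batch_per_job=1):
    # More batches per job means smaller chunks: better load balance,
    # but more dispatch overhead.
    batches = np.array_split(np.asarray(values), n_jobs * batch_per_job)
    results = Parallel(n_jobs=n_jobs)(
        delayed(process_batch)(b) for b in batches
    )
    return [h for chunk in results for h in chunk]

batched_apply(["a", "b", "c", "d"], n_jobs=2, batch_per_job=2)
```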
See more discussion in this skrub PR: https://github.com/skrub-data/skrub/pull/592. An idea: maybe add an option to the ColumnTransformer to either 1) parallelize over the transformers and not parallelize each transformer...
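As a rough sketch of the two options (ColumnTransformer and its `n_jobs` are existing scikit-learn API; the inner encoders here are just stand-ins for transformers that may expose their own `n_jobs`):

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# 1) Parallelize over the transformers, keep each transformer single-threaded:
#    ColumnTransformer's n_jobs dispatches each (name, transformer, columns)
#    triple to a joblib worker.
over_transformers = ColumnTransformer(
    [("a", OneHotEncoder(), ["col_a"]), ("b", OneHotEncoder(), ["col_b"])],
    n_jobs=-1,
)

# 2) Run the transformers sequentially and let each one parallelize internally,
#    e.g. SomeEncoder(n_jobs=-1) for an encoder that supports it (hypothetical here).
within_transformers = ColumnTransformer(
    [("a", OneHotEncoder(), ["col_a"]), ("b", OneHotEncoder(), ["col_b"])],
    n_jobs=None,
)
```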