Min hash parallel
Compute the min hash transform method in parallel, as suggested by @alexis-cvetkov.
We no longer use the self.hash_dict attribute, so the fit method does nothing now.
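For context, a minimal sketch of the kind of parallel dispatch being discussed; `compute_hash`, its signature, and the batching parameters below are illustrative stand-ins, not the actual dirty_cat code:

```python
import numpy as np
from joblib import Parallel, delayed


def compute_hash(strings, n_components):
    # Stand-in for the real per-string min-hash computation (illustrative only):
    # returns one row of n_components hash values per input string.
    rng = np.random.RandomState(0)
    return rng.random_sample((len(strings), n_components))


def transform_parallel(column, n_components=30, n_jobs=4, batch_size=1000):
    # Split the column into batches and hash each batch in a separate worker.
    batches = np.array_split(column, max(1, len(column) // batch_size))
    hashes = Parallel(n_jobs=n_jobs)(
        delayed(compute_hash)(batch, n_components) for batch in batches
    )
    return np.concatenate(hashes, axis=0)
```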
I don't understand why the coverage has changed. It seems that the function called by joblib.Parallel (compute_hash) is not counted in the coverage, but I may be reading codecov wrong.
I worry that the self.hash_dict was really useful to speed things up by avoiding recomputation of repeated entries
We use np.unique, so for one transform we don't recompute repeated entries. Using self.hash_dict would indeed speed things up if we transform several inputs with common entries, using the same encoder. Do you think that's likely to happen / worth the additional memory usage?
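To illustrate the np.unique point: within a single transform call, repeated entries are hashed once and the result is broadcast back with the inverse index. A generic sketch, with a hypothetical `compute_hash` stand-in:

```python
import numpy as np


def compute_hash(strings, n_components):
    # Stand-in for the real per-string min-hash (illustrative only).
    rng = np.random.RandomState(0)
    return rng.random_sample((len(strings), n_components))


column = np.array(["paris", "london", "paris", "paris", "london"])

# Hash only the unique strings, then broadcast the result back to every row.
unique_values, inverse = np.unique(column, return_inverse=True)
unique_hashes = compute_hash(unique_values, n_components=30)
all_hashes = unique_hashes[inverse]  # shape (5, 30); repeated entries not recomputed
```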
Do you think that's likely to happen / worth the additional memory usage?
Yes: it's very frequent. Typically, the entries are repeated many times.
@GaelVaroquaux But are people often using the same encoder to transform several Xs?
@GaelVaroquaux I just want to make sure I understood what you were saying before putting the hash_dict back in the code.
Following discussion with @GaelVaroquaux: using the same encoder to transform several Xs may happen in online learning settings, for instance with a big X. I've put the hash_dict back in.
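A rough sketch of what keeping hash_dict buys in that setting: strings already hashed in a previous transform call on the same encoder are not recomputed. The class and helper below are illustrative, not the actual encoder code:

```python
import numpy as np


def compute_hash(strings, n_components):
    # Stand-in for the real per-string min-hash (illustrative only).
    rng = np.random.RandomState(0)
    return rng.random_sample((len(strings), n_components))


class CachedMinHash:
    """Illustrative cross-call cache; not the actual MinHashEncoder code."""

    def __init__(self, n_components=30):
        self.n_components = n_components
        self.hash_dict = {}  # string -> hash vector computed in a previous call

    def transform(self, column):
        unique_values, inverse = np.unique(column, return_inverse=True)
        # Only hash strings this encoder has never seen before.
        new_values = [v for v in unique_values if v not in self.hash_dict]
        if new_values:
            new_hashes = compute_hash(np.asarray(new_values), self.n_components)
            self.hash_dict.update(zip(new_values, new_hashes))
        unique_hashes = np.stack([self.hash_dict[v] for v in unique_values])
        return unique_hashes[inverse]
```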
Benchmarking really quickly on my Mac M1 with 8 cores, the rows version is about twice as fast as the batched version, but we should do a more serious benchmark.
On a dataset of which size (number of rows)?
As discussed privately with @jovan-stojanovic regarding the performance of the current implementation: one can profile it using a profiler like py-spy and visually search for bottlenecks using Speedscope. In that case, one can get a py-spy profile for Speedscope using:
py-spy record --native -o py-spy-profile.svg -f speedscope -- python ./script.py
where script.py is the script using MinHash on a suitable dataset for dirty_cat.
If MinHash relies on joblib for parallelization, it might be better to use a single thread (by setting n_jobs=1) so as to better understand the execution when inspecting the profile.
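As an example, a possible script.py to run under the py-spy command above; the dataset and parameters here are arbitrary placeholders, to be replaced by a realistic dirty_cat workload:

```python
# Example script to run under py-spy; the data is a synthetic "dirty" column.
import numpy as np
from dirty_cat import MinHashEncoder

rng = np.random.RandomState(0)
categories = np.array(
    ["Police Officer III", "Fire Fighter", "Bus Operator", "Librarian II"]
)
X = rng.choice(categories, size=100_000).reshape(-1, 1)

# If the parallel version exposes an n_jobs parameter, setting n_jobs=1 here
# gives a cleaner single-threaded profile, as suggested above.
encoder = MinHashEncoder(n_components=30)
encoder.fit(X)
X_enc = encoder.transform(X)
print(X_enc.shape)
```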
After uploading the profile to Speedscope's UI, you will have access to various representations:
- "Left Heavy" to get a flame graph summarizing the total runtime of each function, with the heaviest first
- "Time Order" to understand the order of function calls in their context
- "Sandwich" for fine-grained information on each function
The changelog test apparently works fine :) You should add the PR number to CHANGES.rst
- I've run the benchmark (thanks Lilian) on Margaret, and it confirms that the batched version seems to always be better (the difference between the "rows" and "batched" dispatch strategies is sketched after this list). I've also added a batch_per_job parameter to the benchmark, but it had no influence on the results.
- I've put the minhashencoder code with all options in the bench_minhash function, for reproducibility.
- In the original minhashencoder file, I've only kept the batched version, and removed the parameter to choose.
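For reference, one plausible reading of the "rows" versus "batched" dispatch strategies being compared; this is a simplified sketch with a hypothetical compute_hash stand-in, not the benchmark code itself:

```python
import numpy as np
from joblib import Parallel, delayed


def compute_hash(strings, n_components):
    # Stand-in for the real per-string min-hash (illustrative only).
    rng = np.random.RandomState(0)
    return rng.random_sample((len(strings), n_components))


def transform_rows(column, n_components=30, n_jobs=4):
    # "rows"-style dispatch: one joblib task per entry (high scheduling overhead).
    hashes = Parallel(n_jobs=n_jobs)(
        delayed(compute_hash)(np.asarray([value]), n_components) for value in column
    )
    return np.concatenate(hashes, axis=0)


def transform_batched(column, n_components=30, n_jobs=4):
    # "batched"-style dispatch: one joblib task per chunk (overhead amortized).
    batches = np.array_split(column, n_jobs)
    hashes = Parallel(n_jobs=n_jobs)(
        delayed(compute_hash)(batch, n_components) for batch in batches
    )
    return np.concatenate(hashes, axis=0)
```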
Ready for review!
Just waiting for Jovan's review, and we'll be ready to merge :)
Merging! Thanks @LeoGrin