Min hash parallel
Compute the min hash transform method in parallel, as suggested by @alexis-cvetkov.
We no longer use the self.hash_dict attribute, so the fit method does nothing now.
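For context, a minimal sketch of the kind of parallel dispatch being discussed; `compute_hash`, its signature, and the batching parameters below are illustrative stand-ins, not the actual dirty_cat code:

```python
import numpy as np
from joblib import Parallel, delayed


def compute_hash(strings, n_components):
    # Stand-in for the real per-string min-hash computation (illustrative only):
    # returns one row of n_components hash values per input string.
    rng = np.random.RandomState(0)
    return rng.random_sample((len(strings), n_components))


def transform_parallel(column, n_components=30, n_jobs=4, batch_size=1000):
    # Split the column into batches and hash each batch in a separate worker.
    batches = np.array_split(column, max(1, len(column) // batch_size))
    hashes = Parallel(n_jobs=n_jobs)(
        delayed(compute_hash)(batch, n_components) for batch in batches
    )
    return np.concatenate(hashes, axis=0)
```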
I don't understand why the coverage has changed. It seems that the function called by joblib.Parallel (compute_hash) is not counted in the coverage, but I may be reading codecov wrong.
I worry that the self.hash_dict was really useful to speed things up by avoiding recomputation of repeated entries
We use np.unique, so for one transform we don't recompute repeated entries. Using self.hash_dict would indeed speed things up if we transform several inputs with common entries, using the same encoder. Do you think that's likely to happen / worth the additional memory usage?
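To illustrate the np.unique point: within a single transform call, repeated entries are hashed once and the result is broadcast back with the inverse index. A generic sketch, with a hypothetical `compute_hash` stand-in:

```python
import numpy as np


def compute_hash(strings, n_components):
    # Stand-in for the real per-string min-hash (illustrative only).
    rng = np.random.RandomState(0)
    return rng.random_sample((len(strings), n_components))


column = np.array(["paris", "london", "paris", "paris", "london"])

# Hash only the unique strings, then broadcast the result back to every row.
unique_values, inverse = np.unique(column, return_inverse=True)
unique_hashes = compute_hash(unique_values, n_components=30)
all_hashes = unique_hashes[inverse]  # shape (5, 30); repeated entries not recomputed
```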
Do you think that's likely to happen / worth the additional memory usage?
Yes: it's very frequent. Typically, the entries are repeated many times.
@GaelVaroquaux But are people often using the same encoder to transform several Xs?
@GaelVaroquaux I just want to make sure I understood what you were saying before putting the hash_dict back in the code.
Following discussion with @GaelVaroquaux: using the same encoder to transform several Xs may happen in online learning settings, for instance with a big X. I've put the hash_dict back in.
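A rough sketch of what keeping hash_dict buys in that setting: strings already hashed in a previous transform call on the same encoder are not recomputed. The class and helper below are illustrative, not the actual encoder code:

```python
import numpy as np


def compute_hash(strings, n_components):
    # Stand-in for the real per-string min-hash (illustrative only).
    rng = np.random.RandomState(0)
    return rng.random_sample((len(strings), n_components))


class CachedMinHash:
    """Illustrative cross-call cache; not the actual MinHashEncoder code."""

    def __init__(self, n_components=30):
        self.n_components = n_components
        self.hash_dict = {}  # string -> hash vector computed in a previous call

    def transform(self, column):
        unique_values, inverse = np.unique(column, return_inverse=True)
        # Only hash strings this encoder has never seen before.
        new_values = [v for v in unique_values if v not in self.hash_dict]
        if new_values:
            new_hashes = compute_hash(np.asarray(new_values), self.n_components)
            self.hash_dict.update(zip(new_values, new_hashes))
        unique_hashes = np.stack([self.hash_dict[v] for v in unique_values])
        return unique_hashes[inverse]
```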
Benchmarking really quickly on my Mac M1 with 8 cores, the rows version is about twice as fast as the batched version, but we should do a more serious benchmark.
On a dataset of which size (number of rows)?
As discussed privately with @jovan-stojanovic regarding the performance of the current implementation: one can profile it using a profiler like py-spy and visually search for bottlenecks using Speedscope. In that case, one can get a py-spy profile for Speedscope using:
py-spy record --native -o py-spy-profile.svg -f speedscope -- python ./script.py
where script.py is the script using MinHash on a suitable dataset for dirty_cat.
If MinHash relies on joblib for parallelization, it might be better to use a single thread (by setting n_jobs=1) so as to better understand the execution when inspecting the profile.
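As an example, a possible script.py to run under the py-spy command above; the dataset and parameters here are arbitrary placeholders, to be replaced by a realistic dirty_cat workload:

```python
# Example script to run under py-spy; the data is a synthetic "dirty" column.
import numpy as np
from dirty_cat import MinHashEncoder

rng = np.random.RandomState(0)
categories = np.array(
    ["Police Officer III", "Fire Fighter", "Bus Operator", "Librarian II"]
)
X = rng.choice(categories, size=100_000).reshape(-1, 1)

# If the parallel version exposes an n_jobs parameter, setting n_jobs=1 here
# gives a cleaner single-threaded profile, as suggested above.
encoder = MinHashEncoder(n_components=30)
encoder.fit(X)
X_enc = encoder.transform(X)
print(X_enc.shape)
```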
After uploading the profile to Speedscope's UI, you will have access to various representations:
- "Left Heavy" to get a flame graph summarizing the total runtime of each function, with the heaviest first
- "Time Order" to understand the order of function calls in their context
- "Sandwich" for fine-grained information on each function
The changelog test apparently works fine :) You should add the PR number to CHANGES.rst
- I've run the benchmark (thanks Lilian) on Margaret, and it confirms that the batched version seems to always be better (the difference between the "rows" and "batched" dispatch strategies is sketched after this list). I've also added a batch_per_job parameter to the benchmark, but it had no influence on the results.
- I've put the minhashencoder code with all options in the bench_minhash function, for reproducibility.
- In the original minhashencoder file, I've only kept the batched version, and removed the parameter to choose.
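For reference, one plausible reading of the "rows" versus "batched" dispatch strategies being compared; this is a simplified sketch with a hypothetical compute_hash stand-in, not the benchmark code itself:

```python
import numpy as np
from joblib import Parallel, delayed


def compute_hash(strings, n_components):
    # Stand-in for the real per-string min-hash (illustrative only).
    rng = np.random.RandomState(0)
    return rng.random_sample((len(strings), n_components))


def transform_rows(column, n_components=30, n_jobs=4):
    # "rows"-style dispatch: one joblib task per entry (high scheduling overhead).
    hashes = Parallel(n_jobs=n_jobs)(
        delayed(compute_hash)(np.asarray([value]), n_components) for value in column
    )
    return np.concatenate(hashes, axis=0)


def transform_batched(column, n_components=30, n_jobs=4):
    # "batched"-style dispatch: one joblib task per chunk (overhead amortized).
    batches = np.array_split(column, n_jobs)
    hashes = Parallel(n_jobs=n_jobs)(
        delayed(compute_hash)(batch, n_components) for batch in batches
    )
    return np.concatenate(hashes, axis=0)
```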
Ready for review!
Just waiting for Jovan's review, and we'll be ready to merge :)
Merging! Thanks @LeoGrin