gensim PERF: pyemd to POT for EMD computation in `wmdistance`

fixes #3312

Why this change?

Speed:

import gensim.downloader as api
corpus = api.load('text8')
model = api.load("glove-wiki-gigaword-50")
corpus = [d for i, d in enumerate(corpus) if i < 2]

Timings for the following call (took a subset because I'm on an old laptop right now):

model.wmdistance(corpus[0][:1000], corpus[1][:1000])

pyemd (before)	POT (after)
30s	84ms

Provides other algorithms that could be used to introduce variants of the WMD (see discussion in #3312)
Actively maintained: last release in December 2021 vs January 2018 for pyemd
Has good documentation
Same MIT license

Apr 16 '22 10:04 TLouf

Thanks. Can you add some benchmarks into the PR description?

Something to point people to, a TL;DR for the release notes.

Also a link to this "POT" dependency, its license etc. Is it a maintained / stable library?

Apr 16 '22 11:04 piskvorky

I've updated my first comment with the info, but maybe you meant I should amend the first commit's message with all this info?

Apr 16 '22 22:04 TLouf

Hi - Is this PR still being worked on? Happy to take over if needed. I just swapped out pyemd for POT and it improved performance by around a factor of 3.

Oct 17 '22 19:10 sidravi1

I'm still able to work on the PR, but now I'm simply waiting for a review.

Oct 19 '22 17:10 TLouf

From a read-over on Github (no personal testing), this looks good. The POT package also appears to have far more attention/contributors/recent-development than pyemd, which I consider a good sign. +1 on including though whatever the tiny conflict is should be resolved.

As an aside – not at all a blocker for this issue, and not at all any knock about the work done here, which follows project precedents – it's a bit unfortunate, a "project smell", that the tiny functional change requires changes scattered across 16 files, with many bits of updated comments/boilerplate. The core of the change is small & well-contained: import a different library, replace one function call, in keyedvectors.py. Obviously, some changes to docs & declared-requirements are unavoidable.

But all the conditionals in tests would ideally be oblivious to implementation, perhaps by using a single is-support-available test rather than a dependency-specific check. And all the autogenerated output changes (eg in auto_examples stuff) would ideally happen separately in some way that wouldn't burden evaluation of the functional change, expanding the magnitude of this diff, or even risking potential human error or other unrelated changes sneaking in inadvertently.

But that's all more: thoughts for project going forward than anything requiring immediate changes.

Oct 19 '22 22:10 gojomo

@TLouf can you fix the conflict please?

After that we should be ready to merge @mpenkov – unless you see something off!

Oct 20 '22 06:10 piskvorky

Thanks @TLouf!

One other thing we may wish to do - there is a function provided by POT to calculate distances. It falls back on scipy's cdist if metric is not euclidean.

Do we want to switch to that?

Oct 20 '22 11:10 sidravi1

One other thing we may wish to do - there is a function provided by POT to calculate distances. It falls back on scipy's cdist if metric is not euclidean.

Do we want to switch to that?

Switch to it in which places, in pursuit of what advantage?

Oct 20 '22 17:10 gojomo

Switch to it in which places

This line to ot.dist.

in pursuit of what advantage?

First, since we are already using POT for emd, we might as well be using POT's implementation of distance. Second, it provide the flexibility to use other backends. AFAIK, Gensim wouldn't be taking advantage of it but provide an option for future extensions.

Neither are strong reasons but thought I'd raise it in case others feel more strongly.

Oct 21 '22 01:10 sidravi1

As it, ot.dist replaces cdist on that line? I'd generally prefer to avoid extra changes unless/until there's a tangible benefit, in perf, functionality, or readability.

(Perhaps, making distance-measure configurable improves performance in some cases. If we know that to be true via investigation, then it could be a good change... but even then, better as a focused PR with its own rationale, rather than mixed with this sort of drop-in, functionality-maintaining performance boost.)

Oct 21 '22 17:10 gojomo

I got this in CI, can @TLouf please double check?

The gallery cache appears stale.

Rebuild the documentation using the following commands from the gensim root subdirectory:

    pip install -e .[docs]
    make -C docs/src html

and then run `git add docs/src/auto_examples` to update the cache.

Stale files: ['docs/src/auto_examples/tutorials/run_wmd.py', 'docs/src/auto_examples/tutorials/run_fasttext.py']

Oct 22 '22 09:10 mpenkov

@mpenkov just went along with CI's recommendation, however it added a ton of files, some of which I should not git add I think, like all the index.rst and the .pngs. Let me know if that's correct, and if I should exclude more, and I'll revert the commit to only include what's necessary.

Oct 23 '22 09:10 TLouf

No, I think you did everything right, with the exception of those two files that the CI thinks you did not add. Can you add them? Or am I misunderstanding something?

Oct 23 '22 10:10 mpenkov

With what I added in my last commit I think CI should be happy, the 2 python files were already added, but I hadn't included some cache files generated by those scripts, from what I understood.

Oct 24 '22 09:10 TLouf

Hi all - Can this be merged? I'd love to use this from the dev branch. Thanks for your work on this.

Nov 01 '22 15:11 sidravi1

Yes, of course. Thank you for reminding me.

@TLouf Thank you for your work!

Nov 03 '22 13:11 mpenkov

gensim gensim copied to clipboard

PERF: pyemd to POT for EMD computation in `wmdistance`

gensim
gensim copied to clipboard