gensim
gensim copied to clipboard
PERF: pyemd to POT for EMD computation in `wmdistance`
fixes #3312
Why this change?
- Speed:
import gensim.downloader as api
corpus = api.load('text8')
model = api.load("glove-wiki-gigaword-50")
corpus = [d for i, d in enumerate(corpus) if i < 2]
Timings for the following call (took a subset because I'm on an old laptop right now):
model.wmdistance(corpus[0][:1000], corpus[1][:1000])
pyemd (before) | POT (after) |
---|---|
30s | 84ms |
- Provides other algorithms that could be used to introduce variants of the WMD (see discussion in #3312)
-
Actively maintained: last release in December 2021 vs January 2018 for
pyemd
- Has good documentation
- Same MIT license
Thanks. Can you add some benchmarks into the PR description?
Something to point people to, a TL;DR for the release notes.
Also a link to this "POT" dependency, its license etc. Is it a maintained / stable library?
I've updated my first comment with the info, but maybe you meant I should amend the first commit's message with all this info?
Hi - Is this PR still being worked on? Happy to take over if needed. I just swapped out pyemd for POT and it improved performance by around a factor of 3.
I'm still able to work on the PR, but now I'm simply waiting for a review.
From a read-over on Github (no personal testing), this looks good. The POT
package also appears to have far more attention/contributors/recent-development than pyemd
, which I consider a good sign. +1 on including though whatever the tiny conflict is should be resolved.
As an aside – not at all a blocker for this issue, and not at all any knock about the work done here, which follows project precedents – it's a bit unfortunate, a "project smell", that the tiny functional change requires changes scattered across 16 files, with many bits of updated comments/boilerplate. The core of the change is small & well-contained: import a different library, replace one function call, in keyedvectors.py
. Obviously, some changes to docs & declared-requirements are unavoidable.
But all the conditionals in tests would ideally be oblivious to implementation, perhaps by using a single is-support-available test rather than a dependency-specific check. And all the autogenerated output changes (eg in auto_examples
stuff) would ideally happen separately in some way that wouldn't burden evaluation of the functional change, expanding the magnitude of this diff, or even risking potential human error or other unrelated changes sneaking in inadvertently.
But that's all more: thoughts for project going forward than anything requiring immediate changes.
@TLouf can you fix the conflict please?
After that we should be ready to merge @mpenkov – unless you see something off!
Thanks @TLouf!
One other thing we may wish to do - there is a function provided by POT to calculate distances. It falls back on scipy's cdist if metric is not euclidean.
Do we want to switch to that?
One other thing we may wish to do - there is a function provided by POT to calculate distances. It falls back on scipy's cdist if metric is not euclidean.
Do we want to switch to that?
Switch to it in which places, in pursuit of what advantage?
Switch to it in which places
This line to ot.dist
.
in pursuit of what advantage?
First, since we are already using POT for emd
, we might as well be using POT's implementation of distance.
Second, it provide the flexibility to use other backends. AFAIK, Gensim wouldn't be taking advantage of it but provide an option for future extensions.
Neither are strong reasons but thought I'd raise it in case others feel more strongly.
As it, ot.dist
replaces cdist
on that line? I'd generally prefer to avoid extra changes unless/until there's a tangible benefit, in perf, functionality, or readability.
(Perhaps, making distance-measure configurable improves performance in some cases. If we know that to be true via investigation, then it could be a good change... but even then, better as a focused PR with its own rationale, rather than mixed with this sort of drop-in, functionality-maintaining performance boost.)
I got this in CI, can @TLouf please double check?
The gallery cache appears stale.
Rebuild the documentation using the following commands from the gensim root subdirectory:
pip install -e .[docs]
make -C docs/src html
and then run `git add docs/src/auto_examples` to update the cache.
Stale files: ['docs/src/auto_examples/tutorials/run_wmd.py', 'docs/src/auto_examples/tutorials/run_fasttext.py']
@mpenkov just went along with CI's recommendation, however it added a ton of files, some of which I should not git add
I think, like all the index.rst
and the .png
s. Let me know if that's correct, and if I should exclude more, and I'll revert the commit to only include what's necessary.
No, I think you did everything right, with the exception of those two files that the CI thinks you did not add. Can you add them? Or am I misunderstanding something?
With what I added in my last commit I think CI should be happy, the 2 python files were already added, but I hadn't included some cache files generated by those scripts, from what I understood.
Hi all - Can this be merged? I'd love to use this from the dev branch. Thanks for your work on this.
Yes, of course. Thank you for reminding me.
@TLouf Thank you for your work!