Phrases.analyze_sentence() performs a greedy search for phrases

Open chaturv3di opened this issue 2 years ago • 1 comments

Problem description

When computing phrases, it would be desirable for Phrases.analyze_sentence() to implement some look-ahead and prefer the phrases with higher scores. Let's say we have bigrams ('a_b', 0.25) and ('b_c', 0.45) and a sentence s = ['a', 'b', 'c']. At present, the return value of phrases.analyze_sentence(s) is going to be:

[('a_b', 0.25),
 ('c', None)]

However, since the second bigram has a higher score, it makes sense for the return value to be:

[('a', None),
 ('b_c', 0.45)]

Is there value in implementing this optimised version of analyze_sentence()?
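
To make the difference concrete, here is a minimal sketch in plain Python that does not call gensim. The helper names greedy_segment and lookahead_segment, and the scores dict of bigrams that already passed the threshold, are hypothetical and only illustrate the idea. The look-ahead variant assumes that "better" means maximising the sum of the scores of the non-overlapping bigrams it picks, which is one possible objective and not necessarily the one gensim would adopt.

```python
from typing import Dict, List, Optional, Tuple

Bigram = Tuple[str, str]
Scored = Tuple[str, Optional[float]]


def greedy_segment(tokens: List[str], scores: Dict[Bigram, float],
                   delimiter: str = "_") -> List[Scored]:
    """Left-to-right greedy merge: take the first scored bigram encountered."""
    out: List[Scored] = []
    i = 0
    while i < len(tokens):
        pair = (tokens[i], tokens[i + 1]) if i + 1 < len(tokens) else None
        if pair in scores:
            out.append((delimiter.join(pair), scores[pair]))
            i += 2  # both tokens are consumed by the phrase
        else:
            out.append((tokens[i], None))
            i += 1
    return out


def lookahead_segment(tokens: List[str], scores: Dict[Bigram, float],
                      delimiter: str = "_") -> List[Scored]:
    """Pick non-overlapping bigrams that maximise the total score (assumed objective)."""
    n = len(tokens)
    best = [0.0] * (n + 2)   # best[i]: best total score for tokens[i:]
    take = [False] * n       # take[i]: merge tokens[i] with tokens[i + 1]
    for i in range(n - 1, -1, -1):
        best[i] = best[i + 1]  # option 1: leave tokens[i] as a unigram
        pair = (tokens[i], tokens[i + 1]) if i + 1 < n else None
        if pair in scores and scores[pair] + best[i + 2] > best[i]:
            best[i] = scores[pair] + best[i + 2]  # option 2: merge the pair
            take[i] = True
    out: List[Scored] = []
    i = 0
    while i < n:
        if take[i]:
            pair = (tokens[i], tokens[i + 1])
            out.append((delimiter.join(pair), scores[pair]))
            i += 2
        else:
            out.append((tokens[i], None))
            i += 1
    return out


scores = {("a", "b"): 0.25, ("b", "c"): 0.45}
s = ["a", "b", "c"]
print(greedy_segment(s, scores))     # [('a_b', 0.25), ('c', None)]
print(lookahead_segment(s, scores))  # [('a', None), ('b_c', 0.45)]
```

The look-ahead version costs one extra backward pass over the sentence, so it stays linear in the sentence length.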

[Update] PS: Would it be okay to (eventually) raise a PR for this?

Versions

Linux-4.19.0-21-cloud-amd64-x86_64-with-debian-10.12
Python 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) 
[GCC 9.4.0]
Bits 64
NumPy 1.19.5
SciPy 1.7.3
gensim 4.2.0
FAST_VERSION 0

chaturv3di, Jul 21 '22 01:07

This is essentially a duplicate of #1719, which includes some discussion (including considerations when multiple Phrases models are stacked). I'll update its title to make it easier to find.

gojomo, Jul 25 '22 18:07