
Gensim sort_by_descending_frequency changes most_similar results

Open ginward opened this issue 3 years ago • 5 comments

Problem description

It seems that sorting the vocabulary by descending word frequency changes the results Gensim returns when retrieving the most similar word vectors.

Steps/code/corpus to reproduce

Before sorting:

from gensim.models import FastText
from gensim.test.utils import common_texts  # some example sentences
print(len(common_texts))
model = FastText(vector_size=4, window=3, min_count=1)  # instantiate
model.build_vocab(corpus_iterable=common_texts)
model.train(corpus_iterable=common_texts, total_examples=len(common_texts), epochs=1)  

model.wv.most_similar(positive=["human"])
[('interface', 0.7432922720909119),
 ('minors', 0.6719315052032471),
 ('time', 0.3513716757297516),
 ('computer', 0.05815044790506363),
 ('response', -0.11714297533035278),
 ('graph', -0.15643596649169922),
 ('eps', -0.2679084539413452),
 ('survey', -0.34035828709602356),
 ('trees', -0.63677978515625),
 ('user', -0.6500451564788818)]

However, if I sort the vectors by descending frequency:

model.wv.sort_by_descending_frequency()

model.wv.most_similar(positive=["human"])
[('minors', 0.9638221263885498),
 ('time', 0.6335864067077637),
 ('interface', 0.40014874935150146),
 ('computer', 0.03224882856011391),
 ('response', -0.14850640296936035),
 ('graph', -0.2249641716480255),
 ('survey', -0.26847705245018005),
 ('user', -0.45202943682670593),
 ('eps', -0.497650682926178),
 ('trees', -0.6367797255516052)]

Both the most-similar word ranking and the individual similarity values change. Any idea why?

If your problem is with a specific Gensim model (word2vec, lsimodel, doc2vec, fasttext, ldamodel etc), include the following:

print(my_model.lifecycle_events)
[{'params': 'FastText(vocab=0, vector_size=4, alpha=0.025)', 'datetime': '2021-07-20T09:46:56.158863', 'gensim': '4.0.1', 'python': '3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) \n[GCC 7.3.0]', 'platform': 'Linux-3.10.0-1160.31.1.el7.csd3.x86_64-x86_64-with-redhat-7.9-Nitrogen', 'event': 'created'}, {'msg': 'effective_min_count=1 retains 12 unique words (100.0%% of original 12, drops 0)', 'datetime': '2021-07-20T09:46:56.159995', 'gensim': '4.0.1', 'python': '3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) \n[GCC 7.3.0]', 'platform': 'Linux-3.10.0-1160.31.1.el7.csd3.x86_64-x86_64-with-redhat-7.9-Nitrogen', 'event': 'prepare_vocab'}, {'msg': 'effective_min_count=1 leaves 29 word corpus (100.0%% of original 29, drops 0)', 'datetime': '2021-07-20T09:46:56.160040', 'gensim': '4.0.1', 'python': '3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) \n[GCC 7.3.0]', 'platform': 'Linux-3.10.0-1160.31.1.el7.csd3.x86_64-x86_64-with-redhat-7.9-Nitrogen', 'event': 'prepare_vocab'}, {'msg': 'downsampling leaves estimated 3.5001157321504532 word corpus (12.1%% of prior 29)', 'datetime': '2021-07-20T09:46:56.160376', 'gensim': '4.0.1', 'python': '3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) \n[GCC 7.3.0]', 'platform': 'Linux-3.10.0-1160.31.1.el7.csd3.x86_64-x86_64-with-redhat-7.9-Nitrogen', 'event': 'prepare_vocab'}, {'update': False, 'trim_rule': 'None', 'datetime': '2021-07-20T09:46:56.233809', 'gensim': '4.0.1', 'python': '3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) \n[GCC 7.3.0]', 'platform': 'Linux-3.10.0-1160.31.1.el7.csd3.x86_64-x86_64-with-redhat-7.9-Nitrogen', 'event': 'build_vocab'}, {'msg': 'training model with 3 workers on 12 vocabulary and 4 features, using sg=0 hs=0 sample=0.001 negative=5 window=3', 'datetime': '2021-07-20T09:46:56.234068', 'gensim': '4.0.1', 'python': '3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) \n[GCC 7.3.0]', 'platform': 'Linux-3.10.0-1160.31.1.el7.csd3.x86_64-x86_64-with-redhat-7.9-Nitrogen', 'event': 'train'}, {'msg': 'training on 29 raw words (3 effective words) took 0.0s, 1377 effective words/s', 'datetime': '2021-07-20T09:46:56.236277', 'gensim': '4.0.1', 'python': '3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) \n[GCC 7.3.0]', 'platform': 'Linux-3.10.0-1160.31.1.el7.csd3.x86_64-x86_64-with-redhat-7.9-Nitrogen', 'event': 'train'}]

Versions

Linux-3.10.0-1160.31.1.el7.csd3.x86_64-x86_64-with-redhat-7.9-Nitrogen
Python 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) [GCC 7.3.0]
Bits 64
NumPy 1.18.1
SciPy 1.4.1
gensim 4.0.1
FAST_VERSION 0

ginward avatar Jul 20 '21 08:07 ginward

Also discussed at: https://stackoverflow.com/q/68451937/130288

Most importantly: the words are already sorted in descending frequency by default, so the simplest workaround for any case like this is: just don't call sort_by_descending_frequency; it can only hurt.
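
(For reference, a quick sanity check that a freshly built vocab is already in descending-frequency order; a sketch using the public KeyedVectors accessors, assuming the model from the report above:)

# Sketch: confirm the vocab is already ordered by descending count.
# index_to_key and get_vecattr('count') are the gensim 4.x accessors.
counts = [model.wv.get_vecattr(word, 'count') for word in model.wv.index_to_key]
assert all(c1 >= c2 for c1, c2 in zip(counts, counts[1:])), "vocab not frequency-sorted"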

From followup there, it seems the current sort is non-stable - which is suboptimal here, as it'd be preferable for repeated sorts to be no-ops, rather than change the relative positions of (tied-in-frequency) words.
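
(As an illustration of why a stable sort would help - this is not the current gensim implementation - numpy's argsort with kind='stable' keeps tied counts in their existing relative order, so re-sorting an already-sorted vocab is the identity permutation:)

import numpy as np

counts = np.array([2, 3, 3, 1, 2])  # toy per-word frequencies

# 'stable' keeps tied counts in their current relative order.
order = np.argsort(-counts, kind='stable')
print(order)  # [1 2 0 4 3]

# Re-sorting the already-sorted counts yields the identity, i.e. a no-op:
print(np.argsort(-counts[order], kind='stable'))  # [0 1 2 3 4]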

Still, the code's intent was to handle reorderings even in such cases, and it's clearly not doing so - the pairwise similarity of any 2 words shouldn't be changed by this operation. It's possible the use of NumPy's fancy-indexing isn't behaving as expected - clobbering some values rather than creating a properly reordered version - on this line:

https://github.com/RaRe-Technologies/gensim/blob/b287fd841c31d0dfa899d784da0bd5b3669e104d/gensim/models/keyedvectors.py#L667
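
(A toy illustration of the suspected failure mode, not the actual gensim code: fancy indexing on the right-hand side builds a properly reordered copy, whereas permuting the same buffer element-by-element can read slots that were already overwritten:)

import numpy as np

vectors = np.arange(4, dtype=float)  # stand-in for per-word vectors: [0. 1. 2. 3.]
order = np.array([1, 0, 3, 2])       # desired permutation

# Correct: vectors[order] is evaluated as a new, properly reordered array.
print(vectors[order])  # [1. 0. 3. 2.]

# Clobbering: writing into the buffer while still reading from it.
clobbered = vectors.copy()
for dst, src in enumerate(order):
    clobbered[dst] = clobbered[src]
print(clobbered)  # [1. 1. 3. 3.] - slots 1 and 3 read already-overwritten values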

But also: it's unclear whether this method should ever be trying to do such reorderings after the vectors are allocated/initialized/trained. That would only be needed after some dynamic adjustments to the vocab and/or frequencies – a situation we've not handled very well, even as some incremental support has arrived. It's plausible that this method should just refuse to reorder a vocab when vectors exist, as a stopgap - unless the situations where that might be needed are better characterized with examples & a look at how to truly support them (such as smart merges of multiple vocabularies/KeyedVectors instances).
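
(A stopgap of that kind could look like the following; this is a hypothetical sketch, not existing gensim code, though the attribute names mirror KeyedVectors:)

def sort_by_descending_frequency(self):
    """Hypothetical stopgap: refuse to reorder once vectors have been allocated."""
    if len(self.vectors):
        raise RuntimeError(
            "re-sorting the vocab after vectors are allocated/trained is not supported; "
            "sort before training instead"
        )
    # ... existing reordering of index_to_key, key_to_index & per-key attributes ...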

gojomo avatar Jul 21 '21 16:07 gojomo

Is it possible that model loaded via load_facebook_model is not sorted by default? I did not see an option to sort vectors after loading Facebook models though ...

ginward avatar Jul 25 '21 11:07 ginward

It's possible, as there's no guarantee any given model is so sorted - you'd have to check with the creators of the model, or inspect it. But I'm pretty sure Facebook's FastText code also defaults to such a sort, so it'd be an odd choice not to sort the words in this traditional way.

(Still: no matter its original sort, or whether our re-sort is stable or not, the pairwise similarity calcs for any 2 words should be unchanged after such an operation.)
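
(That invariant is easy to check directly; a minimal sketch against the example above, using only the public similarity() API:)

# Sketch: pairwise similarities should survive a re-sort unchanged.
before = model.wv.similarity('human', 'interface')
model.wv.sort_by_descending_frequency()
after = model.wv.similarity('human', 'interface')
assert abs(before - after) < 1e-6, (before, after)  # fails with the buggy behaviour reported above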

gojomo avatar Jul 26 '21 20:07 gojomo

In this case, is this a possible bug in gensim then? 😂

ginward avatar Jul 27 '21 11:07 ginward

Yes, that's what I meant when I said: "Still, the code's intent was to handle reorderings even in such cases, and it's clearly not doing so - the pairwise similarity of any 2 words shouldn't be changed by this operation."

(Given the rarity of truly needing such a re-sort after allocating the vectors, & the overhead of doing such a sort, it's still possible the "fix" is just to declare such an operation isn't officially supported.)

gojomo avatar Jul 27 '21 16:07 gojomo