
Can't add vector to pretrained fasttext model via .add_vector

Open om-hb opened this issue 3 years ago • 8 comments

Problem description

I'm trying to add a new vector to a pretrained fastText model via .add_vector. However, the vector does not seem to be added: .has_index_for still returns False afterwards.

Steps/code/corpus to reproduce

>>> from gensim.models import fasttext
>>> import numpy as np
>>> ft_model = fasttext.load_facebook_vectors("fastText/cc.en.300.bin")
>>> ft_model.has_index_for("testtest")
False
>>> ft_model.add_vector("testtest", np.zeros((300,)))
2000000
>>> ft_model.has_index_for("testtest")
False
>>> ft_model.index_to_key[2000000]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range

Versions

Windows-10-10.0.19041-SP0
Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
Bits 64
NumPy 1.20.3
SciPy 1.6.2
gensim 4.0.1
FAST_VERSION 1

om-hb avatar Aug 31 '21 13:08 om-hb

Hi! Have you found a solution yet? I'm encountering the same issue with a 4.x-trained Word2Vec (cf. https://github.com/RaRe-Technologies/gensim/issues/3114):

When adding new vectors:

  • using m.wv.add_vector(word, vector) works, BUT after saving to disk and reloading:

    • m.wv.get_vector(w) does return the vector
    • m.wv.has_index_for(w) does return True
    • m.wv.most_similar(w) raises ValueError: operands could not be broadcast together with shapes (115026,) (115025,)
  • using m.wv.add_vectors(list_of_words, list_of_numpy_arrays, replace=True) works, BUT after saving to disk, the model does not load:

Traceback (most recent call last):
  File "/home/simon/w2v/word2vec-utils/out/models/test_test.py", line 3, in <module>
    m2 = gensim.models.KeyedVectors.load("model_32_5_word2vec_sanitised.model")
  File "/home/simon/w2v/word2vec-utils/env2/lib/python3.9/site-packages/gensim/utils.py", line 486, in load
    obj._load_specials(fname, mmap, compress, subname)
  File "/home/simon/w2v/word2vec-utils/env2/lib/python3.9/site-packages/gensim/models/word2vec.py", line 1951, in _load_specials
    self.make_cum_table()  # rebuild cum_table from vocabulary
  File "/home/simon/w2v/word2vec-utils/env2/lib/python3.9/site-packages/gensim/models/word2vec.py", line 835, in make_cum_table
    count = self.wv.get_vecattr(word_index, 'count')
  File "/home/simon/w2v/word2vec-utils/env2/lib/python3.9/site-packages/gensim/models/keyedvectors.py", line 368, in get_vecattr
    return self.expandos[attr][index]
IndexError: index 115025 is out of bounds for axis 0 with size 115025

When only updating existing vectors, m.wv.add_vectors(list_of_words, list_of_numpy_arrays, replace=True) works as intended.
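
For reference, a minimal sketch of the failing add-then-save-then-reload sequence (the toy corpus and filename are illustrative, not the actual model above, and this is untested as an exact reproduction):

import numpy as np
from gensim.models import Word2Vec

# Train a tiny model purely to have something to save and reload.
m = Word2Vec(sentences=[["a", "b", "c"], ["b", "c", "d"]],
             vector_size=4, min_count=1)

# Add a brand-new key alongside the existing vocabulary.
m.wv.add_vectors(["newword"], [np.zeros(4, dtype=np.float32)], replace=True)

m.save("model_with_added_vector.model")
m2 = Word2Vec.load("model_with_added_vector.model")  # expected to hit the IndexError in make_cum_table() shown above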

Versions

Ubuntu 21.04 x86_64
Python 3.9.5 (default, May 11 2021, 08:20:37)
Bits 64
NumPy 1.21.2
SciPy 1.6.3
gensim 4.1.2
FAST_VERSION 1

Thanks

drvenabili avatar Oct 05 '21 08:10 drvenabili

Thanks for reporting. @gojomo any ideas? IIRC you rewrote this part for Gensim 4.0.

piskvorky avatar Oct 05 '21 08:10 piskvorky

Although all within the same add_vector(s) functionality, these may be 3 separate bugs:

  • the initial submission by @om-hb here: add_vector() for one vector to FastTextKeyedVectors not having the intended effect (and thus giving an IndexError for an index that should be present), even without any save/load cycle
  • the ValueError for KeyedVectors reported by @faustusdotbe, when doing a .most_similar() after both an add_vector() & a save/reload
  • the IndexError for KeyedVectors reported by @faustusdotbe, preventing a .load() after an .add_vectors() call that added new vectors (rather than just changing existing ones)

While the traceback for that last one specifically implicates some of the pre-4.0 refactorings, the add() methods were sort of an afterthought on types that, when first created, didn't support growing the set of word-vectors after the original allocation. So some of these may have been issues pre-4.0 as well.

The code clearly needs better test coverage & a deeper look/rework for consistency/correctness, which would probably resolve these (& #3114). I might have some time for this next week. As each of these looks pretty easy to reproduce, anyone wanting to contribute minimal self-contained test cases triggering each of the failures in current code could give any future fixes a running head-start.
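
As a starting point, a minimal self-contained test case for the norms-related ValueError might look something like this sketch (the test class and method names are hypothetical, not existing gensim tests):

import unittest
import numpy as np
from gensim.models import KeyedVectors

class TestAddVectorNorms(unittest.TestCase):
    def test_most_similar_after_add_vector(self):
        kv = KeyedVectors(vector_size=3)
        kv.add_vectors(["a", "b"],
                       [np.array([1.0, 0.0, 0.0], dtype=np.float32),
                        np.array([0.0, 1.0, 0.0], dtype=np.float32)])
        kv.most_similar("a")  # forces norm computation over the 2 existing keys
        kv.add_vector("c", np.array([0.5, 0.5, 0.0], dtype=np.float32))
        kv.most_similar("c")  # previously raised ValueError (stale norms: shapes (3,) vs (2,))

if __name__ == "__main__":
    unittest.main()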

gojomo avatar Oct 06 '21 22:10 gojomo

Is this problem still unresolved? I have reproduced the ValueError: operands could not be broadcast together with shapes error with the following program, using the source code from the develop branch.

from gensim.models import Word2Vec
import numpy

model = Word2Vec(sentences=[
    ["this", "is", "test1"],
    ["that", "is", "test2"],
], vector_size=2, min_count=1)

print(model.wv.most_similar("test1", topn=1)) #=> [('test2', 0.9941185712814331)]

model.wv.add_vectors(["test3"], [numpy.array([0.5, 0.5])])

print(model.wv.most_similar("test1", topn=1)) #=> ValueError: operands could not be broadcast together with shapes (6,) (5,)

  • Google colab
  • Python 3.7.13
  • gensim 4.1.3.dev0

Do you have a solution? Thanks.

s2terminal avatar Apr 08 '22 03:04 s2terminal

I have found that this ValueError can be resolved by calling fill_norms(force=True).

from gensim.models import Word2Vec
import numpy

model = Word2Vec(sentences=[
    ["this", "is", "test1"],
    ["that", "is", "test2"],
], vector_size=2, min_count=1)

print(model.wv.most_similar("test1", topn=1)) #=> [('test2', 0.9941185712814331)]

model.wv.add_vectors(["test3"], [numpy.array([0.5, 0.5])])

model.wv.fill_norms(force=True) # added

print(model.wv.most_similar("test1", topn=1)) #=> [('test2', 0.9941185712814331)]

s2terminal avatar Apr 08 '22 03:04 s2terminal

Thanks for the fix! Yes, properly finishing/verifying/testing the add_vector(s) functionality requires the norms to be flushed or updated after such operations. The PR looks reasonable, but I think it's still missing related cases we'd want to fix:

  • if existing vectors are changed, but none net added, the existing len(self.norms) check might not trigger the right updates
  • similarly, the single add_vector makes an accommodation for the vector landing in slots that were pre-allocated (so that loops of sequential single-adds don't trigger time/space-expensive grows every time), and I think that path can also fail to trigger the necessary update

The docs for the add methods may also deserve extra warnings about how such expansions to a KeyedVectors that's still a subsidiary component of a larger model may break assumptions the model was counting on. (E.g., adding just to a KV won't update the enclosing model's output layer syn1neg or syn1, or other frequency/encoding tables. So such additions are unsupported, with undefined & likely error-triggering effects, if any further enclosing-model operations, like further training, are then attempted.)
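
To illustrate that hazard with a toy example (not from this thread): growing just the KV leaves the enclosing model's output layer behind:

import numpy as np
from gensim.models import Word2Vec

model = Word2Vec(sentences=[["x", "y", "z"]], vector_size=4, min_count=1)
model.wv.add_vector("extra", np.zeros(4, dtype=np.float32))

print(len(model.wv.index_to_key))  # 4 keys in the KeyedVectors now
print(model.syn1neg.shape)         # still (3, 4): out of sync with wv
# Any further training would now operate on mismatched structures.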

gojomo avatar Apr 09 '22 20:04 gojomo

@gojomo Thanks for your comment. Indeed, this PR https://github.com/RaRe-Technologies/gensim/pull/3320 still does not seem to work well when updating vectors.

Does this mean that I should stop checking with len(self.norms) when the vectors are updated and always use fill_norms(force=True)?

Or should the correct specification be that the add method may cause some features of the model to stop working? As an example, should the library user explicitly run fill_norms(force=True) before running most_similar? In this case, I think the add method needs a warning, as you say.

s2terminal avatar Apr 15 '22 06:04 s2terminal

In general, it's never guaranteed that the norms have been calculated, so any method that requires them calls fill_norms() just in case.

So I think any method that invalidates anything about the current norms can just discard them, self.norms = None, similar to resize_vectors().

This 'lazy' approach means you can do a bunch of invalidating ops (like sequential adds) in a row, without paying the cost of the full norm-refresh until they're needed again.
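
A minimal self-contained sketch of that lazy pattern (a toy class for illustration, not gensim's actual implementation):

import numpy as np

class TinyKeyedVectors:
    def __init__(self, vector_size):
        self.vectors = np.zeros((0, vector_size), dtype=np.float32)
        self.norms = None  # computed lazily, on demand

    def add_vectors(self, weights):
        self.vectors = np.vstack([self.vectors, np.asarray(weights, dtype=np.float32)])
        self.norms = None  # just discard; cheap even across many sequential adds

    def fill_norms(self, force=False):
        if self.norms is None or force:
            self.norms = np.linalg.norm(self.vectors, axis=1)

    def most_similar_index(self, query):
        self.fill_norms()  # recomputed only if something invalidated the norms
        query = np.asarray(query, dtype=np.float32)
        sims = (self.vectors @ query) / (self.norms * np.linalg.norm(query))
        return int(np.argmax(sims))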

gojomo avatar Apr 15 '22 10:04 gojomo