Can't add vector to pretrained fasttext model via .add_vector
Problem description
I'm trying to add a new vector to a pretrained fastText model via `.add_vector`. However, it seems the vector is not added when I check via `.has_index_for`.
Steps/code/corpus to reproduce
>>> from gensim.models import fasttext
>>> import numpy as np
>>> ft_model = fasttext.load_facebook_vectors("fastText/cc.en.300.bin")
>>> ft_model.has_index_for("testtest")
False
>>> ft_model.add_vector("testtest", np.zeros((300,)))
2000000
>>> ft_model.has_index_for("testtest")
False
>>> ft_model.index_to_key[2000000]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: list index out of range
Versions
Windows-10-10.0.19041-SP0
Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
Bits 64
NumPy 1.20.3
SciPy 1.6.2
gensim 4.0.1
FAST_VERSION 1
Hi! Have you found a solution yet? I'm encountering the same issue with a 4.x-trained Word2Vec (cf. https://github.com/RaRe-Technologies/gensim/issues/3114):
When adding new vectors:

- using `m.wv.add_vector(word, vector)` works, BUT after saving to disk and reloading:
  - `m.wv.get_vector(w)` does return the vector
  - `m.wv.has_index_for(w)` does return True
  - `m.wv.most_similar(w)` returns `ValueError: operands could not be broadcast together with shapes (115026,) (115025,)`
- using `m.wv.add_vectors(list_of_words, list_of_numpy_arrays, replace=True)` works, BUT after saving to disk, the model does not load:
Traceback (most recent call last):
File "/home/simon/w2v/word2vec-utils/out/models/test_test.py", line 3, in <module>
m2 = gensim.models.KeyedVectors.load("model_32_5_word2vec_sanitised.model")
File "/home/simon/w2v/word2vec-utils/env2/lib/python3.9/site-packages/gensim/utils.py", line 486, in load
obj._load_specials(fname, mmap, compress, subname)
File "/home/simon/w2v/word2vec-utils/env2/lib/python3.9/site-packages/gensim/models/word2vec.py", line 1951, in _load_specials
self.make_cum_table() # rebuild cum_table from vocabulary
File "/home/simon/w2v/word2vec-utils/env2/lib/python3.9/site-packages/gensim/models/word2vec.py", line 835, in make_cum_table
count = self.wv.get_vecattr(word_index, 'count')
File "/home/simon/w2v/word2vec-utils/env2/lib/python3.9/site-packages/gensim/models/keyedvectors.py", line 368, in get_vecattr
return self.expandos[attr][index]
IndexError: index 115025 is out of bounds for axis 0 with size 115025
When updating existing vectors, `m.wv.add_vectors(list_of_words, list_of_numpy_arrays, replace=True)` works as intended.
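For what it's worth, the `IndexError` on load is consistent with the vectors matrix having grown while the parallel per-key attribute arrays (the "expandos", e.g. `'count'`) stayed at the old size. A plain-NumPy sketch of that presumed state (the sizes are taken from the traceback; everything else here is hypothetical, not gensim's actual internals):

```python
import numpy as np

# Hypothetical mock of the state after add_vectors(...) added a brand-new key:
# the vectors matrix grew by one row, but the parallel per-key attribute array
# (the 'count' expando that make_cum_table() reads on load) did not.
n_old = 115025
vectors = np.zeros((n_old + 1, 4))       # grew to hold the new key
counts = np.ones(n_old, dtype=np.int64)  # 'count' expando left at the old size

new_index = n_old  # index assigned to the newly added key
try:
    counts[new_index]  # roughly what get_vecattr(word_index, 'count') does
except IndexError as err:
    print(err)  # index 115025 is out of bounds for axis 0 with size 115025
```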
Versions
Ubuntu 21.04 x86_64
Python 3.9.5 (default, May 11 2021, 08:20:37)
Bits 64
NumPy 1.21.2
SciPy 1.6.3
gensim 4.1.2
FAST_VERSION 1
Thanks
Thanks for reporting. @gojomo any ideas? IIRC you rewrote this part for Gensim 4.0.
Although all in the same related `add_vector(s)` functionality, these may be 3 separate bugs:

- the initial submission by @om-hb here: `add_vector()` for one vector to `FastTextKeyedVectors` not having the intended effect (and thus giving an `IndexError` for an index that should be present), even without any save/load cycle
- the `ValueError` for `KeyedVectors` reported by @faustusdotbe, when doing a `.most_similar()` after both an `add_vector()` and a save/reload
- the `IndexError` for `KeyedVectors` reported by @faustusdotbe, preventing a `.load()` after an `.add_vectors()` call that added new vectors (rather than just replacing existing ones)
While the traceback for that last one specifically implicates some of the pre-4.0 refactorings, the `add()` methods were something of an afterthought on types that, when first written, didn't support growing the set of word vectors after the original allocation. So some of these may have been issues pre-4.0 as well.
The code clearly needs better test coverage and a deeper look/rework for consistency and correctness, which would probably resolve these (and #3114). I might have some time for this next week. As each of these looks pretty easy to reproduce, anyone wanting to contribute minimal self-contained test cases triggering each failure in the current code could give any future fixes a running head start.
Is this problem still unresolved? I have reproduced the `ValueError: operands could not be broadcast together with shapes` error with the following program, using the source code from the develop branch:
from gensim.models import Word2Vec
import numpy
model = Word2Vec(sentences=[
["this", "is", "test1"],
["that", "is", "test2"],
], vector_size=2, min_count=1)
print(model.wv.most_similar("test1", topn=1)) #=> [('test2', 0.9941185712814331)]
model.wv.add_vectors(["test3"], [numpy.array([0.5, 0.5])])
print(model.wv.most_similar("test1", topn=1)) #=> ValueError: operands could not be broadcast together with shapes (6,) (5,)
- Google colab
- Python 3.7.13
- gensim 4.1.3.dev0
Do you have a solution? Thanks.
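The shape mismatch in that `ValueError` can be reproduced with plain NumPy. This is a sketch of the presumed mechanism only (a cached norms vector that is one entry shorter than the grown vectors matrix), not gensim's actual internals:

```python
import numpy as np

# Similarity lookups divide the dot products by cached per-row norms.
# If a vector is added without refreshing that cache, the division
# tries to broadcast a length-6 result against a length-5 norms array.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(6, 2))            # 5 original rows + 1 newly added
norms = np.linalg.norm(vectors[:5], axis=1)  # stale cache from before the add

query = vectors[0] / np.linalg.norm(vectors[0])
try:
    sims = (vectors @ query) / norms         # shapes (6,) vs (5,)
except ValueError as err:
    print(err)  # operands could not be broadcast together with shapes (6,) (5,)
```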
I have found that this `ValueError` can be resolved by calling `fill_norms(force=True)`:
from gensim.models import Word2Vec
import numpy
model = Word2Vec(sentences=[
["this", "is", "test1"],
["that", "is", "test2"],
], vector_size=2, min_count=1)
print(model.wv.most_similar("test1", topn=1)) #=> [('test2', 0.9941185712814331)]
model.wv.add_vectors(["test3"], [numpy.array([0.5, 0.5])])
model.wv.fill_norms(force=True) # added
print(model.wv.most_similar("test1", topn=1)) #=> [('test2', 0.9941185712814331)]
Thanks for the fix! Yes, properly finishing/verifying/testing the `add_vector(s)` work requires the norms to be flushed or updated after such operations. The PR looks reasonable, but I think it's still missing related cases we'd want to fix:

- if existing vectors are changed, but none net added, the existing `len(self.norms)` check might not trigger the right updates
- similarly, the single `add_vector` has an accommodation for the vector landing in slots that were pre-allocated (so that loops of sequential single-adds don't trigger time/space-expensive grows every time), and I think that path also fails to trigger the necessary update
The docs for the add methods may also deserve extra warnings about how such expansions to a `KeyedVectors` that's still a subsidiary component of a larger model may break assumptions the model was counting on. (E.g., adding just to a KV won't update the enclosing model's output layer `syn1neg` or `syn1`, or other frequency/encoding tables, so such additions are unsupported, with undefined and likely error-triggering effects, if any further enclosing-model operations, like further training, are attempted.)
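A toy illustration of that broken invariant, using hypothetical arrays rather than the real model attributes: the model's output layer is sized to the original vocabulary, so growing only the KV leaves the two out of sync.

```python
import numpy as np

# Hypothetical stand-ins: wv_vectors plays the role of the KeyedVectors'
# vectors matrix, syn1neg the enclosing model's output-layer weights.
# Both start row-aligned at a 5-word vocabulary.
wv_vectors = np.zeros((5, 2))
syn1neg = np.zeros((5, 2))

# An add_vector-style expansion touches only the KV side...
wv_vectors = np.vstack([wv_vectors, np.full((1, 2), 0.5)])

# ...so the row counts no longer match; further training that indexes
# syn1neg by the new word's index (5) would fail or corrupt state.
assert wv_vectors.shape[0] != syn1neg.shape[0]
```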
@gojomo Thanks for your comment. Indeed, this PR https://github.com/RaRe-Technologies/gensim/pull/3320 still does not seem to work well when updating vectors.
Does this mean that I should stop checking with `len(self.norms)` when the vectors are updated, and always use `fill_norms(force=True)`?
Or should the correct specification be that the `add` methods may cause some features of the model to stop working? For example, should the library user explicitly run `fill_norms(force=True)` before running `most_similar`? In that case, I think the `add` methods need a warning, as you say.
In general, it's never guaranteed that the norms have been calculated, so any method that requires them calls `fill_norms()` just in case.
So I think any method that invalidates anything about the current norms can just discard them (`self.norms = None`), similar to `resize_vectors()`.
This 'lazy' approach means you can do a bunch of invalidating ops (like sequential adds) in a row, without paying the cost of the full norm-refresh until they're needed again.
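The lazy approach described above can be sketched with a minimal stand-in class (a hypothetical `MiniKV`, not gensim code): mutating operations only discard the cache, and any consumer that needs norms refreshes it on demand.

```python
import numpy as np

class MiniKV:
    """Toy illustration of lazy norm invalidation, not gensim's real class."""

    def __init__(self, vectors):
        self.vectors = np.asarray(vectors, dtype=float)
        self.norms = None                 # computed lazily, never eagerly

    def fill_norms(self, force=False):
        # Recompute only when the cache is missing (or explicitly forced).
        if self.norms is None or force:
            self.norms = np.linalg.norm(self.vectors, axis=1)

    def add_vectors(self, new):
        self.vectors = np.vstack([self.vectors, new])
        self.norms = None                 # invalidate; don't pay to recompute yet

    def unit_rows(self):
        self.fill_norms()                 # consumers refresh on demand
        return self.vectors / self.norms[:, np.newaxis]

kv = MiniKV([[3.0, 4.0]])
kv.unit_rows()                            # norms computed here
kv.add_vectors([[0.0, 2.0]])              # cheap: cache is just discarded
print(kv.unit_rows().shape)               # (2, 2), norms recomputed once
```

A sequence of adds in a row therefore costs nothing beyond the `vstack` itself; the full norm refresh happens once, at the next similarity-style query.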