save_word2vec_format TypeError when specifying count in KeyedVectors initialization
Problem description
When KeyedVectors is initialized with preallocation (the count argument), the model cannot be stored with save_word2vec_format.
This breaks the pattern of preallocating once and then iteratively filling the model with add_vector, which exists precisely to avoid the big performance hit of growing the backing array on every addition.
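For context, the preallocate-then-fill pattern in question looks roughly like this (a minimal sketch; it assumes add_vector places each new key into the next preallocated slot rather than growing the array on every call):
from gensim.models import KeyedVectors
import numpy as np
kv = KeyedVectors(vector_size=25, count=100)  # preallocate 100 slots up front
for i in range(100):
    # each call fills the next free preallocated slot in place
    kv.add_vector(str(i), np.random.rand(25))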
Steps/code/corpus to reproduce
Minimal example:
from gensim.models import KeyedVectors
from gensim.test.utils import get_tmpfile
import numpy as np
keys = [str(x) for x in range(0, 100)]
vectors = np.random.rand(100, 25)
# This works as expected
model_1 = KeyedVectors(vector_size=25)
model_1.add_vectors(keys, vectors)
model_1.save_word2vec_format(get_tmpfile("test1.w2v"))
# But this fails when saving
model_2 = KeyedVectors(vector_size=25, count=100)
model_2.add_vectors(keys, vectors)
model_2.save_word2vec_format(get_tmpfile("test2.w2v"))
Traceback of model 2:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-34-687b0e2e8c4f> in <module>
15 model_2= KeyedVectors(vector_size=25, count=100)
16 model_2.add_vectors(keys, vectors)
---> 17 model_2.save_word2vec_format(get_tmpfile("test2.w2v"))
~/miniconda3/envs/myenv/lib/python3.8/site-packages/gensim/models/keyedvectors.py in save_word2vec_format(self, fname, fvocab, binary, total_vec, write_header, prefix, append, sort_attr)
1573 fout.write(f"{total_vec} {self.vector_size}\n".encode('utf8'))
1574 for key in keys_to_write:
-> 1575 key_vector = self[key]
1576 if binary:
1577 fout.write(f"{prefix}{key} ".encode('utf8') + key_vector.astype(REAL).tobytes())
~/miniconda3/envs/myenv/lib/python3.8/site-packages/gensim/models/keyedvectors.py in __getitem__(self, key_or_keys)
382 return self.get_vector(key_or_keys)
383
--> 384 return vstack([self.get_vector(key) for key in key_or_keys])
385
386 def get_index(self, key, default=None):
TypeError: 'NoneType' object is not iterable
Versions
Linux-3.10.0-1062.9.1.el7.x86_64-x86_64-with-glibc2.10
Python 3.8.5 | packaged by conda-forge | (default, Aug 21 2020, 18:21:27)
[GCC 7.5.0]
Bits 64
NumPy 1.19.1
SciPy 1.5.2
gensim 4.0.0beta
FAST_VERSION 1
Thanks for the clear report of an issue in 4.0.0b w/ reproduction code! Looking into it...
Figured it out: the model's count has to match the final number of keys, or it throws an error when stored with save_word2vec_format. That said, the file is still written properly; you'd just have to open it in a text editor and correct the count in its header for KeyedVectors.load_word2vec_format() to read it back.
tl;dr: You have to add exactly the number of vectors you preallocated for; preallocation doesn't work like malloc().
.save_word2vec_format() predates the later-added ability to expand a KeyedVectors, or to preallocate a larger-than-immediately-needed KeyedVectors for more efficient incremental additions, so it was never tested in such a situation. It will probably be enough to filter keys_to_write to include only non-None keys before trying to write anything. The docs for the constructor and/or the save method should note that, because there's no way to retain the preallocation in this format, only 'used' slots will appear in the saved file.
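A minimal sketch of that suggested filter, reusing the variable names from the traceback above (a hypothetical patch, not the actual fix):
# inside save_word2vec_format(), before the header is written:
# skip preallocated-but-never-filled slots, whose key is still None
keys_to_write = [key for key in keys_to_write if key is not None]
total_vec = len(keys_to_write)  # keep the header count consistent with the body
fout.write(f"{total_vec} {self.vector_size}\n".encode('utf8'))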
Thanks for looking into the issue. Indeed, adapting save_word2vec_format() to filter out None keys AND adjust the total_vec count does lead to the expected results.
Personally, I see the root of the problem in the way count is handled. In the given example, model_1 has a len() of 100, while model_2 has a len() of 200. This discrepancy stems from the rather convoluted logic that tracks next_index for add_vector, and which add_vectors ignores entirely. That index tracking appears to exist only for the case of preallocating memory once and then incrementally adding vectors until the slots are filled; beyond that point the performance benefit is gone, and batched adding with add_vectors is the better option anyway. Moreover, there seem to be no other features that would warrant such complications, such as reallocation or deletion of vectors (which would free their slots).
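To illustrate, continuing the reproduction above (the expected values follow from the behavior described here, assuming index_to_key keeps None placeholders for the unfilled preallocated slots):
print(len(model_1))               # 100 -- only the keys actually added
print(len(model_2))               # 200 -- 100 empty preallocated slots + 100 appended
print(model_2.index_to_key[:2])   # [None, None], never-filled slots
print(model_2.index_to_key[100])  # '0', the first key appended by add_vectors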
Hence, I wonder whether a PR adding extra logic to patch this is the way forward. Alternatively, a separate discussion on whether to deprecate preallocation, or to support it properly, could provide more long-term benefit.