
save_word2vec_format TypeError when specifying count in KeyedVectors initialization

Open Iseratho opened this issue 4 years ago • 4 comments

Problem description

When KeyedVectors is initialized with preallocation, the model cannot be stored with save_word2vec_format. This defeats the point of preallocating and then filling the model iteratively with add_vector, which without preallocation would incur a big performance hit.

Steps/code/corpus to reproduce

Minimal example:

from gensim.models import KeyedVectors
from gensim.test.utils import get_tmpfile
import numpy as np

keys = [str(x) for x in range(0, 100)]
vectors = np.random.rand(100, 25)

# This works as expected
model_1 = KeyedVectors(vector_size=25)
model_1.add_vectors(keys, vectors)
model_1.save_word2vec_format(get_tmpfile("test1.w2v"))

# But this fails when saving
model_2 = KeyedVectors(vector_size=25, count=100)
model_2.add_vectors(keys, vectors)
model_2.save_word2vec_format(get_tmpfile("test2.w2v"))

Traceback of model 2:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-34-687b0e2e8c4f> in <module>
     15 model_2= KeyedVectors(vector_size=25, count=100)
     16 model_2.add_vectors(keys, vectors)
---> 17 model_2.save_word2vec_format(get_tmpfile("test2.w2v"))

~/miniconda3/envs/myenv/lib/python3.8/site-packages/gensim/models/keyedvectors.py in save_word2vec_format(self, fname, fvocab, binary, total_vec, write_header, prefix, append, sort_attr)
   1573                 fout.write(f"{total_vec} {self.vector_size}\n".encode('utf8'))
   1574             for key in keys_to_write:
-> 1575                 key_vector = self[key]
   1576                 if binary:
   1577                     fout.write(f"{prefix}{key} ".encode('utf8') + key_vector.astype(REAL).tobytes())

~/miniconda3/envs/myenv/lib/python3.8/site-packages/gensim/models/keyedvectors.py in __getitem__(self, key_or_keys)
    382             return self.get_vector(key_or_keys)
    383 
--> 384         return vstack([self.get_vector(key) for key in key_or_keys])
    385 
    386     def get_index(self, key, default=None):

TypeError: 'NoneType' object is not iterable

Versions

Linux-3.10.0-1062.9.1.el7.x86_64-x86_64-with-glibc2.10
Python 3.8.5 | packaged by conda-forge | (default, Aug 21 2020, 18:21:27) 
[GCC 7.5.0]
Bits 64
NumPy 1.19.1
SciPy 1.5.2
gensim 4.0.0beta
FAST_VERSION 1

Iseratho avatar Nov 19 '20 09:11 Iseratho

Thanks for the clear report of an issue in 4.0.0b w/ reproduction code! Looking into it...

gojomo avatar Dec 07 '20 18:12 gojomo

Figured it out: the model has to have a count matching the final number of keys, or it throws an error when saved with save_word2vec_format. That said, the file is still written; you just have to open a text editor and change the count in the file's header before KeyedVectors.load_word2vec_format() will open it properly.

tl;dr: you have to add exactly as many vectors as you preallocated for; count doesn't work like malloc().
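
For anyone who needs this workaround before a fix lands, a minimal sketch of that header repair in Python (text format only, not binary; the path is a stand-in for wherever the file was saved):

# The header line is "<count> <vector_size>"; overwrite the count with
# the number of vector lines actually present in the file.
path = "test2.w2v"  # wherever the failing save above wrote the file

with open(path, "r", encoding="utf8") as f:
    lines = f.readlines()

_, vector_size = lines[0].split()
lines[0] = f"{len(lines) - 1} {vector_size}\n"

with open(path, "w", encoding="utf8") as f:
    f.writelines(lines)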

lukelaurence avatar Jul 02 '22 01:07 lukelaurence

.save_word2vec_format() predates the later-added ability to expand KeyedVectors, or to preallocate a larger-than-immediately-needed KeyedVectors for more-efficient incremental additions - so was never tested in such a situation. It will probably be enough to filter keys_to_write to include only non-None keys, before trying to write anything. Docs for the constructor and/or the save-method should note that because there's no way to retain the pre-allocation in this format, only 'used' slots will be in the saved file.
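
For illustration, a rough sketch of that filtering (variable names taken from the 4.0.0b traceback above; untested, and the total_vec handling is an assumption):

# Sketch only: inside save_word2vec_format(), drop preallocated slots
# that were never filled; their entries in keys_to_write are None.
keys_to_write = [key for key in keys_to_write if key is not None]

# The count in the header has to match what is actually written,
# or load_word2vec_format() will reject the shorter file.
total_vec = len(keys_to_write)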

gojomo avatar Jul 03 '22 18:07 gojomo

Thanks for looking into the issue. Indeed, adapting save_word2vec_format() to filter out None keys and to adjust the total_vec value does lead to the expected results.

Personally, I see the root of the problem in how count is handled. In the example above, model 1 has a len() of 100, while model 2 has a len() of 200 (see below). The discrepancy stems from the rather convoluted logic for tracking next_index in add_vector, which add_vectors ignores entirely. Index tracking seems to exist only to support preallocating memory once and then incrementally adding vectors until the slots are filled; past that point the performance benefit is gone, and batched adding with add_vectors seems to be the better approach. Moreover, there seem to be no other features that would warrant such complications, such as reallocation or deletion of vectors (which would free their slots).
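
With the reproduction code from the top of this issue:

print(len(model_1))  # 100: only the vectors actually added
print(len(model_2))  # 200: 100 preallocated slots plus 100 appended by add_vectors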

Hence, I wonder whether a PR that adds more logic to patch this is the way forward. Alternatively, a separate discussion on whether to deprecate preallocation, or to add fuller support for it, could provide more long-term benefit.

Iseratho avatar Jul 04 '22 08:07 Iseratho