
WordEmbeddingsKeyedVectors.add() doesn't clear `vectors_norm`, causing `IndexError` on later `most_similar()`

Open gojomo opened this issue 6 years ago • 17 comments

As reported in a StackOverflow question/answer: https://stackoverflow.com/a/56641265/130288

An adapted version of the asker's minimal test case (which could become a unit test):

import numpy as np
from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors

kv = WordEmbeddingsKeyedVectors(vector_size=3)
kv.add(entities=['a', 'b'],
       weights=[np.random.rand(3), np.random.rand(3)])
kv.most_similar('a')  # works

kv.add(entities=['c'], weights=[np.random.rand(3)])
kv.most_similar('c')  # fails with `IndexError`

Clearing the vectors_norm property (with either del or assignment-to-None) should be sufficient to trigger re-calculation upon the next most_similar().
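The failure mode can be reproduced outside gensim with a plain-NumPy sketch of the stale cache (illustrative only; the variable names mirror the KeyedVectors attributes):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.random((2, 3))           # raw vectors for keys 'a' and 'b'
vectors_norm = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)  # cached by init_sims()

# add() appends to the raw array but leaves the cache untouched
vectors = np.vstack([vectors, rng.random(3)])   # new key 'c' at index 2

stale = False
try:
    vectors_norm[2]                    # most_similar('c') indexes the stale cache
except IndexError:
    stale = True
assert stale

# clearing the cache (del / assignment-to-None) forces a full rebuild on the next query
vectors_norm = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
assert vectors_norm.shape == vectors.shape
```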

gojomo avatar Jun 18 '19 03:06 gojomo

Hey there, original poster here. As I've added in my answer, below is a better solution.

While del or assignment to None would work, it slows down subsequent queries: all of the L2-normalized vectors cached in self.vectors_norm are discarded after every addition of new vectors, then recomputed in full. The approach below instead L2-normalizes only the newly added vectors and appends them to self.vectors_norm, reusing the existing cache for better performance.

In the gensim.models.keyedvectors module, under the class WordEmbeddingsKeyedVectors, we can change from

    def init_sims(self, replace=False):
        """Precompute L2-normalized vectors."""
        if getattr(self, 'vectors_norm', None) is None or replace:
            logger.info("precomputing L2-norms of word weight vectors")
            self.vectors_norm = _l2_norm(self.vectors, replace=replace)

to

    def init_sims(self, replace=False):
        """Precompute L2-normalized vectors."""
        if getattr(self, 'vectors_norm', None) is None or replace:
            logger.info("precomputing L2-norms of word weight vectors")
            self.vectors_norm = _l2_norm(self.vectors, replace=replace)
        elif len(self.vectors_norm) < len(self.vectors):
            # vectors were added after the cache was built: L2-normalize
            # only the new rows and append them to the existing cache
            logger.info("adding L2-norm vectors for new documents")
            diff = len(self.vectors) - len(self.vectors_norm)
            self.vectors_norm = vstack((self.vectors_norm, _l2_norm(self.vectors[-diff:])))

Essentially, what the original function does is: if there is no self.vectors_norm, compute it by L2-normalizing self.vectors. The change adds a second branch: if self.vectors contains newly added vectors that have not yet been pre-computed into L2-normalized form, normalize just those and append them to self.vectors_norm.
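As a standalone sanity check (plain NumPy, with l2_norm standing in for gensim's _l2_norm helper), the incremental append in the elif branch produces exactly the same cache as a full recompute:

```python
import numpy as np

def l2_norm(m):
    # row-wise unit-normalization, standing in for gensim's _l2_norm helper
    return m / np.linalg.norm(m, axis=1, keepdims=True)

rng = np.random.default_rng(1)
vectors = rng.random((4, 3))
vectors_norm = l2_norm(vectors)                       # cache built before the addition

vectors = np.vstack([vectors, rng.random((2, 3))])    # two new raw vectors appended

# incremental path from the elif branch: normalize only the tail and append it
diff = len(vectors) - len(vectors_norm)
vectors_norm = np.vstack([vectors_norm, l2_norm(vectors[-diff:])])

assert np.allclose(vectors_norm, l2_norm(vectors))    # matches a full recompute
```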

I'll add a pull request of this!

seungwooson avatar Jun 18 '19 20:06 seungwooson

Added a pull request here: https://github.com/RaRe-Technologies/gensim/pull/2533

seungwooson avatar Jun 18 '19 22:06 seungwooson

Thanks. @gojomo can you have a look?

piskvorky avatar Jun 19 '19 08:06 piskvorky

Thanks for the contribution! A reason this hasn't been handled efficiently is that these classes weren't originally designed for continual dynamic additions of new vectors. That's been gradually added over time.

I'm unsure, and will have to think a bit more, whether it's better for the unit-normed versions of added vectors to be added at the same time as the add() (so that once vectors_norm exists, the invariant is maintained that it always corresponds with any changes to the raw vectors), or for the extra norms to only be computed in a lazy fashion when needed (matching the original practice established by init_sims()).

gojomo avatar Jun 19 '19 20:06 gojomo

@gojomo Apologies, my fault for being unclear -- my change does not compute the unit-normed versions of added vectors at the same time as add(); I did not touch the add() method at all. The change is confined to init_sims(), which is invoked by methods like most_similar(), so the extra norms are still computed lazily, when needed, just as you describe. I avoided changing add() because doing so would require restructuring many other methods in the class (init_sims() might not be needed at all).

--edit: I think you did understand what I intended, and were saying you're unsure whether it's better to (a) change add() to compute unit-normed vectors on the fly, adjusting the other methods in the class accordingly, or (b) keep the current architecture and make the change only within init_sims() as I suggested. Computationally there shouldn't be much difference. Memory-wise, storing .vectors_norm eagerly in add() could take extra memory even when it's never needed -- though that could be mitigated by not storing .vectors at all and keeping only .vectors_norm as extra vectors are added. (We'd then lose the ability to retrieve the original vectors, but a boolean parameter set when the keyed vectors are initialized could let the user opt in.) Let me know what you think, and I can make the relevant changes and submit the PR again.

seungwooson avatar Jun 19 '19 21:06 seungwooson

Any status for the PR review?

mursabogdan1 avatar Feb 12 '20 09:02 mursabogdan1

I'll add some unittests and we should be good to go.

seungwooson avatar Feb 12 '20 10:02 seungwooson

Excellent - thank you very much.

mursabogdan1 avatar Feb 12 '20 10:02 mursabogdan1

With further consideration, I'd prefer that vectors_norm, if it exists at all, is always complete – rather than sometimes having a 'ragged' length-disagreement with the raw vectors. (There should be fewer, rather than more, halfway-done states in the object's lifecycle.)
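For contrast, the eager alternative could be sketched roughly like this (a toy class, not gensim's API: add() extends the cache in the same step whenever it exists, so the invariant holds):

```python
import numpy as np

class TinyKV:
    """Toy container; the attribute names mirror KeyedVectors, nothing else does."""

    def __init__(self, vectors):
        self.vectors = np.asarray(vectors, dtype=float)
        self.vectors_norm = None          # cache starts empty

    def init_sims(self):
        # lazy full build, as in the original init_sims()
        if self.vectors_norm is None:
            self.vectors_norm = self.vectors / np.linalg.norm(
                self.vectors, axis=1, keepdims=True)

    def add(self, new_vectors):
        new_vectors = np.atleast_2d(np.asarray(new_vectors, dtype=float))
        self.vectors = np.vstack([self.vectors, new_vectors])
        if self.vectors_norm is not None:
            # eager invariant: the cache, once built, is always complete
            normed = new_vectors / np.linalg.norm(new_vectors, axis=1, keepdims=True)
            self.vectors_norm = np.vstack([self.vectors_norm, normed])
```

With this shape, the cache is never 'ragged': it either doesn't exist yet, or it covers every row of the raw array.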

Separately, these classes are massively changed (& WordEmbeddingsKeyedVectors disappears entirely) in my in-progress #2698 big-refactor. So while your approach, now or as updated, may work & remain necessary, I'd want to hold off applying it until after #2698 lands, which could still be a few weeks away.

(Ultimately, it might be interesting to massively reduce the memory overhead of the vectors_norm cache by maintaining a single additional array of each raw vector's magnitude – or alternatively, 1/magnitude. Then the bulky array could be either the raw version or the unit-normed version, and a single multiplication could produce the 'other' version on demand – for individual rows, ranges, or all. However, I'm not sure that in that form there's still an easy way to use the efficient bulk/vectorized/pipelined similarity-to-all calculations in most_similar() etc. I suspect there may be, but I just don't know it yet.)

gojomo avatar Feb 12 '20 20:02 gojomo

Interesting idea.

The costly part of most_similar is the dot product of two matrices: (query_vecs, dim) x (dim, index_vecs). Having a vector of magnitudes for index_vecs stored separately (to rescale the result in one additional step) is no big deal, it shouldn't affect efficiency much.

The only "singularity" are empty vectors with zero norm, but that's an issue no matter what approach we choose.

piskvorky avatar Feb 12 '20 21:02 piskvorky

The issue would be if you'd still need to reify the full normed array, temporarily, every time you do the scale-and-dot operation. That'd be an allocation & computation slowdown (repeated every time, instead of once to prefill the cache) and your peak memory would still be just as high (so there's savings-at-rest, but not during repeated operations). Still, I suspect there might be some sort of optimized codepath that can do the scale-and-dot as a bulk/vectorized operation, so the full normed array isn't created, and the results are nearly-as-fast as any other way of doing it. I'm just not familiar enough with the BLAS/optimizing-compilation options to know for sure.

(A magnitudes, or alternatively reciprocal-of-magnitudes, array would also allow the "in-place" clobbering optimization to be reversed – we could conceivably have sets of vectors that are either 'raw', or normed, but always get back the other form when needed. Though all the special casing in such a situation might be bad complexity.)

gojomo avatar Feb 12 '20 22:02 gojomo

I don't see any slow down – it's the exact same dot operation. The extra rescaling step cossim(q, raw_index) = dot(q, raw_index) / norm(raw_index) is cheap in comparison, when q is large.

Or what do you mean, can you give an example @gojomo ? Do you have other operation than cossim in mind?

piskvorky avatar Feb 13 '20 08:02 piskvorky

Aha, yes. I'd forgotten that scalar multiplication commutes with the dot product, so I was only thinking of dot(q, raw_index/norm(raw_index)) – which could create a full-sized interim value. But of course scaling by the norm works just as well outside the dot().
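The identity is easy to verify in plain NumPy -- rescaling after the dot gives the same cosine similarities without ever materializing the full normed matrix (sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
q = rng.random(5)
q /= np.linalg.norm(q)                            # unit-normalized query
index_vecs = rng.random((100, 5))                 # raw index vectors
magnitudes = np.linalg.norm(index_vecs, axis=1)   # the proposed 1-D cache

# eager path: materialize the full normed matrix, then dot
normed = index_vecs / magnitudes[:, np.newaxis]
sims_eager = normed @ q

# lazy path: dot against the raw matrix, rescale afterwards --
# no full-sized interim copy is ever created
sims_lazy = (index_vecs @ q) / magnitudes

assert np.allclose(sims_eager, sims_lazy)
```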

OK, my thought is then that either as part of #2698, or soon after it otherwise lands, I'll:

  • rework init_sims(), most_similar(), get_vector(key, use_norm=True), & related ops to instead work via a V (vocabulary-size) array of cached vector magnitudes, rather than the V * vector_size cache of already-normalized vectors we've been holding in vectors_norm.
  • while this new array of cached magnitudes could be just named norms, I'm thinking that for clearer distinction with the previous approach, it should be called magnitudes instead
  • vectors_norm can be retained as a getter-method for compatibility, which can rebuild the whole normalized-vectors array on each call, but with a loud warning/deprecation.
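A rough sketch of what that magnitudes-based design might look like (class and method names are illustrative, not the eventual #2698 API):

```python
import numpy as np

class MagKV:
    """Sketch of a magnitudes-based KeyedVectors; names are illustrative."""

    def __init__(self, vectors):
        self.vectors = np.asarray(vectors, dtype=float)
        self.magnitudes = None            # V-sized cache, built lazily

    def init_sims(self):
        # rebuild (cheaply) whenever the cache is missing or stale
        if self.magnitudes is None or len(self.magnitudes) != len(self.vectors):
            self.magnitudes = np.linalg.norm(self.vectors, axis=1)

    def get_vector(self, index, use_norm=False):
        if not use_norm:
            return self.vectors[index]
        self.init_sims()
        # one division yields the unit-normed row on demand
        return self.vectors[index] / self.magnitudes[index]

    def similarities_to_all(self, index):
        # bulk cosine similarity: dot against raw vectors, rescale by magnitudes
        self.init_sims()
        q = self.get_vector(index, use_norm=True)
        return (self.vectors @ q) / self.magnitudes
```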

gojomo avatar Feb 13 '20 21:02 gojomo

Naming of variables TBD (in #2698 in general, when it's ready for review), but this magnitudes array is such a neat and simple idea I cannot believe we didn't think of it sooner. Keeping the huge duplicate 2D array in memory for years instead. Silly.

I just realized zero vectors shouldn't happen inside raw_index, so that's not an issue either.

piskvorky avatar Feb 13 '20 22:02 piskvorky

The details around this issue were lost in the transition to gensim 4.0, but the bug still exists in the KeyedVectors.add_vectors interface. Unless one calls wv.fill_norms(force=True), an IndexError can result from a later call to wv.similar_by_key.

groceryheist avatar Jun 13 '23 20:06 groceryheist

> The details around this issue were lost in the transition to gensim 4.0, but the bug still exists in the KeyedVectors.add_vectors interface. Unless one calls wv.fill_norms(force=True), an IndexError can result from a later call to wv.similar_by_key.

Can you share a representative error traceback? Any obvious fix?

gojomo avatar Jun 14 '23 02:06 gojomo

Here's a traceback:



  File "lib/python3.11/site-packages/gensim/models/keyedvectors.py", line 888, in similar_by_key
    return self.most_similar(positive=[key], topn=topn, restrict_vocab=restrict_vocab)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/gensim/models/keyedvectors.py", line 841, in most_similar
    mean = self.get_mean_vector(keys, weight, pre_normalize=True, post_normalize=True, ignore_missing=False)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/gensim/models/keyedvectors.py", line 514, in get_mean_vector
    vec = self.get_vector(key, norm=pre_normalize)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/gensim/models/keyedvectors.py", line 449, in get_vector
    result = self.vectors[index] / self.norms[index]
                                   ~~~~~~~~~~^^^^^^^
IndexError: index 1193518 is out of bounds for axis 0 with size 1193518

Not sure about a fix, as calling fill_norms(force=True) can be expensive. Perhaps it's possible to recompute the norms for just the added vectors, as in the fix proposed for 3.x. It might help to note that I'm testing the Mittens package with pretrained GloVe vectors to see if it works for fine-tuning.
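For illustration, an incremental repair along those lines can be sketched in plain NumPy (fill_norms_incremental is a hypothetical helper, not gensim's actual method body):

```python
import numpy as np

def fill_norms_incremental(vectors, norms):
    """Hypothetical helper: extend a stale 1-D norms cache to cover
    rows added since it was built, instead of recomputing everything."""
    if norms is None:
        return np.linalg.norm(vectors, axis=1)
    missing = len(vectors) - len(norms)
    if missing > 0:
        tail = np.linalg.norm(vectors[-missing:], axis=1)
        norms = np.concatenate([norms, tail])
    return norms

rng = np.random.default_rng(3)
vectors = rng.random((5, 4))
norms = np.linalg.norm(vectors, axis=1)              # cache before add_vectors
vectors = np.vstack([vectors, rng.random((2, 4))])   # two new vectors added

norms = fill_norms_incremental(vectors, norms)
assert np.allclose(norms, np.linalg.norm(vectors, axis=1))
```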

groceryheist avatar Jun 14 '23 04:06 groceryheist