
Doc2Vec: How to get rid of the saved large model.dv.vectors.npy file?

Open zhang8473 opened this issue 3 years ago • 6 comments

Problem description

How to get rid of the large model.dv.vectors.npy file?

Steps/code/corpus to reproduce

According to http://arxiv.org/abs/1405.4053v2, during inference we only need the word vectors W and the model to infer a new paragraph vector. The .dv vectors are not used.

However, saving/loading a Doc2Vec model requires the dv npy file. Is there any option to get rid of it? This file is too large to deploy.

zhang8473 avatar Feb 25 '22 06:02 zhang8473

@zhang8473 if you only need the word vectors, they're in the KeyedVectors attribute model.wv.

@gojomo is it true the .dv attribute can be deleted after training? Is it really not needed for anything?

IIRC we used to have some method that "finalized" training and cleaned up attributes not needed for inference. But we moved in the direction of exporting relevant "submodels" (KeyedVectors) instead.

piskvorky avatar Feb 25 '22 08:02 piskvorky

Many users are primarily interested in the doc-vectors for the training texts, as calculated during training - so for them, having the .dv around to look up precalculated doc-vectors is central – even if they will be doing some new inference. Of course, if they only need the precalculated doc-vectors, they can just save aside the .dv instance of KeyedVectors, & discard the rest of the model.

People can choose to re-infer vectors for the training texts – & done right, these might even be a tad better than the leftover training-time-vectors – but that's a lot more time-consuming than looking up a precalculated doc-vector.

If you're only interested in inference, then yes, the .dv values left over from training aren't needed.

From a quick glance at the code, I'm not seeing anything that looks like it would obviously break if you just del the .dv attribute from the model object before trying a .save() then .load(). So, if you tried that and it didn't work, it'd be good to see the error(s) you hit, in which version of Gensim.

(If for some reason the Gensim save/load do break when .dv is absent, it might be possible to make them tolerant of its being missing. Alternatively, it might work to just del the attribute, then use standard Python pickling to store/reload instead of the Gensim .save()/.load() methods.)

gojomo avatar Feb 25 '22 09:02 gojomo

Thank you. Setting model.dv = KeyedVectors(model.vector_size) solves the problem.

However, I looked through gensim 4.1.2 and I do not see any "method that finalized training and cleaned up attributes not needed for inference".


BTW, model.dv = None does not work:

    624         epochs = epochs or self.epochs
    625 
--> 626         doctag_vectors = pseudorandom_weak_vector(self.dv.vector_size, seed_string=' '.join(doc_words))
    627         doctag_vectors = doctag_vectors.reshape(1, self.dv.vector_size)
    628 

AttributeError: 'NoneType' object has no attribute 'vector_size'

Doc2Vec stores the vector_size inside .dv, so infer_vector needs .dv to be present.

zhang8473 avatar Feb 27 '22 04:02 zhang8473

That suggests (though there might be other complications) that we could make the class robust against a mere deletion of the .dv attribute by retrieving the vector_size from elsewhere: either from some other cached attribute (mildly redundant with the configuration in .dv in the common case), or perhaps via whichever syn1neg/syn1 internal matrix will "always" be present.

gojomo avatar Feb 28 '22 17:02 gojomo

Obligatory question… any appetite for a PR? @zhang8473 @gojomo

piskvorky avatar Feb 28 '22 17:02 piskvorky

I'm not qualified to propose solutions, but I am also interested in understanding the best approach for shrinking trained Doc2Vec model size. I'm looking forward to some additional documentation.

We train large Doc2Vec models (20+ GiB) but only use them for vector inference on (new, unseen) texts. Dropping .dv radically shrinks the file size while still permitting usage of .infer_vector(...). However, I am concerned about unforeseen consequences.

afparsons avatar Jun 07 '22 03:06 afparsons