gensim
Doc2Vec: How to get rid of the saved large model.dv.vectors.npy file?
Problem description
How to get rid of the large model.dv.vectors.npy file?
Steps/code/corpus to reproduce
According to http://arxiv.org/abs/1405.4053v2, during inference we only need the word vectors W and the model itself to infer a new paragraph vector; the dv is not used.
However, saving/loading a Doc2Vec model still requires the dv npy file. Is there any option to get rid of it? The file is too large to deploy.
@zhang8473 if you only need the word vectors, they're in the KeyedVectors attribute model.wv.
@gojomo is it true the .dv attribute can be deleted after training? Is it really not needed for anything?
IIRC we used to have a method that "finalized" training and cleaned up attributes not needed for inference, but we have moved in the direction of exporting relevant "submodels" (KeyedVectors) instead.
Many users are primarily interested in the doc-vectors for the training texts, as calculated during training - so for them, having the .dv around to look up precalculated doc-vectors is central – even if they will be doing some new inference. Of course, if they only need the precalculated doc-vectors, they can just save aside the .dv instance of KeyedVectors, & discard the rest of the model.
People can choose to re-infer vectors for the training texts – & done right, these might even be a tad better than the leftover training-time-vectors – but that's a lot more time-consuming than looking up a precalculated doc-vector.
If you're only interested in inference, then yes, the .dv values left over from training aren't needed.
From a quick glance at the code, I don't see anything that would obviously break if you just del the .dv attribute from the model object before trying a .save() then .load(). So, if you tried that and it didn't work, it'd be good to see the error(s) you hit, and in which version of Gensim.
(If for some reason the Gensim save/load do break when .dv is absent, it might be possible to make them tolerant of its being missing. Alternatively, it might be sufficient to manually del the attribute, then use standard Python pickling to store/reload instead of the gensim .save()/.load() methods.)
Thank you.
Setting
model.dv = KeyedVectors(model.vector_size)
solves the problem.
However, I looked through gensim 4.1.2 and I do not see any method that "finalizes training and cleans up attributes not needed for inference".
BTW,
model.dv = None does not work:
624 epochs = epochs or self.epochs
625
--> 626 doctag_vectors = pseudorandom_weak_vector(self.dv.vector_size, seed_string=' '.join(doc_words))
627 doctag_vectors = doctag_vectors.reshape(1, self.dv.vector_size)
628
AttributeError: 'NoneType' object has no attribute 'vector_size'
Doc2Vec stores vector_size in the dv.
That suggests (though there might be other complications) that we could make the class robust against a mere deletion of the .dv attribute by retrieving the vector_size from elsewhere – either from some other cached attribute (mildly redundant with the configuration in .dv in the common case), or perhaps from whichever syn1neg/syn1 internal matrix will "always" be present.
Obligatory question… any appetite for a PR? @zhang8473 @gojomo
I'm not qualified to propose solutions, but I am also interested in understanding the best approach for shrinking trained Doc2Vec model size. I'm looking forward to some additional documentation.
We train large Doc2Vec models (20+ GiB) but only use them for vector inference on (new, unseen) texts. Dropping .dv radically shrinks the file size while still permitting usage of .infer_vector(...). However, I am concerned about unforeseen consequences.