Adding new tags in doctag_vectors in
Hello!
I am training a doc2vec model on a tagged docset. I need to update it with new sets that contain new tags. Is there a way to update doc-vectors in gensim.doc2vec? How can I do it?
There is an old issue https://github.com/RaRe-Technologies/gensim/issues/1019 on the same topic, but it didn't help me as there were many changes in gensim. Maybe there is another way?
Expanding the set of known doctags hasn't been supported; the work allowing expansion of the Word2Vec vocabulary (via `build_vocab(..., update=True)`) was never tested/completed for Doc2Vec, with intermittent crashing bugs like #1019.
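For reference, the incremental pattern that does work for plain Word2Vec looks roughly like this (a minimal sketch with toy data; the analogous doctag-expansion path on a Doc2Vec model is what was never completed):

```python
from gensim.models import Word2Vec

# Initial training on a first batch of tokenized sentences (toy data).
first_batch = [["human", "interface", "computer"], ["survey", "user", "computer"]]
model = Word2Vec(first_batch, vector_size=100, min_count=1, epochs=20)

# Word2Vec can grow its vocabulary from a later batch...
new_batch = [["graph", "minors", "survey"], ["trees", "graph", "paths"]]
model.build_vocab(new_batch, update=True)
model.train(new_batch, total_examples=len(new_batch), epochs=model.epochs)

# ...but the equivalent update=True handling for Doc2Vec doctags was never
# finished/tested, hence crashes like the one in #1019.
```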
Note that even if supported, such incremental expansions of a model are fraught with difficult tradeoffs. To the extent a new batch contains a different mix of words, word-senses, & topics than earlier data – & if it didn't, why bother with more training? – it will only "drag" parts of the model towards new weights, leaving others untouched, which risks degrading its overall usefulness unless you're carefully considering the mixes/balances between older & newer training data, & monitoring for ill-effects. (You can't assume incremental batches of new training are always improving things.)
The surest way to ensure balance between all training data is to re-train everything in one session. That is, when new data arrives, add it to the full corpus, & train again on the full corpus, & use the later model's values instead of any earlier model's (with which the later model's coordinates may not be compatible).
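For example, a full re-train might look like this (a minimal sketch; `load_tagged_docs` & the parameters are placeholders for your own corpus handling):

```python
from gensim.models.doc2vec import Doc2Vec

# Hypothetical loader returning TaggedDocument(words=[...], tags=['doc123']) items.
old_docs = list(load_tagged_docs("original_corpus"))
new_docs = list(load_tagged_docs("new_batch"))
full_corpus = old_docs + new_docs

# Train a fresh model on everything; don't mix its vectors with those of any
# earlier model, since the coordinate spaces aren't compatible.
model = Doc2Vec(vector_size=100, min_count=2, epochs=20)
model.build_vocab(full_corpus)
model.train(full_corpus, total_examples=model.corpus_count, epochs=model.epochs)
```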
But if you thought you really needed to do smaller incremental updates, other options could include:

- Using `infer_vector()` to obtain vectors for new docs, using the frozen vocabulary/weights of the prior model. No new words would be learned, nor tags inserted into the model's set of known tag-vectors, but you could collect these new doc-vectors, and potentially also merge them with the original set of tag-vectors into some new, outside-the-model combined structure for searching them all. (See the first sketch after this list.)
- Pre-reserving some tags for expected later batch training. EG: if your initial training contains 100,000 docs, & you know another 50,000 docs will appear later, you could include another 50,000 dummy docs with pre-reserved tags in your initial training - their vectors would be random junk at first. But calling `train()` later with these pre-reserved tags would improve those vectors, albeit with the same relative-balance issues I mentioned above. (Without interleaved re-presentation of the original 100,000 docs, the model might get arbitrarily well-customized to the new docs, and tag-vectors/words would drift further out of comparability with the earlier docs. See the second sketch after this list.)
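A rough sketch of the `infer_vector()` option, collecting the inferred vectors alongside the model's existing tag-vectors in a separate `KeyedVectors` for combined searches. Here `new_docs` is a placeholder for your later batch, as `(tag, tokens)` pairs; `model.dv` is the gensim 4.x name for the doc-vectors (`model.docvecs` in older releases):

```python
from gensim.models import KeyedVectors

# Infer vectors for the new docs against the frozen model; nothing inside
# the model changes, and unknown words in the new docs are simply ignored.
new_tags, new_vecs = [], []
for tag, tokens in new_docs:
    new_tags.append(tag)
    new_vecs.append(model.infer_vector(tokens))

# Build an outside-the-model structure holding both the original tag-vectors
# and the newly inferred ones, for unified similarity searches.
combined = KeyedVectors(model.vector_size)
combined.add_vectors(model.dv.index_to_key, model.dv.vectors)
combined.add_vectors(new_tags, new_vecs)

print(combined.most_similar(new_tags[0], topn=10))
```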
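And a sketch of the pre-reserved-tags idea (the counts, tag names, & the `initial_docs`/`new_token_lists` variables are just illustrative placeholders):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Pad the initial corpus with dummy docs whose only purpose is to reserve tags;
# their vectors will be meaningless noise until real training data uses them.
reserved = [TaggedDocument(words=["placeholder"], tags=["reserved_%d" % i])
            for i in range(50000)]
model = Doc2Vec(vector_size=100, min_count=2, epochs=20)
model.build_vocab(initial_docs + reserved)
model.train(initial_docs + reserved, total_examples=model.corpus_count,
            epochs=model.epochs)

# Later, assign the reserved tags to the new docs and train further, keeping
# in mind the relative-balance caveats above (no new words are learned, and
# without re-presenting the original docs the spaces drift apart).
later_docs = [TaggedDocument(words=tokens, tags=["reserved_%d" % i])
              for i, tokens in enumerate(new_token_lists)]
model.train(later_docs, total_examples=len(later_docs), epochs=model.epochs)
```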
(There might be other options, depending on the details of how you're using the model/doc-vectors downstream.)