misc ways to improve `infer_vector`
The default `steps` should probably be higher: perhaps 10, or the same as the training `iter` value.
If there are no known tokens in the supplied `doc_words`, the method will return the randomly-initialized, low-magnitude, untrained starting vector. (There were no target words to predict, and thus inference was a no-op, except for that random initialization.) This is probably not what the user wants, and the realistic-looking values may fool the user into thinking some work has happened. It would probably be better to return some constant (perhaps zero) vector in such cases, to make clear that all such text-runs with unknown words are similarly opaque, as far as the model is concerned.
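Until the method changes, a user-side guard can approximate that behavior. A minimal sketch, assuming an already-loaded `Doc2Vec` model; the `safe_infer_vector` name is made up here, not part of gensim:

```python
import numpy as np

def safe_infer_vector(model, doc_words):
    """Return an all-zero vector when none of doc_words is in the model's vocabulary,
    so callers can tell 'no inference happened' apart from a real inferred vector."""
    known = [w for w in doc_words if w in model.wv]  # vocabulary membership check
    if not known:
        return np.zeros(model.vector_size, dtype=np.float32)
    return model.infer_vector(known)
```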
The inference loop might be able to make use of #450 word/example batching (especially if it is extended to decay alpha), to turn all steps into one native call.
The method could take many examples, to infer a chunk of examples in one call. At the extreme, it could also make use of multithreading so such a large chunk finishes more quickly.
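As a stop-gap before any native batch API exists, a batch can be fanned out over a thread pool on the user side. A rough sketch (the `infer_batch` helper is hypothetical; actual gains depend on how much of inference runs outside the GIL in your gensim version):

```python
from concurrent.futures import ThreadPoolExecutor

def infer_batch(model, docs, workers=4):
    """Infer vectors for many tokenized documents using a thread pool.
    `docs` is a list of token lists; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(model.infer_vector, docs))
```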
The `infer_vector` method could also allow a user-supplied starting vector, allowing someone to try an alternate policy like the mean-of-word-vectors (as suggested in #460) or even just the zero vector (which might be just as good as a random starting vector, for this purpose).
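Until such a parameter exists, the mean-of-word-vectors alternative can at least be computed on the user side. A minimal sketch, assuming the model's word vectors live in the same space you want the document vector in; the helper name is made up for illustration:

```python
import numpy as np

def mean_of_word_vectors(model, doc_words):
    """Average the model's word vectors for the known tokens in doc_words;
    returns an all-zero vector if none are known."""
    vecs = [model.wv[w] for w in doc_words if w in model.wv]
    if not vecs:
        return np.zeros(model.vector_size, dtype=np.float32)
    return np.mean(vecs, axis=0)
```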
As requested on the forum at https://groups.google.com/d/msg/gensim/IIumafg4WkA/Ua3FdeFCJQAJ, `infer_vector()` could also allow known doctags for the example text to be supplied. At least in some training modes (or if training code in DBOW mode were slightly altered), this might lead to an inferred vector that better represents the 'residual' meaning not already captured in the other doctag(s).
Madhumathi raised the question of `infer_vector` speed-up on the forum, suggesting simply locking all the pre-trained doc and word vectors and then calling `model.train()`, in order to take advantage of multiple workers and batching.
Multithreading likely wouldn't offer much benefit for the case the method currently supports: a single new document. (Maybe with a really long document, but I doubt it.) It could definitely offer a benefit if inferring in large batches, as mentioned above.
Moving the loop inside the cythonized methods might offer some benefit in the single-document case, especially with large documents and large `steps`.
(Also noted in my forum reply: calling `infer_vector()` from multiple threads, or better yet from multiple processes all loaded with the same model, would also offer speedups.)
Any progress on any of these suggestions? In particular, batching and/or support for the corpus_file format seem desirable. I saw a big performance speedup using a corpus file for training, and at this point bulk training is (potentially much) faster than bulk inference. I assume a lot of this is the lack of batching support and leaving a fairly tight loop in Python.
Note, I parallelized things out of process by loading a saved model into separate Python processes, each responsible for 1/n of the dataset to infer. Obviously that helps a lot. So performance improvements to the core algorithm will have a big, multiplied effect!
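For anyone wanting to reproduce that out-of-process approach, a rough sketch might look like the following (the helper names and model path are made up; each worker loads its own copy of the saved model once):

```python
from multiprocessing import Pool
from gensim.models import Doc2Vec

_model = None  # one model instance per worker process

def _init(model_path):
    # Load the saved model once when each worker starts.
    global _model
    _model = Doc2Vec.load(model_path)

def _infer(doc_words):
    return _model.infer_vector(doc_words)

def infer_in_processes(model_path, docs, processes=4):
    """Split inference of many tokenized documents across worker processes.
    On platforms using 'spawn', call this from under an `if __name__ == "__main__"` guard."""
    with Pool(processes=processes, initializer=_init, initargs=(model_path,)) as pool:
        return pool.map(_infer, docs)
```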
Also, nice work on putting this together and supporting it. Thank you.
So far, this is just a "wishlist" of things that could be tackled to improve `infer_vector()` - there's no specific plan to do any of this, unless/until someone shows up with the need & interest to contribute it.
So... you're saying someone should just make a pull request and send it over?
Also, FWIW, I took a (very cursory) look at fastText (https://github.com/facebookresearch/fastText/). It has comparable model performance (P/R, AUC) on my problem, with similar run-time performance (secs) for training and substantially faster inference performance (with similar levels of concurrency).
Not sure what you mean by FastText's "inference". Are you referring to its supervised-classification mode? (That's not the same algorithm as `Doc2Vec`, & gensim's `FastText` doesn't implement that mode at all.)
But yes, these improvements or others really just need a quality PR.
Facebook Research's fastText implementation has a mode that trains word embeddings and then combines them (one of their papers suggests it's an average) to do sentence embeddings. Certainly a different algorithm, but with some similar intuition (at least to word2vec) from some of the same people (at least Mikolov is a shared author).
I didn't realize gensim has an implementation of fasttext too.
Yes, I believe that's the mode activated with the `-supervised` flag, where the word embeddings are trained to be better at predicting known labels when summed/averaged. Summing vectors will inherently be a lot faster than the iterative inference – simulated training – that's used in `Doc2Vec` (aka 'Paragraph Vectors').
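To make that contrast concrete, here is a rough user-level sketch of the two approaches (plain gensim calls, not the fastText internals; both assume already-trained models):

```python
import numpy as np

def fasttext_style_doc_vector(ft_model, tokens):
    """Cheap 'sentence vector': just average the (subword-aware) word vectors."""
    return np.mean([ft_model.wv[t] for t in tokens], axis=0)

def doc2vec_doc_vector(d2v_model, tokens):
    """Doc2Vec inference: iterative, training-like optimization of one new vector."""
    return d2v_model.infer_vector(tokens)
```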
(I'd like that mode to be considered for gensim's FT implementation – for clear parity with the original implementation, and given its high similarity to other modes. But gensim's focus has been truly unsupervised methods, so thus far @piskvorky hasn't considered that mode to be in-scope. So it'd require his approval & a competent Python/Cython implementation to appear.)
I was searching for gensim perf stuff and found this bug again. This time I'm taking a closer look.
It looks to me like `infer_vector()` uses a code path that calls into the `train_document_*` methods, similar to how `_do_train_job` does. The analog for corpus_file is in `_do_train_epoch`, which calls into the `d2v_train_epoch_*` methods. It seems like `infer_vector` could be extended (or a parallel method like `infer_corpus` or something added) to take a corpus file and call into those same `d2v_train_epoch_*` methods in an analogous way.
Seems like I'd need to set `learn_doctags=False`, `learn_words=False`, `learn_hidden=False`, and prepare some state similar to what's happening in https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/doc2vec.py#L613-L626 (initializing `doctag_vectors`, `doctags_lockf`, etc.).
`work`, `neu1`, and `cython_vocab` all look like they are handled a little differently, though. It seems like these could be handled as in word2vec's code path for `_train_epoch_corpusfile`.
I'm not super familiar with the code base, so before I get any deeper, does this make sense as an approach? That is, create a parallel method `infer_corpus`, follow the analogous path from word2vec's `_train_epoch_corpusfile` to initialize some state each epoch, and call into `d2v_train_epoch_*` with the learn params set to False?
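For concreteness, a pure-Python stand-in for the interface I have in mind might look roughly like this (the `infer_corpus` name is hypothetical; the real version would call the cythonized `d2v_train_epoch_*` functions rather than looping over today's `infer_vector`):

```python
from gensim.models.doc2vec import TaggedLineDocument

def infer_corpus(model, corpus_path):
    """Infer one vector per line of a corpus_file-style text file
    (whitespace-tokenized, one document per line)."""
    return [model.infer_vector(doc.words) for doc in TaggedLineDocument(corpus_path)]
```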
@gerner Is your main goal parallelism for batches? In such a case, most important would be to mimic the classic iterable interface (rather than `corpus_file`).
@gojomo yes, parallelism is part of the goal. Are you suggesting trying to match what's happening in `Doc2Vec._do_train_job`, which gets called when you use `corpus_iterable` and not `corpus_file`? From what I've seen there isn't much parallelism when that happens (although I see all the threading code that should achieve that). I usually see my 8-core CPU around 150% max when training in that path. When using `corpus_file` I see CPU utilization around 750%.
In addition to that, it seems like the `corpus_file` path is better optimized: all the data processing happens in Cython or native code. It seems like if I want to be able to use this in production, or even just for fast research/development iterations, that's what I want to be using.
In the past I've spawned separate python processes to parallelize the inference. That works, but it's somewhat inconvenient, and it seems like corpus_file support for infer_vector won't be that hard to do, which I'm considering implementing.
Yes, since the `corpus_iterable` path is both older & more general, improving it will usually be a higher priority - rather than more special-format/special-purpose paths like the `corpus_file` approach.
The `corpus_iterable` path does still suffer from Python-GIL-related bottlenecks that prevent all-core utilization. (I'm surprised you max out at 1.5 cores utilized - it should be possible to get higher. But yes, `corpus_file` offers a simpler and more complete path to near-all-cores saturation.)
My hope, though, is that the `corpus_iterable` path could be raised to parity with `corpus_file` utilization, probably through an approach like that suggested in this comment (and surrounding related comments). It is overwhelmingly the avoidance of GIL contention and of waiting on a single IO thread, and not other optimizations, that gives the `corpus_file` path its current advantage. But that advantage comes at the cost of redundant training logic across the two paths, and of maintaining/documenting/debugging both paths. (Mode-specific bugs, like the potential `corpus_file` issues in #2757 & #2693, and mode-specific limitations, like how `corpus_file` `Doc2Vec` training can only use plain-integer tags, one per document, are especially frustrating.)
Any update to speed up infer vector call?
The code in the `gensim-4.0.1` release is the same as the current `develop` trunk, and no optimization work is currently underway. A theoretical future update to the interface to allow batches of docs to be inferred together might give somewhat of a speedup, for users who have large batches of new documents.
Barring that work, calling `.infer_vector()` from multiple threads or processes (with shared-memory models) might help users achieve the highest throughput.
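For the multi-process route, one sketch of sharing the model's memory is to load it with memory-mapping in each worker, so the OS can share the big read-only arrays between processes rather than duplicating them. This assumes the large arrays were saved to separate .npy files (which gensim's save does for big models); the path and token list below are placeholders:

```python
from gensim.models import Doc2Vec

# In each worker process: map the large vector arrays read-only so their
# memory pages can be shared across processes by the operating system.
model = Doc2Vec.load("d2v.model", mmap="r")
vector = model.infer_vector(["some", "tokenized", "document"])
```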
Batch processing is good for inferring multiple vectors. Is there any plan to improve inference for a single long document?
Inference uses the same code as training, and I don't know of any pending unimplemented ideas for making that code run faster at the level of individual texts. So there are no such plans currently, and any such plans would require a theory for how the code might be faster.