misc ways to improve infer_vector

Open gojomo opened this issue 9 years ago • 18 comments

The default steps should probably be higher: perhaps 10, or the same as the training iter value.

If there are no known-tokens in the supplied doc_words, the method will return the randomly-initialized, low-magnitude, untrained starting vector. (There were no target-words to predict and thus inference was a no-op, except for that random initialization.) This is probably not what the user wants, and the realistic-looking values may fool the user into thinking some work has happened. It would probably be better to return some constant (perhaps zero) vector in such cases, to be clear that all such text-runs with unknown words are similarly opaque, as far as the model is concerned.
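A minimal sketch of the zero-vector fallback described above, written as a wrapper rather than a change to gensim itself. The helper name `infer_or_zero` is hypothetical, and the sketch assumes a gensim 4.x style model exposing `wv.key_to_index`, `vector_size` and `infer_vector()`. It returns a plain Python list for the fallback to stay dependency-free; real code would more likely return a NumPy array of the same dtype as the inferred vectors.

```python
def infer_or_zero(model, doc_words, **infer_kwargs):
    """Infer a vector, or return an all-zero vector if no token is known.

    `model` is assumed to behave like a gensim 4.x Doc2Vec: it exposes
    `wv.key_to_index`, `vector_size` and `infer_vector()`.
    """
    has_known_token = any(word in model.wv.key_to_index for word in doc_words)
    if not has_known_token:
        # No target words to predict: inference would be a no-op that
        # returns the random starting vector, so return a constant instead.
        return [0.0] * model.vector_size
    # Extra keyword arguments (for example `epochs`) are passed through as-is.
    return model.infer_vector(doc_words, **infer_kwargs)
```

With this wrapper, every text made up only of unknown words maps to the same vector, so such cases are easy to detect downstream.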

The inference loop might be able to make use of #450 word/example batching (especially if it is extended to decay alpha), to turn all steps into one native call.

The method could take many examples, to infer a chunk of examples in one call. At the extreme, it could also make use of multithreading so such a large chunk finishes more quickly.

gojomo avatar Nov 09 '15 00:11 gojomo

The infer_vector method could also allow a user-supplied starting vector, allowing someone to try an alternate policy like the mean-of-word-vectors (as suggested in #460) or even just the zero-vector (which might be just as good as a random starting vector, for this purpose).
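As an illustration of the mean-of-word-vectors policy, the sketch below only computes a candidate starting vector; it does not pass it to `infer_vector()`, since a starting-vector parameter is the proposal here and not an existing feature. The helper name `mean_word_vector` is hypothetical, and the model is assumed to support `word in model.wv.key_to_index` and `model.wv[word]`, as gensim 4.x models do.

```python
def mean_word_vector(model, doc_words):
    """Average the vectors of the known words in doc_words.

    Returns None when none of the words is known to the model.
    """
    vectors = [model.wv[word] for word in doc_words
               if word in model.wv.key_to_index]
    if not vectors:
        return None
    count = len(vectors)
    # Column-wise mean, written without NumPy to keep the sketch
    # dependency-free; zip(*vectors) groups values by dimension.
    return [sum(column) / count for column in zip(*vectors)]
```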

gojomo avatar Jan 10 '16 09:01 gojomo

As requested on the forum, infer_vector() could also allow known-doctags for the example text to be supplied. At least in some training modes (or if training code in DBOW mode were slightly altered), this might lead to an inferred vector that better represents the 'residual' meaning not already captured in the other doctag(s).

gojomo avatar Apr 13 '16 23:04 gojomo

Madhumathi raised the question of infer_vector speed-up on the forum

He is suggesting simply locking all the pre-trained doc and word-vectors and then calling model.train() in order to take advantage of multiple workers and batching.

tmylk avatar Sep 22 '16 13:09 tmylk

Multithreading likely wouldn't offer much benefit for the case the method currently supports: a single new document. (Maybe with a really long document, but I doubt it.) It could definitely offer a benefit if inferring in large batches, as mentioned above.

Moving the loop inside the cythonized-methods might offer some benefit in the single-document case, especially with large documents and large steps.

(Also noted in my forum reply: that calling infer_vector() from multiple threads, or better yet multiple processes all loaded with the same model, would also offer speedups.)

gojomo avatar Sep 22 '16 17:09 gojomo

Any progress on any of these suggestions? In particular, batching and/or support for corpusfile format seem desirable. I saw a big performance speedup using a corpus file for training and at this point bulk training is (potentially much) faster than bulk inference. I assume a lot of this is lack of batching support and leaving a fairly tight loop in python.

Note, I parallelized things out of process by loading a saved model into separate python processes, each responsible for 1/n of the dataset to infer. Obviously that helps a lot. So performance improvements to the core algo will have a big, multiplied effect!
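One way this out-of-process approach could look, as a rough sketch and not the exact setup used here: each worker process loads the saved model itself and infers its share of the documents. All helper names are hypothetical; `Doc2Vec.load()` and `infer_vector()` are the only gensim calls used, and `docs` is assumed to be a list of token lists.

```python
from concurrent.futures import ProcessPoolExecutor


def split_round_robin(docs, n_parts):
    """Split docs into n_parts lists; part i gets docs i, i+n_parts, ..."""
    return [docs[i::n_parts] for i in range(n_parts)]


def merge_round_robin(parts, total):
    """Undo split_round_robin so results line up with the input order."""
    merged = [None] * total
    n_parts = len(parts)
    for i, part in enumerate(parts):
        merged[i::n_parts] = part
    return merged


def infer_chunk(args):
    """Worker: load the saved model in this process and infer one chunk."""
    model_path, chunk = args
    # Imported here so that each worker process loads gensim and the model.
    from gensim.models.doc2vec import Doc2Vec
    model = Doc2Vec.load(model_path)
    return [model.infer_vector(words) for words in chunk]


def infer_in_processes(model_path, docs, n_procs=4):
    """Infer one vector per doc using n_procs worker processes."""
    chunks = split_round_robin(docs, n_procs)
    with ProcessPoolExecutor(max_workers=n_procs) as pool:
        parts = list(pool.map(infer_chunk, [(model_path, c) for c in chunks]))
    return merge_round_robin(parts, len(docs))
```

On platforms that start workers with "spawn" (Windows, macOS), the call to `infer_in_processes()` needs to sit under an `if __name__ == "__main__":` guard. Each worker pays the model-loading cost once per chunk, so this only makes sense for large batches.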

Also, nice work on putting this together and supporting it. Thank you.

gerner avatar Feb 21 '20 19:02 gerner

So far, this is just a "wishlist" of things that could be tackled to improve infer_vector() - there's no specific plan to do any of this, unless/until someone shows up with the need & interest to contribute it.

gojomo avatar Feb 23 '20 22:02 gojomo

so... you're saying someone should just make a pull request and send it over?

also, FWIW, I took a (very cursory) look at fasttext: it has comparable model performance (P/R, AUC) on my problem with similar run-time performance (secs) for training and substantially faster inference performance (with similar levels of concurrency).

gerner avatar Feb 24 '20 19:02 gerner

Not sure what you mean by FastText's "inference". Are you referring to its supervised-classification mode? (That's not the same algorithm as Doc2Vec, & gensim's FastText doesn't implement that mode at all.)

But yes, these improvements or others really just need a quality PR.

gojomo avatar Feb 24 '20 22:02 gojomo

Facebook Research's fastText implementation has a mode that trains word embeddings and then combines them (one of their papers suggests it's an average) to do sentence embeddings. Certainly a different algorithm. But some similar intuition (at least to word2vec) from similar people (at least Mikolov is a shared author).

I didn't realize gensim has an implementation of fasttext too.

gerner avatar Feb 24 '20 23:02 gerner

Yes, I believe that's the mode activated with the -supervised flag, where the word-embeddings are trained to be better at predicting known-labels, when summed/averaged. Summing vectors will inherently be a lot faster than the iterative inference – simulated training – that's used in Doc2Vec (aka 'Paragraph Vectors').

(I'd like that mode to be considered for gensim's FT implementation – for clear parity with the original implementation, and given its high similarity to other modes. But gensim's focus has been truly unsupervised methods, so thus far @piskvorky hasn't considered that mode to be in-scope. So it'd require his approval & a competent Python/Cython implementation to appear.)

gojomo avatar Feb 25 '20 00:02 gojomo

I was searching for gensim perf stuff and found this bug again. This time I'm taking a closer look.

It looks to me like infer_vector() uses a code path that calls into train_document_* methods, similar to how _do_train_job does. The analog for corpus file is in _do_train_epoch which calls into d2v_train_epoch_* methods. It seems like infer_vector could be extended (or a parallel method like infer_corpus or something) to take a corpus file and call into those same d2v_train_epoch_* methods in an analogous way.

Seems like I'd need to set learn_doctags=False, learn_words=False, learn_hidden=False, and prepare some state similar to what's happening in (initializing doctag_vectors, doctags_lockf, etc.)

work, neu1, and cython_vocab all look like they are handled a little differently though. It seems like these could be handled as in word2vec's code path for _train_epoch_corpusfile.

I'm not super familiar with the code base, so before I get any deeper, does this make sense as an approach? That is, create a parallel method infer_corpus, follow the analogous path from word2vec's code path for _train_epoch_corpusfile to initialize some state each epoch and call into d2v_train_epoch_* setting the learn params false?

gerner avatar Aug 20 '20 21:08 gerner

@gerner Is your main goal parallelism for batches? In such a case, most important would be to mimic the classic iterable-interface (rather than corpus_file).

gojomo avatar Aug 21 '20 01:08 gojomo

@gojomo yes, parallelism is part of the goal. Are you suggesting trying to match what's happening in Doc2Vec._do_train_job, which gets called when you set corpus_iterable and not corpus_file? From what I've seen there isn't much parallelism when that happens (although I see all the threading code that should achieve that). I usually see my 8-core CPU around 150% max when training in that path. When using corpus_file I see CPU utilization around 750%.

In addition to that, it seems like the corpus_file path is better optimized. All the data processing happens in cython or native code. It seems like if I want to be able to use this in production, or even just for fast research/development iterations, that's what I want to be using.

In the past I've spawned separate python processes to parallelize the inference. That works, but it's somewhat inconvenient, and it seems like corpus_file support for infer_vector won't be that hard to do, which I'm considering implementing.

gerner avatar Aug 21 '20 17:08 gerner

Yes, since the corpus_iterable path is both older & more general, improving it will usually be a higher priority - rather than more special-format/special-purpose paths like the corpus_file approach.

The corpus_iterable path does still suffer from Python-GIL-related bottlenecks that prevent all-core-utilization. (I'm surprised you max at 1.5 cores utilized - it should be possible to get higher. But yes, corpus_file offers a simpler and more complete path to near-all-cores saturated.)

My hope, though, is that the corpus_iterable path could be raised to parity with corpus_file utilization, probably through an approach like that suggested in this comment (and surrounding related comments). It is overwhelmingly the avoidance of GIL-contention and waiting on a single IO thread, and not other optimizations, that give the corpus_file path its current advantage. But, that advantage comes at the cost of redundant training logic across the two paths, and maintaining/documenting/debugging both paths. (Mode-specific bugs, like the potential corpus_file issues in #2757 & #2693, and mode-specific limitations, like how corpus_file Doc2Vec training can only use plain-integer tags, one per document, are especially frustrating.)

gojomo avatar Aug 21 '20 23:08 gojomo

Any update to speed up infer vector call?

mohzhang avatar Jul 29 '21 22:07 mohzhang

> Any update to speed up infer vector call?

The code in the gensim-4.0.1 release is the same as the current develop trunk, and no optimization work is currently underway. A theoretical future update to the interface to allow batches of docs to be inferred together might give somewhat of a speedup, for users who have large batches of new documents.

Barring that work, calling .infer_vector() from multiple threads or processes (with shared-memory models) might help users achieve the highest throughput.
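A minimal sketch of the multi-threaded variant, assuming (as this comment implies) that `infer_vector()` can be called concurrently on a model that is no longer being trained. The helper name `infer_with_threads` is hypothetical; `model` is any object with an `infer_vector(doc_words)` method, such as a loaded Doc2Vec model.

```python
from concurrent.futures import ThreadPoolExecutor


def infer_with_threads(model, docs, n_threads=4):
    """Call model.infer_vector() on each doc from a pool of threads.

    `docs` is an iterable of token lists. Results are returned in the
    same order as the input documents.
    """
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # pool.map preserves input order even if calls finish out of order.
        return list(pool.map(model.infer_vector, docs))
```

How much this helps depends on how much of each call runs outside the Python GIL, so separate processes may still give higher throughput.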

gojomo avatar Jul 30 '21 02:07 gojomo

> The code in the gensim-4.0.1 release is the same as the current develop trunk, and no optimization work is currently underway. A theoretical future update to the interface to allow batches of docs to be inferred together might give somewhat of a speedup, for users who have large batches of new documents.
>
> Barring that work, calling .infer_vector() from multiple threads or processes (with shared-memory models) might help users achieve the highest throughput.

Batch processing is good for inferring multiple vectors. Is there any plan to improve inference for a single long document?

mohzhang avatar Jul 30 '21 22:07 mohzhang

> Batch process is good for inferring multiple vectors. Is there any plan to improve single long document?

Inference uses the same code as training, and I don't know any pending unimplemented ideas for making that code run any faster at the level of individual texts. So no such plans currently, and any such plans would require a theory for how the code might be faster.

gojomo avatar Aug 02 '21 14:08 gojomo