Radim Řehůřek comments

Results: 321 comments by Radim Řehůřek

word2vec vocab building is not multi-threaded

Can you say a little more about your use case? What is the performance now, what performance do you need, and why? We can certainly optimize this part if there's...

word2vec vocab building is not multi-threaded

Thread affinity is not related to GIL. I'm surprised you saw an improvement with multithreading at all -- do you have any numbers? Anyway, yes, parallelizing the vocab building sounds...

word2vec vocab building is not multi-threaded

`build_vocab` is still pure Python = slow. Essentially it does this:

```python
vocab = defaultdict(int)
sentences = [user-supplied iterable or LineSentence(text_file_path)]
for sentence in sentences:
    for word in sentence:
        vocab[word]...
```
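The snippet above is cut off, so here is a minimal runnable sketch of the same counting loop. It assumes the elided statement is a plain `+= 1` increment; `count_words` and `toy_corpus` are hypothetical names used only for illustration, and this is not Gensim's actual implementation.

```python
from collections import defaultdict


def count_words(sentences):
    """Count raw word frequencies in one pass over an iterable of token lists."""
    vocab = defaultdict(int)
    for sentence in sentences:
        for word in sentence:
            # Assumed increment; the original comment is truncated at this point.
            vocab[word] += 1
    return vocab


# Toy corpus standing in for the user-supplied iterable.
toy_corpus = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]
counts = count_words(toy_corpus)
```

Every word costs one dictionary update in interpreted Python, which is where the time goes on a large corpus.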

word2vec vocab building is not multi-threaded

Implement which part, what exactly? IMO the word counting logic is so trivial that any sort of fine-grained thread synchronization (locks) would do more harm than good. The Python overhead...
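For contrast with fine-grained locking, a coarse-grained layout might look like the sketch below: each worker counts its own chunk with no shared state, and the partial counts are merged once at the end. This is an illustration under assumptions, not something proposed in the comment; `count_chunk`, `count_in_chunks`, and the chunking itself are hypothetical. Because of the GIL, threads are unlikely to make this pure-Python loop faster, so the sketch shows the structure only.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_chunk(chunk):
    """Count words in one chunk of sentences; touches no shared state."""
    counts = Counter()
    for sentence in chunk:
        counts.update(sentence)
    return counts


def count_in_chunks(chunks, workers=2):
    """Count each chunk independently, then merge the partial counts once."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_chunk, chunks):
            # The merge runs in the calling thread, so no lock is needed.
            total.update(partial)
    return total


# Two toy chunks, each a list of tokenized sentences.
toy_chunks = [[["a", "b"], ["a"]], [["b", "c"], ["a", "c"]]]
merged = count_in_chunks(toy_chunks)
```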

word2vec vocab building is not multi-threaded

I'd expect that (queuing objects with multiprocessing) to be significantly *slower* than the current single-threaded approach. But let me know what the numbers say.

word2vec vocab building is not multi-threaded

Yes please – implement your changes and then open a pull request.

word2vec vocab building is not multi-threaded

Yes, if you're able to "squeeze in" the word counting into the same process that generates the data, that will avoid the extra corpus loop altogether. Basically amortize the data...
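One way to picture this is a generator that tallies words while it produces the sentences, so no separate counting pass over the corpus is needed. This is a rough sketch under assumptions: `generate_and_count` is a hypothetical helper, and the lowercase-and-split tokenization merely stands in for whatever preprocessing actually generates the data.

```python
from collections import Counter


def generate_and_count(raw_lines, word_counts):
    """Yield tokenized sentences, updating word_counts as a side effect."""
    for line in raw_lines:
        tokens = line.lower().split()  # placeholder preprocessing step
        word_counts.update(tokens)     # counting rides along with generation
        yield tokens


word_counts = Counter()
raw_lines = ["The cat sat", "the dog sat down"]
sentences = list(generate_and_count(raw_lines, word_counts))
```

By the time the generator is exhausted, `word_counts` already holds the frequencies that a separate vocabulary scan would otherwise have to compute.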

word2vec vocab building is not multi-threaded

Sure – Phrases is also a part of preprocessing. So conceptually, phrase detection belongs to the same computational step.

Cannot install gensim with pip 23.1

@kennysong sorry, old versions of Gensim are no longer supported. Any reason why you're not using the latest Gensim (v4.3.1)?

Cannot install gensim with pip 23.1

What's the problem with those "old models from 2022", specifically? The model migration (if needed at all) should be [cosmetic](https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4). There have been many optimizations and fixes since 3.8.3, so...