Radim Řehůřek
Can you say a little more about your use case? What is the performance now, what performance do you need, and why? We can certainly optimize this part if there's...
Thread affinity is not related to GIL. I'm surprised you saw an improvement with multithreading at all -- do you have any numbers? Anyway, yes, parallelizing the vocab building sounds...
`build_vocab` is still pure Python = slow. Essentially it does this:

```python
vocab = defaultdict(int)
sentences = [user-supplied iterable or LineSentence(text_file_path)]
for sentence in sentences:
    for word in sentence:
        vocab[word]...
```
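For reference, here is a minimal runnable sketch of that kind of counting loop. It is not Gensim's actual code: the function name `count_words` and the toy corpus are made up for illustration, and the `+= 1` increment is an assumption about what the truncated snippet above goes on to do.

```python
from collections import defaultdict


def count_words(sentences):
    """Count raw word frequencies over an iterable of tokenized sentences.

    Hypothetical helper for illustration only, not part of Gensim.
    """
    vocab = defaultdict(int)
    for sentence in sentences:
        for word in sentence:
            # Assumed increment: one dict update per token, all in Python.
            vocab[word] += 1
    return vocab


# Toy corpus standing in for a user-supplied iterable of sentences.
sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab = count_words(sentences)
```

Every token costs one interpreter-level dict update, which is why a loop like this is dominated by Python overhead rather than by anything that fine-grained locking could speed up.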
Implement which part, what exactly? IMO the word counting logic is so trivial that any sort of fine-grained thread synchronization (locks) would do more harm than good. The Python overhead...
I'd expect that (queuing objects with multiprocessing) to be significantly *slower* than the current single-threaded approach. But let me know what the numbers say.
Yes please – implement your changes and then open a pull request.
Yes if you're able to "squeeze in" the word counting into the same process that generates the data, that will avoid the extra corpus loop altogether. Basically amortize the data...
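A rough sketch of what "squeezing in" the counting could look like, assuming the data is generated by a plain Python preprocessing step. The generator `tokenize_and_count` and the whitespace tokenizer are hypothetical stand-ins, not Gensim APIs.

```python
from collections import Counter


def tokenize_and_count(lines, counts):
    """Yield tokenized sentences while updating `counts` in the same pass.

    Hypothetical preprocessing step: the lowercase/whitespace tokenizer is
    only a placeholder for whatever actually produces the corpus.
    """
    for line in lines:
        tokens = line.lower().split()
        counts.update(tokens)  # word counting piggybacks on preprocessing
        yield tokens


raw_lines = ["The cat sat", "the dog sat"]
counts = Counter()
# Consuming the generator (e.g. writing tokens to disk) fills `counts` too,
# so no separate vocabulary-scanning loop over the corpus is needed later.
prepared = list(tokenize_and_count(raw_lines, counts))
```

The resulting frequencies could then be handed to the model directly instead of rescanning the corpus. `Word2Vec.build_vocab_from_freq` should accept such a word-to-count mapping, but check that against the documentation for your Gensim version.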
Sure – Phrases is also a part of preprocessing. So conceptually, phrase detection belongs to the same computational step.
@kennysong sorry, old versions of Gensim are no longer supported. Any reason why you're not using the latest Gensim = v4.3.1?
What's the problem with those "old models from 2022", specifically? The model migration (if needed at all) should be [cosmetic](https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4). There have been many optimizations and fixes since 3.8.3, so...