word2vec vocab building is not multi-threaded

Open khogeland opened this issue 9 years ago • 24 comments

I've been running a couple billion words through word2vec, using a 32-core EC2 instance. Actually training the model is blazing fast (relatively), but populating the vocab dict is painfully slow.

I forked and added multi-threading, but max_vocab_size pruning and progress logging were difficult to implement (you can't prune the vocab list of a single worker, since there could be more instances of a word in a different worker's sentences, and pruning after all the vocab counts have been merged defeats the purpose entirely).

Because I had to disable those two things, I'm not going to submit a pull request. Here's what I changed, perhaps @piskvorky or someone with more time could get vocab size pruning working with multithreading?

https://github.com/btym/gensim/blob/develop/gensim/models/word2vec.py#L480

khogeland avatar Jul 14 '15 22:07 khogeland

Can you say a little more about your use case?

What is the performance now, what performance do you need, and why?

We can certainly optimize this part if there's a compelling reason to :)

Note: threading doesn't generally help much with Python, because of the global interpreter lock (GIL). And it looks like you're putting all sentences into plain lists, in RAM, which wouldn't scale (no streaming).

piskvorky avatar Jul 15 '15 08:07 piskvorky

It's going through a somewhat large history of GitHub events (the past 7 months, ~1.8 billion words). Memory is not an issue in my case, but with your implementation, building the vocab list is almost as slow as training the network, since it only runs on one core out of 32 (to be clear, threads will run properly on separate cores if core affinity is set correctly, but what I added could have been done just as easily with multiprocessing).

I realize the way I'm doing it is specific to my use case (not streaming, no max vocab size), but there's no reason for vocab building not to be multi-threaded/multi-process, since the current approach is such a huge performance loss. Streaming could be solved by having one iterator shared between threads and having the iterator handle locking. Not sure how max vocab size would work.
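A minimal sketch of that shared-iterator idea (illustrative only, not gensim code; LockedIterator is a hypothetical name):

import threading

class LockedIterator:
    """Wrap a sentence stream so multiple worker threads can safely share it."""
    def __init__(self, iterable):
        self._it = iter(iterable)
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        # Only one worker advances the underlying stream at a time.
        with self._lock:
            return next(self._it)

Each worker would then loop over the shared instance, count words into its own local dict, and the per-worker dicts would be merged at the end.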

khogeland avatar Jul 15 '15 08:07 khogeland

Thread affinity is not related to GIL. I'm surprised you saw an improvement with multithreading at all -- do you have any numbers?

Anyway, yes, parallelizing the vocab building sounds useful. If it clashes with the max vocab size pruning, I'd rather drop that and keep the parallel version. The parallelization optimization scenario sounds more commonly useful than pruning... although it brings more maintenance headaches (multiprocessing works differently on Windows). Do you have a way of testing on Windows?

If you find a way to build the vocab in parallel, while streaming the input (not everything in RAM), and being robust across platforms, can you please open a PR @btym?

For inspiration, see the actual training a few lines lower, which is also streamed and parallelized.

piskvorky avatar Jul 15 '15 11:07 piskvorky

I may have been mistaken and read the wrong thing; I only tried the changes on a small set, since I was doing this while the other model was still training :) Should be able to switch to multiprocessing.

I can test on Windows/OSX/Linux conveniently. Will try to submit a PR if I have time this week/weekend.

khogeland avatar Jul 15 '15 16:07 khogeland

Hey guys,

Been following this issue as this is also a big concern for me. Training on 2B+ word sets takes a whole lot of time, but most of it is actually spent scanning the vocabulary.

Based on @piskvorky's suggestion, I went ahead and added multiprocessing support to word2vec's scan_vocab, removing max vocab pruning as it was much more complicated to deal with.

I haven't run any quantitative benchmarks yet, but I'm planning to do so tomorrow to compare scaling on many cores and with non-trivial LineSentence implementations.

For now the only hiccup is that my implementation uses Counter() to easily merge each worker's vocab, and that was only introduced in Python 2.7. Have a look at PR #406 and tell me what you think!
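The merge idea is roughly the following (a condensed sketch, not the actual PR #406 code; sentence_chunks is a hypothetical pre-split corpus):

import multiprocessing
from collections import Counter

def count_chunk(sentences):
    # Count words in one worker's share of the corpus.
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence)
    return counts

def parallel_scan_vocab(sentence_chunks, processes=4):
    # sentence_chunks: a list of sentence lists, one per worker.
    with multiprocessing.Pool(processes) as pool:
        partial_counts = pool.map(count_chunk, sentence_chunks)
    vocab = Counter()
    for part in partial_counts:
        vocab.update(part)  # summing Counters merges the per-worker vocabs
    return vocab

On Windows this would need the usual `if __name__ == '__main__':` guard around the Pool setup.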

fortiema avatar Jul 21 '15 13:07 fortiema

I need this feature too. I have about 500GB of training data in a directory, consisting of about 500 files. It would be helpful if multiple files could be scanned at the same time during the vocabulary-building step.

Huarong avatar Jul 31 '15 01:07 Huarong

I have some Cython code from spaCy that is helpful here:

  • The spacy.strings.hash_string function uses Murmurhash to produce a 64-bit integer key
  • The preshed library has a very fast and very memory efficient 64-bit to 64-bit hash table, implemented using open addressing and linear probing. The table is slightly faster than Google's dense_hash_map for my use cases.

The strategy is to have two tables: an int-to-int table that stores the counts and only knows the hashes of the strings, and a Python table mapping the hashes to the strings. The Python table is only updated when we cross the minimum frequency threshold.

I wrote this up very quickly, but so far the performance seems much, much faster than scan_vocab, with much better memory use. I'm up to 1.5B words after running for 10 minutes.

import io
import os

from preshed.counter import PreshCounter
from spacy.strings import hash_string


def iter_dir(directory):
    # Helper assumed by the original snippet: yield the path of every file under `directory`.
    for filename in os.listdir(directory):
        yield os.path.join(directory, filename)


class Corpus(object):
    def __init__(self, directory, min_freq=10):
        self.directory = directory
        self.counts = PreshCounter()
        self.strings = {}
        self.min_freq = min_freq

    def count_doc(self, words):
        # Get counts for this document
        doc_counts = PreshCounter()
        doc_strings = {}
        for word in words:
            key = hash_string(word)
            doc_counts.inc(key, 1)
            doc_strings[key] = word

        n = 0
        for key, count in doc_counts:
            self.counts.inc(key, count)
            # TODO: Why doesn't inc return this? =/
            corpus_count = self.counts[key]
            # Remember the string when we exceed min count
            if corpus_count >= self.min_freq and (corpus_count - count) < self.min_freq:
                self.strings[key] = doc_strings[key]
            n += count
        return n

    def __iter__(self):
        for text_loc in iter_dir(self.directory):
            with io.open(text_loc, 'r', encoding='utf8') as file_:
                for sent_str in file_:
                    yield sent_str.split()
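(A hypothetical usage sketch of the class above:)

corpus = Corpus('/path/to/texts', min_freq=10)
total_words = 0
for words in corpus:
    total_words += corpus.count_doc(words)
print(total_words, 'tokens seen;', len(corpus.strings), 'distinct words at or above min_freq')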

honnibal avatar Dec 05 '15 10:12 honnibal

Last night I realized that the dictionary building itself is very fast (faster than the variant with PreshCounter). The slow part is dictionary pruning (the procedure of removing words with a count of 1). If the number of unique tokens grows too fast, the dictionary is pruned on nearly every doc. To overcome this, it's better to remove all numbers, URLs and other special tokens from the corpus, e.g. by replacing them with placeholders like NUMBER, DATE, etc. Having preprocessed Russian Wikipedia like this, I got the dictionary built in 5 minutes (1.8B words, ~6GB of preprocessed text).

I would suggest mentioning this in the documentation explicitly.

windj007 avatar Jun 13 '16 07:06 windj007

@windj007 Do you happen to have example code for replacing tokens with placeholders? I assume it's some simple regexes. It would be great to add it to the documentation.

tmylk avatar Oct 05 '16 13:10 tmylk

For Wikipedia, a regex like this one worked for me:

import re

BAD_TOKENS_RE = re.compile(r'''^(https?://.*|[()',";?=\.#=:0-9/\-%«»*—|]+|.*\.(jpg|png|gif|svg)|.)$''')

It matches tokens such as URLs, special symbols and filenames.

At the time I wrote my previous comment, I had a set of regexes and actually replaced the tokens with placeholders. Since then, I've merged all the regexes into this one and just drop the matched tokens.
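Dropping the matched tokens before feeding sentences to Word2Vec could look like this (a sketch; clean_sentences is a hypothetical helper):

def clean_sentences(sentences):
    # Filter out URLs, numbers, filenames etc. so the raw vocabulary stays small
    # and build_vocab prunes far less often.
    for sentence in sentences:
        yield [token for token in sentence if not BAD_TOKENS_RE.match(token)]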

windj007 avatar Oct 05 '16 13:10 windj007

(Quoting @windj007's earlier comment above about dictionary pruning.)

Have you used the above code?

gauravkoradiya avatar Aug 28 '19 13:08 gauravkoradiya

Has there been any progress on build_vocab() in the 4.0 release? I just started with a 105B-token file and I'm planning to train a few versions of embedding models. It seems like build_vocab is rather more expensive than the actual training.

I'm setting the following variables related to this function:

MAX_VOCAB_SIZE  = None
TRIM_RULE       = None
SORTED_VOCAB    = 1
BATCH_WORDS     = 10000

Any suggestions to speed this up? I'm not sure how I should use Honnibal's code.
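(For reference, a sketch of how those settings map onto the gensim 4.x constructor; sentences_iterable is a placeholder for the actual corpus:)

from gensim.models import Word2Vec

model = Word2Vec(
    sentences=sentences_iterable,
    max_vocab_size=None,  # MAX_VOCAB_SIZE: never prune the raw vocab during the scan
    trim_rule=None,       # TRIM_RULE: keep the default min_count-based trimming
    sorted_vocab=1,       # SORTED_VOCAB: sort words by descending frequency
    batch_words=10000,    # BATCH_WORDS: words per job passed to worker threads
)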

spate141 avatar Apr 28 '21 02:04 spate141

build_vocab is still pure Python = slow. Essentially it does this:

from collections import defaultdict

vocab = defaultdict(int)
# sentences = user-supplied iterable, or LineSentence(text_file_path)
for sentence in sentences:
    for word in sentence:
        vocab[word] += 1

I see two avenues to speed up build_vocab:

  1. Optimize LineSentence or the loop above with Cython / C. Especially for the corpus_file code path, where we know the input is a file on a local disk (rather than any iterable or remote data stream), we should be able to do much better.

    Plus, we can now use Bounter (https://github.com/RaRe-Technologies/bounter) for fast in-memory counting, so we don't have to care about the vocab pruning at all.

  2. In case of corpus_file, we could also parallelize build_vocab. Different threads/processes could jump to scan different places in the underlying local text file – something we cannot do with a generic iterable.

If anyone's interested in implementing either, let me know.
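For the second avenue, offset-partitioned scanning of a local corpus_file might look roughly like this (a sketch, not gensim code, assuming a plain utf8 text file with one sentence per line):

import os
from collections import Counter
from multiprocessing import Pool

def count_range(args):
    path, start, end = args
    counts = Counter()
    with open(path, 'rb') as f:
        if start > 0:
            f.seek(start - 1)
            f.readline()  # skip to the next full line; the previous chunk owns the split line
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            counts.update(line.decode('utf8', 'ignore').split())
    return counts

def parallel_count(path, workers=4):
    size = os.path.getsize(path)
    step = size // workers
    ranges = [(path, i * step, size if i == workers - 1 else (i + 1) * step)
              for i in range(workers)]
    with Pool(workers) as pool:  # on Windows, run this under `if __name__ == '__main__':`
        parts = pool.map(count_range, ranges)
    total = Counter()
    for part in parts:
        total.update(part)  # sum the per-worker counts
    return total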

piskvorky avatar Apr 28 '21 06:04 piskvorky

(Quoting @piskvorky's comment above in full.)

Hello! I need this and I'm willing to implement it.

I actually have a class that extends an iterator and streams a list of papers from an internal VPN resource, which I tokenize and pass to gensim... but build_vocab is taking ages.

My iterator is parallelized, meaning there is a queue the iterator reads sentences from, with multiple threads writing to it.

I would like to parallelize build_vocab too. Please let me know the details and how to proceed.

Thanks Danilo

ciropom avatar Mar 14 '23 06:03 ciropom

Implement which part, what exactly?

IMO the word counting logic is so trivial that any sort of fine-grained thread synchronization (locks) would do more harm than good. The Python overhead is too large.

So, a coarse grained approach looks more promising.

I mean, the proof will be in the benchmark numbers, but for the extra complexity to be worthwhile, we would need a speedup of whole multiples (2x, 3x… not 20% or 30%). Do you think you can do it?

piskvorky avatar Mar 14 '23 08:03 piskvorky

I thought you already had a plan, but anyway, this is my proposal:

  • define a read_queue
  • build_vocab will do:
sentences = [user-supplied iterable or LineSentence(text_file_path)]
for sentence in sentences:
    read_queue.put(sentence)
  • spawn multiple Processes that read from read_queue, each with its own local vocab
  • every SYNC_VOCAB words processed, each process will global_vocab_queue.put(json.dumps(vocab)); vocab = {}
  • another Process awaits the local vocabularies from global_vocab_queue and merges them into a global vocabulary, summing counts where needed.

At the end of the process, the global vocab is the actual vocab; the local vocabs are lost once the worker Processes shut down.
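A condensed sketch of that design (illustrative only; it ships Counter objects through the queue instead of json.dumps, and the SYNC_VOCAB threshold is a made-up number):

from collections import Counter

SYNC_VOCAB = 1_000_000  # words counted locally before a partial vocab is shipped

def count_worker(read_queue, global_vocab_queue):
    vocab, seen = Counter(), 0
    for sentence in iter(read_queue.get, None):  # None is the shutdown sentinel
        vocab.update(sentence)
        seen += len(sentence)
        if seen >= SYNC_VOCAB:
            global_vocab_queue.put(vocab)
            vocab, seen = Counter(), 0
    global_vocab_queue.put(vocab)  # flush whatever is left
    global_vocab_queue.put(None)   # tell the merger this worker is done

def merge_worker(global_vocab_queue, n_workers, result_queue):
    global_vocab, finished = Counter(), 0
    while finished < n_workers:
        part = global_vocab_queue.get()
        if part is None:
            finished += 1
        else:
            global_vocab.update(part)  # sum counts across workers
    result_queue.put(global_vocab)

The main process would feed sentences onto read_queue and put one None per counting worker once the corpus is exhausted.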

ciropom avatar Mar 14 '23 09:03 ciropom

I'd expect that (queuing objects with multiprocessing) to be significantly slower than the current single-threaded approach.

But let me know what the numbers say.

piskvorky avatar Mar 14 '23 09:03 piskvorky

OK. Practically, should I fork the project and open a merge request afterwards?

ciropom avatar Mar 14 '23 09:03 ciropom

Yes please – implement your changes and then open a pull request.

piskvorky avatar Mar 14 '23 09:03 piskvorky

I ran some tests and realized it should be even faster if I build the vocabulary myself, since I'm already parallelizing the work to pre-process the text.

I dug into the code and found the life-saving function build_vocab_from_freq, which lets me supply a vocabulary I prepared myself.

This solves the issue in my case: before, build_vocab was the bottleneck, with a single process slowing down the many pre-processing processes. Now I can spawn 50 parallel processes and all of them sit at 100% CPU usage 100% of the time.
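In outline, that could look like this (a sketch, not the actual pipeline; word_freq, sentences and num_sentences are placeholders produced by the parallel preprocessing step):

from gensim.models import Word2Vec

# word_freq: {word: count} dict merged from the parallel preprocessing workers
model = Word2Vec(vector_size=300, workers=8, min_count=5)
model.build_vocab_from_freq(word_freq)  # no single-threaded corpus scan needed
model.train(sentences, total_examples=num_sentences, epochs=5)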

ciropom avatar Mar 15 '23 16:03 ciropom

Yes, if you're able to "squeeze" the word counting into the same process that generates the data, that avoids the extra corpus loop altogether. Basically, amortize the data shuffling cost into an existing preprocessing step.

This would be true even if you didn't parallelize the preprocessing step. But even more so if you have it parallelized.

piskvorky avatar Mar 15 '23 19:03 piskvorky

I was wondering: can I do something similar for Phrases? It suffers from the same single-core limitation.

ciropom avatar Mar 16 '23 08:03 ciropom

Sure – Phrases is also a part of preprocessing. So conceptually, phrase detection belongs to the same computational step.

piskvorky avatar Mar 16 '23 11:03 piskvorky

OK, but practically, how can I do it? I don't see a build_vocab_from_freq there, or anything else that would let me overcome the single-core limitation.

ciropom avatar Mar 16 '23 17:03 ciropom