                        Really remove the 10000-token limit in [Word2Vec, FastText, Doc2Vec]
The *2Vec models have an underdocumented implementation limit in their Cython paths: any single text passed to training that's longer than 10000 tokens is silently truncated to 10000 tokens, discarding the rest. This may surprise users with larger texts, since much of each text (including words discovered during the vocabulary survey, which doesn't truncate texts) can thus be skipped during training.
Fixing this would make a warning like the one I objected to in PR #2861 irrelevant.
Fixing this would also fix #2583, the limit with respect to Doc2Vec inference.
As mentioned in #2583, one possible fix would be to auto-break user texts into smaller chunks. Possible fixes thus include:
- auto-breaking user texts into <10k-token internal texts (a user-side sketch of this appears after this list)
- using malloc, rather than a stack-allocated array of constant length, inside the Cython routines (might add allocate/free overhead & achieve less cache-locality than the current approach)
- using alloca instead of the constant-length stack array; alloca isn't an official part of the relevant C standard, but is likely available everywhere relevant (macOS, Windows, Linux, BSDs, other Unixes), with some risk of stack overflow if users provide gigantic texts
- doing a one-time per-thread allocation in Python-land that's usually reused in Cython-land for normal-sized texts, but replaced with a larger allocation whenever an oversized text is encountered
Each of these may need to be done slightly differently in the corpus_file high-thread-parallelism codepaths.
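For illustration, here's a user-side version of the first option. This is just a sketch, not gensim API: the class name and defaults are made up, and it simply pre-splits oversized texts before they ever reach the Cython routines.

```python
class ChunkedCorpus:
    """Re-iterable wrapper that splits oversized texts into <= max_len pieces."""

    def __init__(self, corpus, max_len=10_000):
        self.corpus = corpus      # any re-iterable of token lists
        self.max_len = max_len    # matches the current Cython truncation limit

    def __iter__(self):
        for text in self.corpus:
            # short texts pass through unchanged; long ones become several texts
            for start in range(0, len(text), self.max_len):
                yield text[start:start + self.max_len]

# usage (sketch): Word2Vec(sentences=ChunkedCorpus(my_corpus), ...)
```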
Yes, this limit is arbitrary and feels weird. I'd also prefer to either get rid of it or be more explicit about its existence.
I wouldn't phrase this as a "fix", though; it's an enhancement – in fact, such a "fix" would cause Gensim to deviate from its reference C implementation, which implements the same 10k-max-tokens-in-a-static-array truncation.
Would it surprise you to learn that the word2vec.c code has an even smaller limit of 1000, so the 10000-token limit already deviates from the original model? (And similarly, Gensim's FastText copied Gensim's Word2Vec rather than the 1024 limit in the fastText source.)
(I've changed the title to 'remove', though 'fix' can simply mean 'address/settle' without necessarily implying that the prior state was broken. But I do see the way this subtle limit has confused/surprised/misdirected people, including (in my opinion) code committers in the handling of #2861, as an actual flaw.)
Only to the degree I trust my memory (which is to say, not much).
Getting rid of the limit altogether would definitely result in a more beautiful algorithm, no matter the terminology. The limit exists purely for performance reasons. Some of your ideas around cleverer (more targeted) memory allocators / job splitting might give the best of both worlds here.
Hi, I have a question about the 10000-token limit in Word2Vec. I created a sample sentence consisting of the numbers from 1 to 200000 as tokens. I used it to train w2v and got a vector representation of each token. Shouldn't only the first 10000 tokens be available? If not, what specifically does this limit apply to?
The limit applies in the optimized, Cython-based training code – but not in the (pure-Python) survey of words that occurs during the .build_vocab() step.
So, your model will preallocate, & initialize random vectors for, every word that appears. But during training, only the tokens appearing in the 1st 10k positions will get training-updates.
(With enough training, this might show up in your synthetic test as the vectors for the 1st 10k positions tending towards a different average magnitude, and/or directional bias, than the other 190k vectors, which remain at their initial weak random positions.)
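A rough way to see this, as a sketch assuming gensim 4.x parameter names and a single 200k-token text like the one you describe:

```python
import numpy as np
from gensim.models import Word2Vec

# One "sentence" of 200,000 unique tokens, as in the synthetic test above.
text = [str(i) for i in range(1, 200_001)]
model = Word2Vec([text], vector_size=50, min_count=1, epochs=20, workers=1)

# Every token has a vector, but only the first 10k positions received updates.
vecs = np.array([model.wv[token] for token in text])
print("mean norm, 1st 10k positions:", np.linalg.norm(vecs[:10_000], axis=1).mean())
print("mean norm, remaining 190k:   ", np.linalg.norm(vecs[10_000:], axis=1).mean())
```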
Thanks for the answer. So sentences consisting of more than 10,000 tokens must be divided into smaller sentences of at most 10k tokens.
However, should the division simply be done every 10k tokens? For example, a sentence of 25,000 tokens would be divided into:
- 1st sentence: tokens with indexes 1-10000,
- 2nd sentence: tokens 10001-20000,
- 3rd sentence: tokens 20001-25000.
Or should the window size be taken into account at the edges when dividing? For example, a sentence of 25,000 tokens with window_size = 2 would be divided into:
- 1st sentence: tokens with indexes 1-10000,
- 2nd sentence: tokens 9998-19997,
- 3rd sentence: tokens 19995-25000.
Which approach is better?
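For concreteness, the two schemes in code (just a sketch: hypothetical function names, and 0-based Python slices rather than the 1-based ranges above):

```python
def split_plain(tokens, max_len=10_000):
    """Scheme 1: cut every max_len tokens, no overlap."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def split_overlapping(tokens, max_len=10_000, window=2):
    """Scheme 2: start each new chunk `window` tokens before the previous cut."""
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_len])
        start += max_len - window   # assumes window < max_len
    return chunks
```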
I doubt there'd be a detectable difference between the options you describe.
Each only slightly, & arbitrarily, changes a tiny number of window-sized contexts, and only in >10k-word docs, and (with a typical window-size of 5) only in ~0.05%-0.1% of all contexts. That's noise far below the other arbitrary mutations caused by random negative-sampling or frequent-word downsampling.
In PV-DBOW mode without optional word-training (dm=0, dbow_words=0), windows aren't relevant at all: every prediction is just docvec -> single-word, so the simplest policy is fine. (Your proposed "small overlaps" alternative just slightly over-samples a few words – but again, that's unlikely to have any measurable effect.)
I suppose if you were neurotic about the risk that some word->word relationship always appears across a 10k boundary (perhaps because of some pathologically regular formatting in the texts, which would itself be a bad sign for the usual word2vec/etc analysis) and thus could be completely erased by this splitting, even if it recurs many times throughout the corpus, you could try a policy something like:
"every time a >10k text is encountered, choose a split length of 10k - (current_epoch * window)" (or an otherwise random length that's always safely below 10k)
Then, on each epoch, a slightly different split-point is chosen, and thus across all epochs, no word->word appearances in any position would be consistently erased.
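A sketch of that policy follows. The class and helper names are made up; it assumes a re-iterable corpus (gensim re-iterates it each epoch) and uses gensim's CallbackAny2Vec hook to track the current epoch.

```python
from gensim.models.callbacks import CallbackAny2Vec

class EpochCounter(CallbackAny2Vec):
    """Counts completed epochs so the corpus wrapper can shift its split point."""
    def __init__(self):
        self.epoch = 0
    def on_epoch_end(self, model):
        self.epoch += 1

class ShiftingSplitCorpus:
    """Re-iterable corpus whose split length shrinks slightly each epoch."""
    def __init__(self, corpus, counter, max_len=10_000, window=5):
        self.corpus, self.counter = corpus, counter
        self.max_len, self.window = max_len, window

    def __iter__(self):
        # "10k - (current_epoch * window)"; assumes epochs * window << max_len
        split = self.max_len - self.counter.epoch * self.window
        for text in self.corpus:
            for start in range(0, len(text), split):
                yield text[start:start + split]

# usage (sketch):
# counter = EpochCounter()
# model = Word2Vec(ShiftingSplitCorpus(corpus, counter), callbacks=[counter], ...)
```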
But really, these algorithms aren't that sensitive to anything happening rarely in the corpus or the corpus-handling: the preponderance of usage examples, in a typically large corpus, still gives the same overall results.