
potential 'alias method' negative-sampling optimization from 'Koan' paper

Open · gojomo opened this issue 3 years ago · 5 comments

Per discussion on #1873, while I'm not convinced the 'Koan' paper about CBOW has identified an error or clear results-improvement, they do appear to have a faster negative-sampling method. Per my comment there:

Their use of an 'alias method' for negative-sampling looks like a separate, functionally-neutral but performance-enhancing change. It could be a nice separate optimization. I don't fully understand it yet – I think it'd use more memory than our current 'binary-search-over-list-of-split-points' approach, but remains O(1) (ignoring cache-locality effects) for arbitrarily large vocabularies. (The original word2vec.c approach of pre-filling a giant array was O(1) at a massive cost in memory that then also degraded its real performance due to cache misses, vs the binary-search that I replaced it with a long while back.)
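
For concreteness, here is a minimal pure-Python sketch of Vose's alias method (illustration only, not Gensim API; the function names are hypothetical). In the word2vec case, `probs` would be the smoothed unigram distribution that currently backs the cumulative table:

```python
import random

def build_alias_table(probs):
    """Vose's alias method: O(n) preprocessing of a probability list so that
    each later draw is O(1). Returns two same-length arrays (prob, alias),
    i.e. roughly twice the memory of a single cumulative table."""
    n = len(probs)
    prob = [0.0] * n
    alias = [0] * n
    scaled = [p * n for p in probs]          # rescale so the average bucket is 1.0
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]                  # bucket s keeps its own mass...
        alias[s] = l                         # ...and is topped up from bucket l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in large + small:                  # leftovers hold (numerically) 1.0
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias, rng=random):
    """One O(1) draw: pick a bucket uniformly, then a biased coin-flip decides
    between the bucket's own index and its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

Note the trade-off mentioned above: it keeps two vocabulary-sized arrays instead of one cumulative table, in exchange for O(1) rather than O(log V) work per negative sample.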

While there's a chance it might not meet Gensim's needs, due to memory overhead, or might prove overly complicated for a small benefit, it's worth investigating further as a potential enhancement. It's likely a small & self-contained change that'd be straightforward to tackle & test/evaluate, separate from everything else. (Maybe a "good first issue" for a new contributor?)

gojomo · Feb 22, 2022

Following the 'Koan' authors' references yields a spectacular step-by-step explanation of moving from naive sampling to state-of-the-art: https://www.keithschwarz.com/darts-dice-coins/

It starts with a few iterations building towards the original word2vec.c fill-a-giant-array approach; then talks about the binary-search I added, then improves through several more steps to the "Vose's alias method" state-of-the-art.

(In fact, its discussion of how binary-search can work even better with wildly-imbalanced distributions – as with usual word distros – suggests that short of a full move to state-of-the-art, a tiny change around our bisect_left() might offer a noticeable speedup. Namely, biasing the split points towards the head, not the midpoint, of the splits-array, would work better in the commonest-cases. That could happen either via some roughly-correct fixed headside-bias for zipfian distributions, or a precalculation of optimal binary-tree parent-node indexes rather than assuming mid = (lo + hi)/2 here.)
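
To make the comparison concrete, here is a rough pure-Python rendering of the current style of binary search over a cumulative table (analogous to, but not identical with, the Cython code), plus a naive stand-in for the head-bias idea that simply checks the most-frequent entries linearly before bisecting the tail. The names and the `HEAD` cutoff are hypothetical; the actual proposal above is to bias the binary-search split points themselves:

```python
from bisect import bisect_right
import random

def draw_bisect(cum_table, rng=random):
    """Roughly the current approach: O(log V) binary search for the first
    index whose cumulative count exceeds a uniform random target."""
    target = rng.randrange(cum_table[-1])
    return bisect_right(cum_table, target)

HEAD = 16  # hypothetical cutoff: how many most-frequent words to check first

def draw_head_first(cum_table, rng=random):
    """Crude stand-in for a head-biased search: because word frequencies are
    zipfian, many targets land in the first few entries, so check those
    linearly and only bisect the (less often reached) tail."""
    target = rng.randrange(cum_table[-1])
    for i in range(min(HEAD, len(cum_table))):
        if target < cum_table[i]:
            return i
    return bisect_right(cum_table, target, HEAD)
```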

gojomo · Feb 22, 2022

Gensim code in question:

https://github.com/RaRe-Technologies/gensim/blob/7e898f492ddd784962c58395a358998ee7ffb831/gensim/models/word2vec_inner.pyx#L217-L243

This optimization seems fairly deep down the (C) stack; @gojomo what is your intuition re. its impact on end-to-end performance?

piskvorky · Feb 23, 2022

From long ago, I recall my switch from the old method to this binary-search, via the bisect_left() method, caused a noticeable speedup in lengthy runs, so there may be remaining potential in further changes. A quick sanity check on the magnitude of improvement still possible, before doing more work, might be to replace bisect_left() with something quick & O(1) – like returning a constant, or a simple incrementing index – & see how much faster that (incorrect) alternative runs while performing the same volume of bulk operations. (If the speedup is negligible, further optimization here needn't be considered.)
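
As a rough illustration of the shape of that check – written in pure Python only for readability, with hypothetical names; the real comparison would have to go inside the Cython hot loop, where interpreter overhead doesn't swamp the measurement:

```python
import random
import timeit
from bisect import bisect_right

# Build a fake zipf-ish cumulative table for a 100k-word vocabulary.
V = 100_000
counts = [int(1_000_000 / (rank + 1) ** 0.75) + 1 for rank in range(V)]
cum_table, total = [], 0
for c in counts:
    total += c
    cum_table.append(total)

def draw_real():
    """Current-style draw: RNG plus binary search."""
    return bisect_right(cum_table, random.randrange(total))

def draw_stub():
    """Deliberately wrong O(1) stand-in: same RNG cost, no search.
    The gap between the two bounds what a smarter search could save."""
    random.randrange(total)
    return 0

print("bisect:", timeit.timeit(draw_real, number=500_000))
print("stub:  ", timeit.timeit(draw_stub, number=500_000))
```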

Relatedly, some Python profiling suggested that operations reliant on random-number generation still accounted for a significant share of total runtime, which led to changes generating numbers in larger batches rather than one at a time – also yielding noticeable speedups. The Cython RNG – already minimal, with its next_random updated by repeated code in each place it's used (which perhaps should be an inline function for clarity/robustness) – may still be amenable to some sort of large-batch or repeated-use optimization, since we don't strictly need strongly random draws everywhere, just balanced sampling over the whole training run. That was the subject of my next-paragraph aside in the above-linked #1873 comment:

(There are likely other still low-hanging-fruit performance optimizations around some of the negative-sampling, reduced-windows, & frequent-word-downsampling code - such as using a faster RNG or simply reusing a smaller number of random-draws repeatedly. EG: just permute the values {1..window} once per epoch, then cycle through them for each center-word - it's not as random, but probably good enough. Something similar might work for negative-sampling or downsampling.)
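
A tiny sketch of that permute-and-cycle idea for the reduced windows (illustrative only, with hypothetical names – not the actual Cython code path):

```python
import random
from itertools import cycle, islice

def reduced_window_stream(window, rng=random):
    """Shuffle the possible effective window sizes {1..window} once,
    then cycle through them, instead of a fresh random draw per center word."""
    values = list(range(1, window + 1))
    rng.shuffle(values)
    return cycle(values)

# e.g. one stream per epoch; each next() gives the effective window
# used for the next center word in the corpus
windows = reduced_window_stream(5)
print(list(islice(windows, 12)))
```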

But that may require a bit more R&D & care around maintaining 'enough' randomness & not doing anything wrong in rare worst-case situations. Deciding whether replacing the binary-search negative-sampling will help, & adopting a standard state-of-the-art algorithm as proposed here, is more straightforward.

gojomo · Feb 23, 2022

OK, makes sense. Are you able to do that quick sanity check, estimating an upper bound on the achievable speed-up?

piskvorky · Feb 23, 2022

Sure!

gojomo · Feb 23, 2022