[early WIP] Fix/rationalize loss-tallying
PR to eventually address loss-tallying issues: #2617, #2735, #2743. Early tinkering stage.
Changes so far in Word2Vec:
- using `float64` for all loss tallying
- resetting tally to 0.0 per epoch, but remembering history elsewhere for the duration of the current `train()` call
- micro-tallying into a per-batch value rather than the global tally, then adding to the global tally rather than replacing it (see the sketch below)
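For concreteness, a minimal sketch of that tallying pattern; the class & attribute names here are hypothetical illustrations, not the actual Word2Vec internals:

```python
import numpy as np

class LossTally:
    """Illustrative sketch of the bullets above; names are hypothetical."""

    def __init__(self):
        self.running_training_loss = np.float64(0.0)  # global tally, in float64
        self.epoch_losses = []  # per-epoch history, kept for the current train() call

    def tally_batch(self, per_example_losses):
        # micro-tally into a per-batch value first...
        batch_loss = np.float64(0.0)
        for loss in per_example_losses:
            batch_loss += loss
        # ...then *add* to the global tally rather than replacing it, so
        # concurrent workers' contributions accumulate instead of clobbering (#2743)
        self.running_training_loss += batch_loss

    def end_epoch(self):
        # remember this epoch's total, then reset the tally to 0.0 for the next epoch
        self.epoch_losses.append(self.running_training_loss)
        self.running_training_loss = np.float64(0.0)

tally = LossTally()
tally.tally_batch([0.5, 1.25, 0.75])
tally.end_epoch()
print(tally.epoch_losses[0])  # 2.5
```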
Though the real goal is sensible loss-tallying across all classes, I think these small changes already remedy #2735 (float32 swallows large loss-values) & #2743 (worker losses clobber each other).
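The float32 part is easy to demonstrate: once a float32 tally grows past 2^24, it can no longer register typical small per-example increments at all:

```python
import numpy as np

tally32 = np.float32(2.0 ** 24)    # 16777216.0 -- the edge of float32's ~7-digit precision
print(tally32 + np.float32(1.0))   # 16777216.0 -- the added loss is swallowed (#2735)

tally64 = np.float64(2.0 ** 24)
print(tally64 + np.float64(1.0))   # 16777217.0 -- float64 keeps the increment
```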
An oddity from looking at per-epoch loss across a full run: all my `hs` runs have shown increasing loss every epoch, which makes no sense to me. And yet, the models at the end have moved word-vectors to more useful places (thus passing our minimal sanity-tests). I don't think my small changes could have caused this oddity (but maybe); I suspect something pre-existing in HS-mode loss-tallying is the real reason. When I have a chance I'll compare against the loss patterns for similar modes & data in something like the Facebook FastText code, which also reports a running loss.
Training FB FastText in HS CBOW mode with no ngrams (`./fasttext cbow -verbose 5 -maxn 0 -bucket 0 -lr 0.025 -loss hs -thread 3 -input ~/Documents/Dev/gensim/enwik9 -output enwik9-cbow-nongrams-lr025-hs`) shows decreasing loss reports through the course of training, as expected, and unlike the strangely-increasing per-epoch loss our code (at least in this PR) reports. But final results on a few quick `most_similar()` ops seem very similar. So something remains odd about our loss reporting, especially in HS mode.
As a point of comparison, Facebook's fasttext reports an "average loss", divided over some trial-count, like so:
```
(base) gojomo@Gobuntu-2020:~/Documents/Dev/fasttext/fastText-0.9.2$ time ./fasttext cbow -verbose 5 -maxn 0 -bucket 0 -lr 0.025 -loss hs -thread 3 -input ~/Documents/Dev/gensim/enwik9 -output enwik9-cbow-nongrams-lr025-hs
Read 142M words
Number of words:  847816
Number of labels: 0
Progress:  39.8% words/sec/thread:  431099 lr:  0.015052 avg.loss:  5.263475 ETA:   0h 5m31s
Progress:  45.4% words/sec/thread:  429306 lr:  0.013645 avg.loss:  4.725245 ETA:   0h 5m 1s
Progress:  58.6% words/sec/thread:  426932 lr:  0.010339 avg.loss:  3.865230 ETA:   0h 3m50s
Progress: 100.0% words/sec/thread:  422384 lr:  0.000000 avg.loss:  2.483185 ETA:   0h 0m 0s
```
Gensim should probably collect & report *2Vec-class training loss in a comparable way, so that the numbers from algorithmically-analogous runs are broadly similar, both for familiarity to users & as a cross-check of whatever it is we're doing.
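As one sketch of what comparable reporting could look like (normalizing by a count of training examples is an assumption here, not something confirmed from FB's code), the names below are hypothetical rather than an existing gensim API:

```python
class AverageLoss:
    """Hypothetical helper for fastText-style 'avg.loss' reporting."""

    def __init__(self):
        self.loss_sum = 0.0
        self.n_examples = 0  # assumed per-example; see the question below

    def add(self, batch_loss, n_examples):
        self.loss_sum += batch_loss
        self.n_examples += n_examples

    def average(self):
        # normalize, so magnitudes stay comparable across corpus sizes & epochs
        return self.loss_sum / max(self.n_examples, 1)

reporter = AverageLoss()
reporter.add(5263.0, 1000)
print(f"avg.loss: {reporter.average():.6f}")  # avg.loss: 5.263000
```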
+1 on matching FB's logic. What is "trial-count"? Is the average taken over words or something else?
Unsure; their C++ (with a separate class for 'loss') is different enough from our code that I couldn't tell at a glance & will need to study it a bit more.
@gojomo cleaning up the loss-tallying logic still very much welcome. Did you figure out the "increasing loss" mystery?
We're planning to make a Gensim release soon – whether this PR gets in now or later, it will be a great addition.
These changes would likely apply, & help a bit in Word2Vec, with just a little adaptation to the current `develop` branch. I could take a look this week & wouldn't expect any complications.
But getting consistent loss-tallying working in Doc2Vec & FastText, & ensuring a similar calculation & roughly-similar loss magnitudes versus other libraries (mainly Facebook FastText), would require more effort that's hard to estimate. We kind of need someone who both (1) needs it & (2) can get deep into understanding the code, to rationalize the whole thing.
Never figured out why our `hs` mode reports growing loss despite the model improving as expected on other checks.