[early WIP] Fix/rationalize loss-tallying
PR to eventually address loss-tallying issues: #2617, #2735, #2743. Early tinkering stage.
Changes so far in Word2Vec:
- using `float64` for all loss tallying
- resetting tally to 0.0 per epoch, but remembering history elsewhere for the duration of the current `train()` call
- micro-tallying into a per-batch value rather than the global tally, then adding to the global tally rather than replacing it (see the sketch below)
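For concreteness, a minimal sketch of that tallying pattern; the class & attribute names here are hypothetical illustrations, not the actual Word2Vec internals:

```python
import numpy as np

class LossTally:
    """Illustrative sketch of the bullets above; names are hypothetical."""

    def __init__(self):
        self.running_training_loss = np.float64(0.0)  # global tally, in float64
        self.epoch_losses = []  # per-epoch history, kept for the current train() call

    def tally_batch(self, per_example_losses):
        # micro-tally into a per-batch value first...
        batch_loss = np.float64(0.0)
        for loss in per_example_losses:
            batch_loss += loss
        # ...then *add* to the global tally rather than replacing it, so
        # concurrent workers' contributions accumulate instead of clobbering (#2743)
        self.running_training_loss += batch_loss

    def end_epoch(self):
        # remember this epoch's total, then reset the tally to 0.0 for the next epoch
        self.epoch_losses.append(self.running_training_loss)
        self.running_training_loss = np.float64(0.0)

tally = LossTally()
tally.tally_batch([0.5, 1.25, 0.75])
tally.end_epoch()
print(tally.epoch_losses[0])  # 2.5
```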
Though the real goal is sensible loss-tallying across all classes, I think these small changes already remedy #2735 (float32 swallows large loss-values) & #2743 (worker losses clobber each other).
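The float32 part is easy to demonstrate: once a float32 tally grows past 2^24, it can no longer register typical small per-example increments at all:

```python
import numpy as np

tally32 = np.float32(2.0 ** 24)    # 16777216.0 -- the edge of float32's ~7-digit precision
print(tally32 + np.float32(1.0))   # 16777216.0 -- the added loss is swallowed (#2735)

tally64 = np.float64(2.0 ** 24)
print(tally64 + np.float64(1.0))   # 16777217.0 -- float64 keeps the increment
```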
An oddity from looking at per-epoch loss across a full run: all my `hs` runs have shown increasing loss every epoch, which makes no sense to me. And yet, the models at the end have moved word-vectors to more useful places (thus passing our minimal sanity-tests). I don't think my small changes could have caused this oddity (but maybe); I suspect something pre-existing in HS-mode loss-tallying is the real reason. When I have a chance I'll compare against the loss patterns for similar modes & data in something like the Facebook FastText code, which also reports a running loss.
Training FB FastText in HS CBOW mode with no ngrams (`./fasttext cbow -verbose 5 -maxn 0 -bucket 0 -lr 0.025 -loss hs -thread 3 -input ~/Documents/Dev/gensim/enwik9 -output enwik9-cbow-nongrams-lr025-hs`) shows decreasing loss reports through the course of training, as expected, and unlike the strangely-increasing per-epoch loss our code (at least in this PR) reports. But final results on a few quick `most_similar()` ops seem very similar. So something remains odd about our loss reporting, especially in HS mode.
As a point of comparison, Facebook's fasttext reports an "average loss", divided over some trial-count, like so:
```
(base) gojomo@Gobuntu-2020:~/Documents/Dev/fasttext/fastText-0.9.2$ time ./fasttext cbow -verbose 5 -maxn 0 -bucket 0 -lr 0.025 -loss hs -thread 3 -input ~/Documents/Dev/gensim/enwik9 -output enwik9-cbow-nongrams-lr025-hs
Read 142M words
Number of words:  847816
Number of labels: 0
Progress:  39.8% words/sec/thread:  431099 lr:  0.015052 avg.loss:  5.263475 ETA:   0h 5m31s
Progress:  45.4% words/sec/thread:  429306 lr:  0.013645 avg.loss:  4.725245 ETA:   0h 5m 1s
Progress:  58.6% words/sec/thread:  426932 lr:  0.010339 avg.loss:  3.865230 ETA:   0h 3m50s
Progress: 100.0% words/sec/thread:  422384 lr:  0.000000 avg.loss:  2.483185 ETA:   0h 0m 0s
```
Gensim should probably collect & report *2Vec-class training loss in a comparable way, so that the numbers from algorithmically-analogous runs are broadly similar, both for familiarity to users & as a cross-check of whatever it is we're doing.
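As one sketch of what comparable reporting could look like (normalizing by a count of training examples is an assumption here, not something confirmed from FB's code), the names below are hypothetical rather than an existing gensim API:

```python
class AverageLoss:
    """Hypothetical helper for fastText-style 'avg.loss' reporting."""

    def __init__(self):
        self.loss_sum = 0.0
        self.n_examples = 0  # assumed per-example; see the question below

    def add(self, batch_loss, n_examples):
        self.loss_sum += batch_loss
        self.n_examples += n_examples

    def average(self):
        # normalize, so magnitudes stay comparable across corpus sizes & epochs
        return self.loss_sum / max(self.n_examples, 1)

reporter = AverageLoss()
reporter.add(5263.0, 1000)
print(f"avg.loss: {reporter.average():.6f}")  # avg.loss: 5.263000
```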
+1 on matching FB's logic. What is "trial-count"? Is the average taken over words or something else?
Unsure; their C++ (with a separate class for 'loss') is different enough from our code that I couldn't tell at a glance & will need to study it a bit more.
@gojomo cleaning up the loss-tallying logic still very much welcome. Did you figure out the "increasing loss" mystery?
We're planning to make a Gensim release soon – whether this PR gets in now or later, it will be a great addition.
These changes would likely apply, & help a bit in Word2Vec, with just a little adaptation to the current `develop` branch. I could take a look this week & wouldn't expect any complications.
But getting consistent loss-tallying working in Doc2Vec & FastText, & ensuring a similar calculation & roughly-similar loss magnitudes versus other libraries (mainly Facebook FastText), would require more effort that's hard to estimate. We kind of need someone who both (1) needs it & (2) can get deep into understanding the code, to rationalize the whole thing.
Never figured out why our `hs` mode reports growing loss despite the model improving as expected on other checks.