
get_latest_training_loss returns 0

Open ginward opened this issue 2 years ago • 8 comments

Problem description

It seems that the get_latest_training_loss function in FastText only ever returns 0. Neither gensim 4.1.0 nor 4.0.0 works.

from gensim.models.callbacks import CallbackAny2Vec
from pprint import pprint as print
from gensim.models.fasttext import FastText
from gensim.test.utils import datapath

class callback(CallbackAny2Vec):
    '''Callback to print loss after each epoch.'''

    def __init__(self):
        self.epoch = 0

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        print('Loss after epoch {}: {}'.format(self.epoch, loss))
        self.epoch += 1

# Set file names for train and test data
corpus_file = datapath('lee_background.cor')

model = FastText(vector_size=100, callbacks=[callback()])

# build the vocabulary
model.build_vocab(corpus_file=corpus_file)

# train the model
model.train(
    corpus_file=corpus_file, epochs=model.epochs,
    total_examples=model.corpus_count, total_words=model.corpus_total_words,
    callbacks=model.callbacks, compute_loss=True,
)

print(model)
'Loss after epoch 0: 0.0'
'Loss after epoch 1: 0.0'
'Loss after epoch 2: 0.0'
'Loss after epoch 3: 0.0'
'Loss after epoch 4: 0.0'

If FastText does not currently support get_latest_training_loss, the documentation here needs to be removed:

https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastText.get_latest_training_loss

Versions

I have tried this in three different environments and none of them works.

First environment:

[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import platform; print(platform.platform())
Linux-3.10.0-1160.36.2.el7.x86_64-x86_64-with-glibc2.17
>>> import sys; print("Python", sys.version)
Python 3.9.6 | packaged by conda-forge | (default, Jul 11 2021, 03:39:48)
[GCC 9.3.0]
>>> import struct; print("Bits", 8 * struct.calcsize("P"))
Bits 64
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.21.2
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.7.1
>>> import gensim; print("gensim", gensim.__version__)
gensim 4.1.0
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 0

Second environment:

Python 3.9.5 (default, May 18 2021, 12:31:01)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import platform; print(platform.platform())
macOS-10.16-x86_64-i386-64bit
>>> import sys; print("Python", sys.version)
Python 3.9.5 (default, May 18 2021, 12:31:01)
[Clang 10.0.0 ]
>>> import struct; print("Bits", 8 * struct.calcsize("P"))
Bits 64
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.20.3
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.7.1
>>> import gensim; print("gensim", gensim.__version__)
gensim 4.1.0
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 0

Third environment:

Python 3.9.5 (default, May 18 2021, 12:31:01)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import platform; print(platform.platform())
macOS-10.16-x86_64-i386-64bit
>>> import sys; print("Python", sys.version)
Python 3.9.5 (default, May 18 2021, 12:31:01)
[Clang 10.0.0 ]
>>> import struct; print("Bits", 8 * struct.calcsize("P"))
Bits 64
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.20.3
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.7.1
>>> import gensim; print("gensim", gensim.__version__)
/Users/jinhuawang/miniconda3/lib/python3.9/site-packages/gensim/similarities/__init__.py:15: UserWarning: The gensim.similarities.levenshtein submodule is disabled, because the optional Levenshtein package <https://pypi.org/project/python-Levenshtein/> is unavailable. Install Levenhstein (e.g. `pip install python-Levenshtein`) to suppress this warning.
  warnings.warn(msg)
gensim 4.0.0
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 0

ginward avatar Sep 08 '21 15:09 ginward

This is related to #2658, which probably should not be closed. @gojomo It seems that FastText currently does not return the correct loss from get_latest_training_loss.

ginward avatar Sep 10 '21 03:09 ginward

#2658 is closed as a duplicate, because #2617 is a more comprehensive discussion of what's broken (or simply never implemented) in the *2Vec models.

The docs are wrong to imply there's any loss-tallying in FastText - it's never been implemented. That could be corrected right away, by overriding the superclass method with another that documents/warns that there's no loss-tracking yet for the FastText model. Actually adding loss-tracking to FastText (and Doc2Vec) will require a bit more design & work, as hinted in #2617 (& some of the other issues it references).
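
For illustration, a minimal sketch of the override described above, written as a user-side subclass rather than an actual gensim patch (the class name is hypothetical, and the warning text is only a suggestion):

import warnings

from gensim.models.fasttext import FastText

class FastTextNoLoss(FastText):
    '''Hypothetical FastText variant whose loss accessor warns instead of
    silently returning the inherited, never-updated tally.'''

    def get_latest_training_loss(self):
        warnings.warn(
            "Loss tracking is not implemented for FastText; this value is always 0.0.",
            UserWarning,
        )
        return 0.0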

gojomo avatar Sep 10 '21 07:09 gojomo

#2658 is closed as a duplicate, because #2617 is a more comprehensive discussion of what's broken (or simply never implemented) in the *2Vec models.

The docs are wrong to imply there's any loss-tallying in FastText - it's never been implemented. That could be corrected right away, by overriding the superclass method with another that documents/warns that there's no loss-tracking yet for the FastText model. Actually adding loss-tracking to FastText (and Doc2Vec) will require a bit more design & work, as hinted in #2617 (& some of the other issues it references).

I see. But if loss-tallying has never been implemented, how do we know whether training should be stopped early, or whether it needs more epochs?

ginward avatar Sep 10 '21 08:09 ginward

I see. But if loss-tallying has never been implemented, how do we know whether training should be stopped early, or whether it needs more epochs?

You'd have to use other heuristics. AFAIK, neither the original Google word2vec.c (on which Gensim's original implementation of the word2vec algorithm was closely based) nor the Facebook fasttext tool even offers early-stopping as an option: you pick your epochs & live with it until either training finishes or you destructively interrupt the training-in-progress. If you later suspect it was too little or too much, you try another value in a wholly-separate run.

They do each, however, show a running loss that a user can watch for hints.

It's definitely a desirable feature to have - hence the many requests, & partial/buggy implementation inside Gensim's Word2Vec, & the open #2617 expressing a goal of fixing/completing the work! It's just not been done, or urgently-required by someone who was sufficiently skilled & motivated to contribute/fund the necessary work, yet.

(Note, though, running-loss is also somewhat prone to misinterpretation, with some people thinking it's an accurate measure of model quality for other purposes, and that, of a set of candidate models, the one with the lowest loss will work best for outside purposes. That's not inherently the case, as it's just a report on the model's internal training goal. That internal goal is, if all sorts of other things are also done right, at best only an approximation of fitness for the real external purposes where people use word-vectors. For example, a massively-'overfit' model can have an arbitrarily low training loss, while being entirely useless for other tasks.)
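
For the Word2Vec case mentioned above, a minimal sketch of the usual workaround (gensim 4.x assumed; the callback name is made up, and the per-epoch figure is just the difference of the cumulative running loss, subject to the caveats tracked in #2617):

from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec
from gensim.test.utils import common_texts

class EpochLossLogger(CallbackAny2Vec):
    '''Reports the per-epoch change in Word2Vec's cumulative running loss.'''

    def __init__(self):
        self.epoch = 0
        self.last_cumulative_loss = 0.0

    def on_epoch_end(self, model):
        # get_latest_training_loss() is cumulative over the whole train() call
        cumulative = model.get_latest_training_loss()
        print('Loss in epoch {}: {}'.format(self.epoch, cumulative - self.last_cumulative_loss))
        self.last_cumulative_loss = cumulative
        self.epoch += 1

model = Word2Vec(
    common_texts, vector_size=100, min_count=1, epochs=5,
    compute_loss=True, callbacks=[EpochLossLogger()],
)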

gojomo avatar Sep 10 '21 18:09 gojomo

After reading some replies here and on Stack Overflow, I'm aware that loss-tallying is yet to be implemented. However, contrary to what is said here, the running loss after each epoch is also always zero for me.

I'm running gensim==4.0.1 and my example code is:

from gensim.models.callbacks import CallbackAny2Vec
from gensim.models.doc2vec import Doc2Vec

class LossLogger(CallbackAny2Vec):
    '''Callback to print loss after each epoch.'''

    def __init__(self):
        self.epoch = 0

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        print('Loss after epoch {}: {}'.format(self.epoch, loss))
        self.epoch += 1

callbacks = [LossLogger()]

# `documents` is a pre-built corpus of TaggedDocument objects (not shown)
doc2vec_model = Doc2Vec(
    documents,
    vector_size=128,
    window=0,
    min_count=5,
    dm=0,
    sample=0.0001,
    workers=4,
    epochs=10,
    alpha=0.025,
    seed=42,
    compute_loss=True,
    callbacks=callbacks,
)
Loss after epoch 1: 0.0
Loss after epoch 2: 0.0
Loss after epoch 3: 0.0
Loss after epoch 4: 0.0
Loss after epoch 5: 0.0
Loss after epoch 6: 0.0
Loss after epoch 7: 0.0
Loss after epoch 8: 0.0
Loss after epoch 9: 0.0

Why does model.get_latest_training_loss() always return 0, even though the model was initialized with compute_loss=True?

cpuodzius avatar Sep 15 '21 11:09 cpuodzius

Gensim *2Vec model loss-tallying is...
...in Word2Vec, buggy/incomplete (but somewhat usable w/ workarounds).
...in Doc2Vec, never yet implemented, hence always 0.
...in FastText, never yet implemented, hence always 0.

But since Doc2Vec & FastText inherit fragments of the Word2Vec implementation (the initialization option & the accessor method), it looks like it should work. But still, there's no tallying behind the scenes.
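
A quick way to see that inheritance (gensim 4.x assumed): the accessor that FastText and Doc2Vec expose is literally Word2Vec's, with no tallying of their own behind it.

from gensim.models import Doc2Vec, FastText, Word2Vec

# Both subclasses inherit the accessor unchanged, so calling it never fails,
# but it only reports Word2Vec's running_training_loss attribute, which their
# training code never updates.
print(FastText.get_latest_training_loss is Word2Vec.get_latest_training_loss)  # True
print(Doc2Vec.get_latest_training_loss is Word2Vec.get_latest_training_loss)   # True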

gojomo avatar Sep 15 '21 15:09 gojomo

Shouldn't we raise NotImplementedError instead of returning zero? It'd be less surprising for the user.

mpenkov avatar Dec 04 '21 06:12 mpenkov

Shouldn't we raise NotImplementedError instead of returning zero? It'd be less surprising for the user.

That'd be better than the current mysteriously-incomplete behavior! But such hard failures should start as soon as the user takes any step guaranteed to disappoint - such as initializing a model that can't track loss with compute_loss=True. And of course the fasttext doc-comments also shouldn't be describing compute_loss and get_latest_training_loss() as if they were functional while they're not.
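
A minimal sketch of what that fail-fast behavior could look like, written here as a hypothetical user-side subclass rather than an actual gensim change:

from gensim.models.fasttext import FastText

class StrictFastText(FastText):
    '''Hypothetical variant that fails as soon as loss tracking is requested.'''

    def train(self, *args, compute_loss=False, **kwargs):
        if compute_loss:
            raise NotImplementedError(
                "FastText does not implement loss tracking; compute_loss=True has no effect."
            )
        return super().train(*args, compute_loss=compute_loss, **kwargs)

    def get_latest_training_loss(self):
        raise NotImplementedError("Loss tracking is not implemented for FastText.")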

gojomo avatar Dec 05 '21 23:12 gojomo