
Track training loss while using Doc2Vec

skwolvie opened this issue 3 years ago • 5 comments

Problem description

I am trying to track training loss with the Doc2Vec algorithm, and it failed. Is there a way to track training loss in Doc2Vec? Also, I didn't find any documentation on performing early stopping during the Doc2Vec training phase.

The similarity scores vary a lot depending on the number of epochs, and I want to stop training via callbacks once the model has reached optimal capacity. I have used Keras, which has an EarlyStopping feature; I'm not sure how to do this with Gensim models.

Any response is appreciated. Thank you!

Steps/code/corpus to reproduce

from gensim.models.callbacks import CallbackAny2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

class EpochLogger(CallbackAny2Vec):
    '''Callback to log information about training'''

    def __init__(self):
        self.epoch = 0

    def on_epoch_begin(self, model):
        print("Epoch #{} start".format(self.epoch))

    def on_epoch_end(self, model):
        print("Epoch #{} end".format(self.epoch))
        self.epoch += 1

epoch_logger = EpochLogger()

class LossLogger(CallbackAny2Vec):
    '''Output loss at each epoch'''
    def __init__(self):
        self.epoch = 1
        self.losses = []

    def on_epoch_begin(self, model):
        print(f'Epoch: {self.epoch}', end='\t')

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        self.losses.append(loss)
        print(f'  Loss: {loss}')
        self.epoch += 1

loss_logger = LossLogger()
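
For reference, this callback pattern does report loss with Gensim's Word2Vec, where passing compute_loss=True enables the (imperfect) running tally that get_latest_training_loss() reads. A minimal sketch with a hypothetical toy corpus (gensim 4.x argument names):

from gensim.models import Word2Vec

toy_corpus = [["track", "training", "loss"],
              ["early", "stopping", "callback"]]  # hypothetical data
w2v = Word2Vec(toy_corpus,
               vector_size=50,
               min_count=1,
               compute_loss=True,  # required, or the tally stays at 0
               callbacks=[LossLogger()])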

def train_model(data, ids, destination, alpha):
    # max_epochs, vec_size and cores are assumed to be defined elsewhere

    print('\tTagging data .. ')
    tagged_data = [TaggedDocument(words=word_tokenize(str(_d).lower()), tags=[str(ids[i])])
                   for i, _d in enumerate(data)]

    print('\tPreparing model with the following parameters: epochs = {}, vector_size = {}, alpha = {} .. '.
          format(max_epochs, vec_size, alpha))

    model = Doc2Vec(vector_size=vec_size,
                    workers=cores//2,
                    alpha=alpha,  # initial learning rate
                    min_count=2,  # ignore words with a total frequency below this
                    dm_mean=1,  # use the mean (not the sum) of context vectors
                    dm=1,  # PV-DM over PV-DBOW
                    callbacks=[epoch_logger, loss_logger])

    model.build_vocab(tagged_data, keep_raw_vocab=False, progress_per=100000)
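
    # (The excerpt stops here; presumably a single training call along
    # these lines follows, with max_epochs assumed defined elsewhere.)
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=max_epochs)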

Output and traceback

2017 4673
        Tagging data ..
        Preparing model with the following parameters: epochs = 50, vector_size = 100, alpha = 0.01 ..
        Beginning model training ..
                Iteration 0
                Learning Rate =  0.01
Epoch #0 start
Epoch: 1        Epoch #0 end
Traceback (most recent call last):
    loss = model.get_latest_training_loss()
AttributeError: 'Doc2Vec' object has no attribute 'get_latest_training_loss'

skwolvie · Oct 18 '20

Loss-tallying has never been implemented for Gensim's Doc2Vec model (see the pending open issue #2617), and it is pretty sketchy in the only place where it is implemented, Word2Vec (#2735, #2743) - including odd behavior (reported loss rising during otherwise-apparently-effective training, and a mismatch with the rough magnitudes of the similar loss-reporting in Facebook's FastText) that might indicate further undiagnosed bugs.

So there are not yet reliable hooks for early stopping in any of the *2Vec models.
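
If you need something early-stopping-like today, the closest workaround is to monitor an external, task-specific metric from a callback and abort training by raising an exception. A rough sketch, not an official API - the eval_fn scoring function here is a hypothetical user-supplied stand-in:

from gensim.models.callbacks import CallbackAny2Vec

class EarlyStop(Exception):
    pass

class MetricEarlyStopper(CallbackAny2Vec):
    '''Stop training when a user-supplied metric stops improving.'''
    def __init__(self, eval_fn, patience=2):
        self.eval_fn = eval_fn  # hypothetical: takes the model, returns a score (higher = better)
        self.patience = patience
        self.best = float('-inf')
        self.bad_epochs = 0

    def on_epoch_end(self, model):
        score = self.eval_fn(model)
        if score > self.best:
            self.best, self.bad_epochs = score, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                # Raising aborts .train(); the model keeps whatever
                # state it had reached before the exception.
                raise EarlyStop(f'no improvement in {self.patience} epochs')

Wrap the .train() call in try/except EarlyStop to resume normal flow after the stop.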

gojomo · Oct 19 '20

Then how do I choose the best model? Should I just blindly train the model for many more epochs, rather than the standard 20 epochs / 5 iterations? Will that give better results?

Do you happen to know of *2Vec implementations in libraries other than Gensim that can do this?

skwolvie · Oct 20 '20

The internal loss can't tell you which model is best for a downstream purpose, only that the model is no longer benefiting, on its own internal objective, from further training. (A model settling at a lower internal loss may be worse, for some outside purpose, than one settling at a higher internal loss.) So a lot of trial and error – though perhaps assisted by automated parameter search – is involved in picking the best model.

(When you say "standard 20 epochs 5 iterations", I suspect you might be making a common training mistake, since those usually shouldn't be separate values. But your code excerpt doesn't show your call(s) to .train(), so I'm not sure what you're doing. See https://stackoverflow.com/questions/62801052/my-doc2vec-code-after-many-loops-of-training-isnt-giving-good-results-what-m for more info, and the sketch below.)
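
To make that concrete: the oft-copied anti-pattern loops over .train() while decaying alpha by hand, whereas a single call lets Gensim manage the learning-rate schedule itself. A sketch of both, assuming the tagged_data and max_epochs names from the earlier excerpt:

# Anti-pattern (don't do this): many train() calls with manual alpha decay.
# for epoch in range(max_epochs):
#     model.train(tagged_data, total_examples=model.corpus_count, epochs=1)
#     model.alpha -= 0.002
#     model.min_alpha = model.alpha

# Preferred: one call; epochs and alpha decay are handled internally.
model.train(tagged_data, total_examples=model.corpus_count, epochs=max_epochs)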

I don't know of any library offering loss-reporting from a Doc2Vec implementation, but I'm not familiar with all the implementations, especially in non-Python languages.

gojomo · Oct 20 '20

Is there a solution here? I am getting a training loss of 0 for every epoch. After the first epoch the results are pretty nice, but after the second they are terrible. Yet it's a black box and I have no ability to monitor the loss. Thoughts? Is there another implementation of Word2Vec outside of Gensim?

griff4692 · Jun 28 '21

@griff4692 - There are other Word2Vec options, but I'm not familiar with an alternate Python implementation of the "Paragraph Vectors" algorithm (aka Doc2Vec), much less one with loss-reporting.

If results after one epoch are good, but after more epochs are bad, there are probably other serious errors in your code which would need to be reviewed to be discovered. (That is: improvising an early-stop via loss-monitoring is probably the wrong fix.) See for example this SO answer about some really-misguided code that's unfortunately very common in oft-mimicked low-quality online examples:

https://stackoverflow.com/questions/62801052/my-doc2vec-code-after-many-loops-of-training-isnt-giving-good-results-what-m

Real improvement to the loss-tracking in Gensim's Doc2Vec (and the other *2Vec models) is awaiting a code contributor who can rationalize and fix the many gaps and problems outlined in #2617 and related issues like this one. But to discuss other workarounds that might help in your case, I'd suggest posting more details to the project discussion list (https://groups.google.com/forum/#!forum/gensim) or a detailed SO question.
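
In the meantime, one concrete workaround for a "good after epoch 1, bad after epoch 2" situation is to snapshot the model after every epoch, then pick the best snapshot by external evaluation afterwards. A sketch (the filename prefix is illustrative):

from gensim.models.callbacks import CallbackAny2Vec

class EpochSaver(CallbackAny2Vec):
    '''Save a model snapshot after each epoch for later comparison.'''
    def __init__(self, prefix='doc2vec_snapshot'):
        self.prefix = prefix
        self.epoch = 0

    def on_epoch_end(self, model):
        model.save(f'{self.prefix}_epoch{self.epoch}.model')  # illustrative path
        self.epoch += 1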

gojomo · Jun 28 '21