gensim
Fix computation of `Word2Vec` loss & add loss value to logging string
The current computation of the word2vec loss has flaws; I address them in this PR.
This PR rewrites the loss computation for both CBOW and SG Word2Vec. The value that is computed and reported is the running average NCE loss within the current epoch: at the start of each new epoch, the counters are reset to 0 and a new average is computed. This was not the case before: the loss was accumulated over the whole training run, while the total_examples used for averaging was only incremented within one epoch, which made the reported value uninformative as well as incorrect in the implementation. Moreover, reporting the average loss over an epoch seems closer to the spirit of what the original code was trying to achieve.
The computation of the word2vec loss was flawed in several ways:
- a race condition on the running_training_loss attribute, which is updated concurrently in a GIL-free portion of the code;
- as mentioned above, an inconsistency between accumulating the loss over the whole run and resetting the averaging factor at each epoch;
- even with the points above fixed, the dividing factor for the average in the SG case is wrong: it should be the number of effective samples (a new variable I introduce), not the number of effective words, because the loss is incremented once for every positive example sampled for an effective word.
Additionally, I add the current value of the loss to the progress logger when compute_loss is set to True, and I add a parameter to the word2vec_standalone script to trigger loss reporting.
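The per-epoch averaging and the effective-samples counter described above can be illustrated with a minimal sketch. This is not gensim's actual code; the function and variable names (train_epoch_sg, effective_samples) are illustrative stand-ins, and the "pairs" input abstracts away the dot products that the real code computes from the word vectors:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_epoch_sg(pairs):
    """Toy skeleton: accumulate NCE loss over one epoch and average it.

    `pairs` is a list of (score_pos, [score_neg, ...]) dot products.
    The key point: the loss is added once per *sampled positive example*,
    so the averaging denominator must count effective samples, not
    effective words.
    """
    running_loss = 0.0       # reset at the start of every epoch
    effective_samples = 0    # new counter: one per positive sample drawn
    for score_pos, neg_scores in pairs:
        # negative-sampling (NCE-style) loss for one positive + its negatives
        running_loss -= math.log(sigmoid(score_pos))
        for s in neg_scores:
            running_loss -= math.log(sigmoid(-s))
        effective_samples += 1
    return running_loss / effective_samples

# With all scores at 0 (roughly uniform embeddings) and N=5 negatives,
# each sample contributes (N + 1) * -log(sigmoid(0)) = 6 * log(2).
print(train_epoch_sg([(0.0, [0.0] * 5)] * 3))
```

Because the counters live inside the epoch loop, the reported value is always the average over the current epoch only, and dividing by effective_samples (rather than a word count) keeps the average consistent with how often the loss is incremented.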
As a hint that the implementation is now correct, one can look at the first reported values of the loss, when the word embeddings are still roughly uniformly distributed. In this situation, the expected value of the NCE loss (for Skip-Gram) should be -(N+1)\log(\sigma(0)), which is 5.545 for N=7 and 14.556 for N=20; these match the loss values reported by the current implementation.
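The sanity-check values above can be reproduced directly, since \sigma(0) = 0.5 and so -(N+1)\log(\sigma(0)) = (N+1)\log 2:

```python
import math

def expected_initial_loss(n_negative):
    """Expected SG negative-sampling loss per sample when all dot
    products are ~0 (roughly uniform embeddings): -(N+1)*log(sigmoid(0))."""
    sigmoid_zero = 0.5  # sigmoid(0)
    return -(n_negative + 1) * math.log(sigmoid_zero)

print(round(expected_initial_loss(7), 3))   # 5.545
print(round(expected_initial_loss(20), 3))  # 14.556
```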
Thanks @alreadytaikeune. The training loss is a recent addition and may contain errors. What is your use case for it? How did you find out about this issue?
Re. your PR: what are the performance implications of your changes? Can you post a before/after benchmark?
Thanks for your reply. I started digging into this because I needed to compare the evolution and values of the loss between two word2vec implementations. That required understanding exactly how the loss is computed in gensim, which led me to uncover these issues and propose these changes. Are the issues I pointed out clear to you? I don't know if I expressed them well enough.
As for the benchmark, I haven't really done one yet, since the changes are very minor and should not impact performance, but I'll do one if needed. Do you have standard tools for this purpose that I can use, or do I need to write my own?
Finally, it seems the documentation has a hard time building. From what I understand, the cause is these lines in word2vec_standalone.py:
-loss
If present, the loss will be computed and printed during training
The problem is that there is no type after -loss, because it is only a flag. I am not sure what the Sphinx syntax is for this kind of thing. Any idea?
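For context, a typeless flag like this is typically declared with argparse's store_true action, which takes no value and therefore has no type to document. This is a minimal illustrative sketch, not the actual word2vec_standalone.py code:

```python
import argparse

parser = argparse.ArgumentParser(description="word2vec training (excerpt)")
# A boolean flag: no value follows it on the command line, hence no type.
parser.add_argument(
    "-loss", action="store_true",
    help="If present, the loss will be computed and printed during training",
)

args = parser.parse_args(["-loss"])
print(args.loss)  # True; defaults to False when the flag is absent
```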
@piskvorky Everything seems to be fixed now.
Thanks! Can you run the benchmark too? Simply train a model on the text9 corpus with/without this PR, using 4 parallel workers and computing / not computing the loss. Then post the four logs (old+loss, old-loss, new+loss, new-loss) to gist and add the links here in a comment.
There'll be some minor variance in the timings, even with same parameters, that's expected. But we're interested in a sanity check here (<<10% difference).
@piskvorky
I have run the tests you requested in the following environment:
FROM python:2
RUN mkdir -p /home/gensim_user
WORKDIR /home/gensim_user
RUN apt-get update && apt-get install -y wget build-essential
RUN pip install numpy scipy six smart_open
Using the script below:
#! /bin/bash
mkdir -p data
cur_dir=$(pwd)
if [ ! -f data/text9 ]; then
if [ ! -f data/enwik9.zip ]; then
wget -c http://mattmahoney.net/dc/enwik9.zip -P data
fi
if [ ! -f data/enwik9 ]; then
unzip data/enwik9.zip -d data
fi
perl wikifil.pl data/enwik9 > data/text9
fi
GENSIM=/home/gensim_user/gensim
ARGS="-train data/text9 -output /tmp/test -window 5 -negative 5 -threads 4 -min_count 5 -iter 5 -cbow 0"
SCRIPT=/usr/local/lib/python2.7/site-packages/gensim-3.5.0-py2.7-linux-x86_64.egg/gensim/scripts/word2vec_standalone.py
if [ -d $GENSIM ]; then
rm -r $GENSIM
fi
cd /home/gensim_user/ && git clone https://github.com/RaRe-Technologies/gensim && cd gensim &&\
git checkout 96444a7d0357fc836641ec32c65eb2fbffbee68d && python setup.py install
cd $cur_dir
echo "RUNNING WORD2VEC BASE NO LOSS"
python $SCRIPT $ARGS 2>&1 | tee logs_base_word2vec_no_loss
pip uninstall -y gensim
cd /home/gensim_user/ && git clone https://github.com/alreadytaikeune/gensim && cd gensim && git checkout develop && python setup.py install
cd $cur_dir
echo "RUNNING WORD2VEC NEW NO LOSS"
python $SCRIPT $ARGS 2>&1 |tee logs_new_word2vec_no_loss
echo "RUNNING WORD2VEC NEW+LOSS"
python $SCRIPT $ARGS -loss 2>&1 |tee logs_new_word2vec_loss
The results of the run can be found here: (Base no loss): https://gist.github.com/alreadytaikeune/554f21e95aa73ed414482d07b2e6314b (New no loss): https://gist.github.com/alreadytaikeune/a184e88c059e038e2023170bb29a1eb7 (New loss): https://gist.github.com/alreadytaikeune/62ac3ca2da0d5d860404f0993acc81ac
I've only run each configuration once, because it takes a while. If we need more solid evidence, we should probably settle on a smaller corpus and average over more runs. In any case, computing the loss appears to add around 3% to the total runtime, while the runtime without loss computation is unaffected (ignoring small variations between runs), as expected.
I haven't benchmarked the base code with loss computation enabled, because if we agree that the previous computation is flawed, I don't really see the relevance. Maybe we should discuss this point further to double-check my reasoning and implementation.
Thanks a lot @alreadytaikeune ! That sounds good to me.
Let's wait for @menshikh-iv 's final verdict & merge, once he gets back from holiday.
I have made the changes you requested, but I am a bit surprised by the test outcomes. They all run smoothly on my machine, and most of them pass in Travis, except the py35-win one, for a reason that doesn't seem related to my changes at all.
@alreadytaikeune thanks, looks good!
Next steps:
- Wait for #2127 to be merged (this is almost done, I hope it will be merged soon)
- Resolve merge-conflicts
- Add same calculations for new training mode (from #2127)
- Merge current PR
@alreadytaikeune I'll ping you when we'll be done with #2127.
Hi @alreadytaikeune, we finally merged & release #2127, please do the steps mentioned in https://github.com/RaRe-Technologies/gensim/pull/2135#issuecomment-413075975 and I'll merge your PR too :star2:
Hi, I've completed the steps you mentioned, but when running the tests it seems to me that one is hanging, in the doc2vec test set. I don't really have a clue why, it wasn't the case before the merge.
@alreadytaikeune I see no commits after my comment, and there is a merge conflict in the PR; please make the needed changes (or maybe you forgot to push?). As for any test hangs: please reproduce them here first.
Hey, sorry for the delay, I was on holiday. Yes, I hadn't pushed my changes; I wanted to test them locally first. But here they are.
Ah it seems the CI server experiences the same stalling issues as me. I am not sure when I will have time to investigate that though...
@alreadytaikeune CI issues fixed, hang reproduced in Travis, do you have time to fix & finalize?
I've found the bug, but I messed up a little bit when rebasing. I'm on it. I'll keep you posted.
Ok @menshikh-iv, it seems good now. The checks are passing and the loss values look legit. I guess some further development will be required so that other models can also have their loss values logged, but at least there's no API incompatibility.
@menshikh-iv Hi, what's the status now? Do you have an idea where the conflicts come from and how to solve them? It seems to me that we could just regenerate these files with Cython, but I don't know if that is the principled way.
@alreadytaikeune Hi, I'll be taking care of this PR from now. Please give me a few days to read through it.
Hi @mpenkov thanks for your feedback. I've incorporated some changes. Although I am a bit clueless about the reasons for the conflicts here. How should I proceed about solving them? Is there a preferred way?
About testing, I agree it may be a good idea, although it seems quite hard to do. So far, the only thing I've done to convince myself of the loss values I now get is the sanity check I explained in my very first message. Maybe we could write a test around the same ideas.
Don't worry about the conflicts. I'll take care of them during the merge: they are in autogenerated code, so it's easy to resolve.
Regarding tests: I think the actual loss output may be difficult to test (and not worth it) because it's being output by logs. What is worth unit testing is the new behavior you added to the training functions (counting and returning the effective number of samples). Can you add tests to cover that new behavior?
For example:
- word2vec.py:train_batch_sg
- word2vec.py:train_batch_cbow
- word2vec.py:score_sentence_sg
- word2vec.py:_do_train_job
- ... and their counterparts in the Cython code.
Use the same testing logic and data for both the Python and Cython versions (they should be 100% compatible).
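The parity-testing idea suggested above can be sketched as follows. These are toy stand-ins, not gensim's actual train_batch_sg / Cython functions: two independent implementations of the same quantity (here, the number of (center, context) positive samples a batch yields) are run against shared test data and must agree:

```python
def count_effective_samples_py(sentences, window=2):
    """Reference ('pure Python') stand-in: one positive sample per
    (center word, context word) pair within the window."""
    total = 0
    for sent in sentences:
        for pos in range(len(sent)):
            left = sent[max(0, pos - window):pos]
            right = sent[pos + 1:pos + 1 + window]
            total += len(left) + len(right)
    return total

def count_effective_samples_fast(sentences, window=2):
    """Stand-in for the optimized (Cython) version; it must agree with
    the reference implementation on all inputs."""
    return sum(
        min(pos, window) + min(len(sent) - pos - 1, window)
        for sent in sentences for pos in range(len(sent))
    )

# Shared test data, fed to both implementations.
sentences = [["a", "b", "c", "d"], ["x", "y"]]
for impl in (count_effective_samples_py, count_effective_samples_fast):
    print(impl(sentences))  # both print 12
```

A real test would call gensim's batch-training functions on a small model and assert that the Python and Cython paths return the same effective-sample counts and losses.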
@alreadytaikeune What is the status of this PR? Are you able to finish it?
@alreadytaikeune What do you think of the current status of this PR? best regards from austria
Hello @mpenkov and @tridelt. Sorry for not being more involved in this PR. I've had plenty on my plate in my job, and in my personal life as well, and didn't feel like spending free/family time working on tests. Although I'm not comfortable with leaving this as is either. I'll do my best in the coming days to address your comments. Sorry for not handling this earlier. Best.
Hello, @mpenkov. I found a bit of time to throw myself back into the PR. I thought I would rebase my develop branch onto the current one and work on the tests from there, but it was not a good idea... anyway, I've restored my branch to its previous state.
Just one thing: when rebasing, it seemed to me that all the non-Cython implementations of word2vec were removed. Is that correct? Regarding your last comment about checks, I imagine I should now only care about writing tests for the Cython functions.
Sorry about the mess...
@alreadytaikeune Yes, the non-Cython variants of many of the algorithms have been eliminated to reduce duplication/maintenance effort, so work going forward need only concern itself with the Cython versions!
Alright thank you @gojomo . I also see that word2vec.py became a script as well as a module, and it does the same thing as scripts/word2vec_standalone.py, the file I had originally modified. Which one should I focus on?
AFAIK, word2vec.py has always implemented a main & been runnable as a script. Unsure of the reason for (partial) overlap with scripts/word2vec_standalone.py – & neither version has been updated recently to keep-up with newer Word2Vec initialization options (like ns_exponent or corpus_file). I suspect gensim usage is far more via imported code (vs command-line invocation), and new functions only get added to the main paths when someone specifically needs them. @piskvorky or @mpenkov would have to comment on whether both, or just one or the other, should be updated going forward.
@gojomo is right. I'm not aware of any changes to the CLI version of word2vec in Gensim. IIRC, the command line version was created to match the CLI of the original C tool by Google, including the CLI parameters names and defaults. I don't think it's been updated since.
I didn't remember what word2vec_standalone.py was. I tracked that script down to PR #593, which doesn't talk about its motivation. But a standalone script makes more sense to me, and is also cleaner (for example avoids pickling issues when run as __main__).
So my preference would be to get rid of __main__ in word2vec.py, and keep word2vec_standalone.py. We definitely don't need both.
Thank you for the clarification. Going forward, I'll just write tests for the Cython-only implementations (which is a bit confusing, since I'm working on a version that still has both implementations, which is why I wanted to rebase in the first place), and leave it to you to decide what should be kept and dropped between word2vec.py and word2vec_standalone.py.
I was wondering about the status of this PR. Will it be fixed in gensim 4.0.0? Does the erroneous loss computation affect only Word2Vec, or other classes too (e.g., FastText)? Thanks in advance.