gensim
Fix computation of `Word2Vec` loss & add loss value to logging string
The current computation of the word2vec loss has flaws; I address them in this PR.
This PR rewrites the loss computation for both CBOW and SG Word2Vec. The value that is computed and reported is the running average NCE loss within the current epoch: at the start of each new epoch, the counters are reset to 0 and a new average is computed. This was not the case before: the loss was accumulated over the whole training run, while the total_examples used for averaging was only incremented within one epoch, which made the reported value uninformative as well as incorrect in the implementation. Moreover, reporting the average loss over an epoch seems closer to the spirit of what the original code was trying to achieve.
The computation of the word2vec loss was flawed in several ways:
- a race condition on the running_training_loss attribute, which is updated concurrently in a GIL-free portion of the code;
- as mentioned above, an inconsistency between accumulating the loss over the whole run and resetting the averaging factor at each epoch;
- even with the points above fixed, the dividing factor for the average in the SG case is wrong: it should be the number of effective samples (a new variable I introduce), not the number of effective words, because the loss is incremented once for every positive example sampled for an effective word.
Additionally, I add the current value of the loss to the progress logger when compute_loss is set to True, and I add a parameter to the word2vec_standalone script to trigger loss reporting.
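The per-epoch averaging and the effective-samples counter described above can be illustrated with a minimal sketch. This is not gensim's actual code; the function and variable names (train_epoch_sg, effective_samples) are illustrative stand-ins, and the "pairs" input abstracts away the dot products that the real code computes from the word vectors:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_epoch_sg(pairs):
    """Toy skeleton: accumulate NCE loss over one epoch and average it.

    `pairs` is a list of (score_pos, [score_neg, ...]) dot products.
    The key point: the loss is added once per *sampled positive example*,
    so the averaging denominator must count effective samples, not
    effective words.
    """
    running_loss = 0.0       # reset at the start of every epoch
    effective_samples = 0    # new counter: one per positive sample drawn
    for score_pos, neg_scores in pairs:
        # negative-sampling (NCE-style) loss for one positive + its negatives
        running_loss -= math.log(sigmoid(score_pos))
        for s in neg_scores:
            running_loss -= math.log(sigmoid(-s))
        effective_samples += 1
    return running_loss / effective_samples

# With all scores at 0 (roughly uniform embeddings) and N=5 negatives,
# each sample contributes (N + 1) * -log(sigmoid(0)) = 6 * log(2).
print(train_epoch_sg([(0.0, [0.0] * 5)] * 3))
```

Because the counters live inside the epoch loop, the reported value is always the average over the current epoch only, and dividing by effective_samples (rather than a word count) keeps the average consistent with how often the loss is incremented.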
As a hint that the implementation is now correct, one can look at the first reported values of the loss, when the word embeddings are still roughly uniformly distributed. In this situation, the expected value of the NCE loss (for Skip-Gram) should be -(N+1)\log(\sigma(0)), which is 5.545 for N=7 and 14.556 for N=20; these match the loss values reported by the current implementation.
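The sanity-check values above can be reproduced directly, since \sigma(0) = 0.5 and so -(N+1)\log(\sigma(0)) = (N+1)\log 2:

```python
import math

def expected_initial_loss(n_negative):
    """Expected SG negative-sampling loss per sample when all dot
    products are ~0 (roughly uniform embeddings): -(N+1)*log(sigmoid(0))."""
    sigmoid_zero = 0.5  # sigmoid(0)
    return -(n_negative + 1) * math.log(sigmoid_zero)

print(round(expected_initial_loss(7), 3))   # 5.545
print(round(expected_initial_loss(20), 3))  # 14.556
```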
Thanks @alreadytaikeune. The training loss is a recent addition and may contain errors. What is your use case for it? How did you find out about this issue?
Re. your PR: what are the performance implications of your changes? Can you post a before/after benchmark?
Thanks for your reply. I started digging into this because I needed to compare the evolution and values of the loss between two word2vec implementations. That required understanding exactly how the loss is computed in gensim, which led me to uncover these issues and propose these changes. Are the issues I pointed out clear to you? I don't know if I expressed them well enough.
As for the benchmark, I haven't really done one yet, since the changes are very minor and should not impact performance, but I'll do one if needed. Do you have standard tools for this purpose that I can use, or do I need to write my own?
Finally, it seems the documentation has a hard time building. From what I understand, the cause is these lines in word2vec_standalone.py:
-loss
If present, the loss will be computed and printed during training
The problem is that there is no type after -loss, because it is only a flag. I am not sure what the Sphinx syntax is for this kind of thing. Any idea?
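For context, a typeless flag like this is typically declared with argparse's store_true action, which takes no value and therefore has no type to document. This is a minimal illustrative sketch, not the actual word2vec_standalone.py code:

```python
import argparse

parser = argparse.ArgumentParser(description="word2vec training (excerpt)")
# A boolean flag: no value follows it on the command line, hence no type.
parser.add_argument(
    "-loss", action="store_true",
    help="If present, the loss will be computed and printed during training",
)

args = parser.parse_args(["-loss"])
print(args.loss)  # True; defaults to False when the flag is absent
```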
@piskvorky Everything seems to be fixed now.
Thanks! Can you run the benchmark too? Simply train a model on the text9 corpus with/without this PR, using 4 parallel workers and computing / not computing the loss. Then post the four logs (old+loss, old-loss, new+loss, new-loss) to gist and add the links here in a comment.
There'll be some minor variance in the timings, even with same parameters, that's expected. But we're interested in a sanity check here (<<10% difference).
@piskvorky
I have run the tests you requested in the following environment:
FROM python:2
RUN mkdir -p /home/gensim_user
WORKDIR /home/gensim_user
RUN apt-get update && apt-get install -y wget build-essential
RUN pip install numpy scipy six smart_open
Using the script below:
#! /bin/bash
mkdir -p data
cur_dir=$(pwd)
if [ ! -f data/text9 ]; then
if [ ! -f data/enwik9.zip ]; then
wget -c http://mattmahoney.net/dc/enwik9.zip -P data
fi
if [ ! -f data/enwik9 ]; then
unzip data/enwik9.zip -d data
fi
perl wikifil.pl data/enwik9 > data/text9
fi
GENSIM=/home/gensim_user/gensim
ARGS="-train data/text9 -output /tmp/test -window 5 -negative 5 -threads 4 -min_count 5 -iter 5 -cbow 0"
SCRIPT=/usr/local/lib/python2.7/site-packages/gensim-3.5.0-py2.7-linux-x86_64.egg/gensim/scripts/word2vec_standalone.py
if [ -d $GENSIM ]; then
rm -r $GENSIM
fi
cd /home/gensim_user/ && git clone https://github.com/RaRe-Technologies/gensim && cd gensim &&\
git checkout 96444a7d0357fc836641ec32c65eb2fbffbee68d && python setup.py install
cd $cur_dir
echo "RUNNING WORD2VEC BASE NO LOSS"
python $SCRIPT $ARGS 2>&1 | tee logs_base_word2vec_no_loss
pip uninstall -y gensim
cd /home/gensim_user/ && git clone https://github.com/alreadytaikeune/gensim && cd gensim && git checkout develop && python setup.py install
cd $cur_dir
echo "RUNNING WORD2VEC NEW NO LOSS"
python $SCRIPT $ARGS 2>&1 |tee logs_new_word2vec_no_loss
echo "RUNNING WORD2VEC NEW+LOSS"
python $SCRIPT $ARGS -loss 2>&1 |tee logs_new_word2vec_loss
The results of the run can be found here: (Base no loss): https://gist.github.com/alreadytaikeune/554f21e95aa73ed414482d07b2e6314b (New no loss): https://gist.github.com/alreadytaikeune/a184e88c059e038e2023170bb29a1eb7 (New loss): https://gist.github.com/alreadytaikeune/62ac3ca2da0d5d860404f0993acc81ac
I've only run each configuration once, because it takes a while. If we need more solid evidence, we should probably settle on a smaller corpus and average over more runs. In any case, computing the loss appears to add around 3% to the total runtime, while the runtime without loss computation is unaffected (ignoring small variations between runs), as expected.
I haven't benchmarked the base code with loss computation enabled, because if we agree that the previous computation is flawed, I don't really see the relevance. Maybe we should discuss this point further to double-check my reasoning and implementation.
Thanks a lot @alreadytaikeune ! That sounds good to me.
Let's wait for @menshikh-iv 's final verdict & merge, once he gets back from holiday.
I have made the changes you requested, but I am a bit surprised by the test outcomes. They all run smoothly on my machine, and most of them pass in Travis, except the py35-win one, for a reason that doesn't seem related to my changes at all.
@alreadytaikeune thanks, looks good!
Next steps:
- Wait for #2127 to be merged (this is almost done, I hope it will be merged soon)
- Resolve merge-conflicts
- Add same calculations for new training mode (from #2127)
- Merge current PR
@alreadytaikeune I'll ping you when we'll be done with #2127.
Hi @alreadytaikeune, we finally merged & release #2127, please do the steps mentioned in https://github.com/RaRe-Technologies/gensim/pull/2135#issuecomment-413075975 and I'll merge your PR too :star2:
Hi, I've completed the steps you mentioned, but when running the tests it seems to me that one is hanging, in the doc2vec test set. I don't really have a clue why, it wasn't the case before the merge.
@alreadytaikeune I see no commits after my comment, and there is a merge conflict in the PR; please make the needed changes (or maybe you forgot to push?). As for any test hangs: please reproduce them here first.
Hey, sorry for the delay, I was on holiday. Yes, I hadn't pushed my changes; I wanted to test them locally first. But here they are.
Ah it seems the CI server experiences the same stalling issues as me. I am not sure when I will have time to investigate that though...
@alreadytaikeune CI issues fixed, hang reproduced in Travis, do you have time to fix & finalize?
I've found the bug, but I messed up a little bit when rebasing. I'm on it. I'll keep you posted.
Ok @menshikh-iv, it seems good now. The checks are passing and the loss values look legit. I guess some further development will be required so that other models can also have their loss values logged, but at least there's no API incompatibility.
@menshikh-iv Hi, what's the status now? Do you have an idea where the conflicts come from and how to solve them? It seems to me that we could just regenerate these files with Cython, but I don't know if that is the principled way.
@alreadytaikeune Hi, I'll be taking care of this PR from now. Please give me a few days to read through it.
Hi @mpenkov thanks for your feedback. I've incorporated some changes. Although I am a bit clueless about the reasons for the conflicts here. How should I proceed about solving them? Is there a preferred way?
About testing, I agree it may be a good idea, although it seems quite hard to do. So far, the only thing I've done to convince myself of the loss values I now get is the sanity check I explained in my very first message. Maybe we could write a test around the same ideas.
Don't worry about the conflicts. I'll take care of them during the merge: they are in autogenerated code, so it's easy to resolve.
Regarding tests: I think the actual loss output may be difficult to test (and not worth it) because it's being output by logs. What is worth unit testing is the new behavior you added to the training functions (counting and returning the effective number of samples). Can you add tests to cover that new behavior?
For example:
- word2vec.py:train_batch_sg
- word2vec.py:train_batch_cbow
- word2vec.py:score_sentence_sg
- word2vec.py:_do_train_job
- ... and their counterparts in the Cython code.
Use the same testing logic and data for both the Python and Cython versions (they should be 100% compatible).
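The parity-testing idea suggested above can be sketched as follows. These are toy stand-ins, not gensim's actual train_batch_sg / Cython functions: two independent implementations of the same quantity (here, the number of (center, context) positive samples a batch yields) are run against shared test data and must agree:

```python
def count_effective_samples_py(sentences, window=2):
    """Reference ('pure Python') stand-in: one positive sample per
    (center word, context word) pair within the window."""
    total = 0
    for sent in sentences:
        for pos in range(len(sent)):
            left = sent[max(0, pos - window):pos]
            right = sent[pos + 1:pos + 1 + window]
            total += len(left) + len(right)
    return total

def count_effective_samples_fast(sentences, window=2):
    """Stand-in for the optimized (Cython) version; it must agree with
    the reference implementation on all inputs."""
    return sum(
        min(pos, window) + min(len(sent) - pos - 1, window)
        for sent in sentences for pos in range(len(sent))
    )

# Shared test data, fed to both implementations.
sentences = [["a", "b", "c", "d"], ["x", "y"]]
for impl in (count_effective_samples_py, count_effective_samples_fast):
    print(impl(sentences))  # both print 12
```

A real test would call gensim's batch-training functions on a small model and assert that the Python and Cython paths return the same effective-sample counts and losses.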
@alreadytaikeune What is the status of this PR? Are you able to finish it?
@alreadytaikeune What do you think of the current status of this PR? best regards from austria
Hello @mpenkov and @tridelt. Sorry for not being more involved in this PR. I've had plenty on my plate in my job, and in my personal life as well, and didn't feel like spending free/family time working on tests. Although I'm not comfortable with leaving this as is either. I'll do my best in the coming days to address your comments. Sorry for not handling this earlier. Best.
Hello, @mpenkov. I found a bit of time to throw myself back into the PR. I thought I would rebase my develop branch onto the current one and work on the tests from there, but it was not a good idea... anyway, I've restored my branch to its previous state.
Just one thing: when rebasing, it seemed to me that all the non-Cython implementations of word2vec were removed. Is that correct? Regarding your last comment about checks, I imagine I should now only care about writing tests for the Cython functions.
Sorry about the mess...
@alreadytaikeune Yes, the non-Cython variants of many of the algorithms have been eliminated to reduce duplication/maintenance effort, so work going forward need only concern itself with the Cython versions!
Alright thank you @gojomo . I also see that word2vec.py became a script as well as a module, and it does the same thing as scripts/word2vec_standalone.py, the file I had originally modified. Which one should I focus on?
AFAIK, word2vec.py has always implemented a main & been runnable as a script. Unsure of the reason for (partial) overlap with scripts/word2vec_standalone.py – & neither version has been updated recently to keep-up with newer Word2Vec initialization options (like ns_exponent or corpus_file). I suspect gensim usage is far more via imported code (vs command-line invocation), and new functions only get added to the main paths when someone specifically needs them. @piskvorky or @mpenkov would have to comment on whether both, or just one or the other, should be updated going forward.
@gojomo is right. I'm not aware of any changes to the CLI version of word2vec in Gensim. IIRC, the command line version was created to match the CLI of the original C tool by Google, including the CLI parameters names and defaults. I don't think it's been updated since.
I didn't remember what word2vec_standalone.py was. I tracked that script down to PR #593, which doesn't talk about its motivation. But a standalone script makes more sense to me, and is also cleaner (for example avoids pickling issues when run as __main__).
So my preference would be to get rid of __main__ in word2vec.py, and keep word2vec_standalone.py. We definitely don't need both.
Thank you for the clarification. Going forward, I'll just write tests for the Cython-only implementations (which is a bit confusing, since I'm working on a version that still has both implementations, which is why I wanted to rebase in the first place), and leave it to you to decide what should be kept and dropped between word2vec.py and word2vec_standalone.py.
I was wondering about the status of this PR. Will it be fixed in gensim 4.0.0? Does the erroneous loss computation affect only Word2Vec, or other classes too (e.g., FastText)? Thanks in advance.