
Where is the Expectation?

Open · zetyquickly opened this issue on Jun 28, 2018 · 5 comments

(Screenshot of the skip-gram negative-sampling objective from Mikolov et al.: $\log \sigma({v'_{w_O}}^\top v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\big[\log \sigma(-{v'_{w_i}}^\top v_{w_I})\big]$)

In this formula we have an expectation over $w_i$. That means for each pair $(w_I, w_O)$ we should compute this expectation. But as far as I can see in your code, you sample n_negs negative samples for each pair $(w_I, w_O)$. Wouldn't it be more correct to draw $N$ samples of $w_i$ for each of the n_negs terms, take the empirical mean of the expression in square brackets, and only then accumulate the n_negs means?
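In symbols (my notation, with $N$ denoting the number of draws per negative slot), the suggestion is to estimate each expectation with an empirical mean over $N$ draws from the noise distribution $P_n(w)$:

$\mathbb{E}_{w_i \sim P_n(w)}\big[\log \sigma(-{v'_{w_i}}^\top v_{w_I})\big] \approx \frac{1}{N} \sum_{j=1}^{N} \log \sigma(-{v'_{w_{i,j}}}^\top v_{w_I}), \qquad w_{i,j} \sim P_n(w)$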

zetyquickly · Jun 28 '18 16:06

That would give a more precise loss value; however, I'm afraid it might consume too much training time. I quickly wrote up sample code for your idea. With an appropriate self.n_samples on the model, change the lines at https://github.com/theeluwin/pytorch-sgns/blob/master/model.py#L66 to:

if self.weights is not None:
    # draw n_samples noise words for each of the n_negs negative slots, per context position
    nwords = t.multinomial(self.weights, batch_size * context_size * self.n_negs * self.n_samples, replacement=True).view(batch_size, -1)
else:
    nwords = FT(batch_size, context_size * self.n_negs * self.n_samples).uniform_(0, self.vocab_size - 1).long()
ivectors = self.embedding.forward_i(iword).unsqueeze(2)  # (batch, dim, 1)
ovectors = self.embedding.forward_o(owords)              # (batch, context_size, dim)
nvectors = self.embedding.forward_o(nwords).neg()        # (batch, context_size * n_negs * n_samples, dim)
oloss = t.bmm(ovectors, ivectors).squeeze().sigmoid().log().mean(1)
# empirical mean over the n_samples draws (dim 3), then sum over negatives (dim 2) and mean over context (dim 1)
nloss = t.bmm(nvectors, ivectors).squeeze().sigmoid().log().view(-1, context_size, self.n_negs, self.n_samples).mean(3).sum(2).mean(1)

theeluwin · Jun 30 '18 12:06

I completely agree with you that this will make training slower. My colleague and I concluded that your approach also leads to convergence. Still, I wish there were some notes about this formula and its implementation in the README or in the code. Thank you

zetyquickly · Jul 01 '18 17:07

If you have a good idea for the implementation, please go ahead and open a PR. Otherwise, I'll close this issue. Thank you.

theeluwin · Jul 02 '18 08:07

Thanks for publishing this repo. I have found it very useful in improving the performance of my own word2vec implementation.

On this particular issue, I note that mean(), sum(), log(), and sigmoid() are all continuous and monotonically-increasing functions of their inputs. Thus, barring any issues with numerical stability, minimising:

-(t.bmm(ovectors, ivectors).squeeze().sigmoid().log().mean(1) + t.bmm(nvectors, ivectors).squeeze().sigmoid().log().view(-1, context_size, self.n_negs).sum(2).mean(1))

is equivalent to minimising:

-(t.bmm(ovectors, ivectors).mean() + t.bmm(nvectors, ivectors).mean())

Given that all of the vectors start out 'small', and do not become excessively large in the course of training, numerical stability does not seem to be an issue. (Indeed, if there were some problem with stability I suspect it might arise anyway, since each function is computed successively, rather than using some numerically-stable compound implementation in the manner of nn.NLLLoss().) So, even though the loss function is 'wrong', it has the same argmin as the 'correct' loss function, which is all we really care about.

The same argument should apply to the 'improved' computation. Ultimately, the order of application of mean() and sum() operations makes no difference to the location of the minimum of the loss function. So, in terms of the optimisation, all you are doing is increasing the number of nwords. But so long as you have 'enough' negative samples, you should be fine - as Mikolov et al. say in their paper, 'we are free to simplify NCE as long as the vector representations retain their quality.'
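To make that concrete, here is a rough sketch of what the forward pass looks like with the simplified loss. This is my own reading, not the repo's actual code; it assumes the same module layout and attribute names as model.py (embedding.forward_i / forward_o, self.weights, self.n_negs, self.vocab_size).

import torch as t

def simplified_forward(self, iword, owords):
    # sketch only: same sampling logic as model.py, but without sigmoid()/log()
    batch_size = iword.size(0)
    context_size = owords.size(1)
    if self.weights is not None:
        nwords = t.multinomial(self.weights, batch_size * context_size * self.n_negs, replacement=True).view(batch_size, -1)
    else:
        nwords = t.randint(0, self.vocab_size, (batch_size, context_size * self.n_negs))
    ivectors = self.embedding.forward_i(iword).unsqueeze(2)   # (batch, dim, 1)
    ovectors = self.embedding.forward_o(owords)               # (batch, context, dim)
    nvectors = self.embedding.forward_o(nwords).neg()         # (batch, context * n_negs, dim)
    # raw dot products only: same argmin as the sigmoid/log version, per the argument above
    oscore = t.bmm(ovectors, ivectors).squeeze(2).mean(1)     # (batch,)
    nscore = t.bmm(nvectors, ivectors).squeeze(2).mean(1)     # (batch,)
    return -(oscore + nscore).mean()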

I have tried the above simplification in my own implementation. It seems to work, in the sense that it converges and produces a sensible-looking result, although I have not done any testing to check that it produces the same embeddings (all else being equal). However, the speed-up is hardly worth it - about 5% for small vectors, e.g. around 25 elements, but with much smaller relative benefits for larger vectors, since the matrix multiplications dominate the computation time. The advantage of retaining the log() and sigmoid() functions is that the magnitude of the loss function is about the same, regardless of the parameters of the model, rather than being, e.g., roughly proportional to the vector size.

Incidentally, as far as I can tell from the original word2vec code (https://github.com/dav/word2vec/blob/9b8b58001ba5d58babe1c62327a8501b62cd6179/src/word2vec.c#L529) they use a fixed number of negative samples (just five by default), and it looks like they compute the sigmoid() function (by table lookup), but not the log().
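Roughly, that table trick amounts to the following (a Python sketch, not a line-for-line translation; the constants EXP_TABLE_SIZE = 1000 and MAX_EXP = 6 are taken from the C source):

import numpy as np

EXP_TABLE_SIZE, MAX_EXP = 1000, 6

# exp_table[i] = sigmoid of a point on an evenly spaced grid over [-MAX_EXP, MAX_EXP)
_x = (np.arange(EXP_TABLE_SIZE) / EXP_TABLE_SIZE * 2 - 1) * MAX_EXP
exp_table = np.exp(_x) / (np.exp(_x) + 1)

def fast_sigmoid(f):
    # inputs beyond +/- MAX_EXP are treated as 1 or 0, as in the C code
    if f >= MAX_EXP:
        return 1.0
    if f <= -MAX_EXP:
        return 0.0
    return exp_table[int((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]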

msummerfield · Sep 20 '18 09:09

@msummerfield Thanks for the detailed feedback! Awesome. The idea of using the 'faster' loss sounds sensible. The main reason I retained all the details is that the overall loss then remains mathematically correct (e.g., a loss of 1 corresponds to a prediction probability of about 36.8%), but yes, the long chain of operations might suffer from numerical issues, since I've never paid attention to those. Again, thanks for the awesome feedback. It inspired me in many ways, so I hope it gets posted in some other space (like a blog) too.
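(For the record, that number is just $-\ln p = 1 \Rightarrow p = e^{-1} \approx 0.368$ for a single positive pair under the per-pair loss $-\ln \sigma(\cdot)$.)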

theeluwin · Nov 01 '18 12:11