word2vec-norm-experiments
                                
If the meaningless vector is not at zero, what is?
We calculate a preimage of the zero vector. Note that in word2vec CBOW the word vectors of the context words are averaged -- equivalently, the network input is a probability distribution over the vocabulary, not a word count vector. So we look for a preimage that can itself be interpreted as a probability distribution, i.e. one with non-negative entries summing to one.
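To see the equivalence concretely, here is a toy sketch (made-up embedding matrix, sizes and context indices, not part of the experiment) checking that the average of the context word vectors equals the transposed embedding matrix applied to the context's empirical distribution over the vocabulary.
import numpy as np

# Toy check: averaging context vectors == applying E^T to a probability distribution.
vocab_size, dim = 5, 3
E = np.random.RandomState(0).randn(vocab_size, dim)    # one row per word (made-up values)

context = [1, 3, 3, 4]                   # indices of the context words
avg = E[context].mean(axis=0)            # what CBOW feeds forward

p = np.bincount(context, minlength=vocab_size) / float(len(context))
assert np.allclose(avg, E.T.dot(p))      # identical: E^T applied to a distribution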
We use non-negative least squares (NNLS) to compute such a preimage. The constraint that the entries sum to 1 is imposed by appending a row of 1s to the (transposed) word vector matrix and appending a component with value 1 to the zero vector whose preimage we want.
The result is that the desired preimage is essentially one-hot on </s>, which carries 99% of the mass.
import numpy as np
import pandas as pd
from scipy.optimize import nnls

# `vectors` (one word vector per row) and `vocab` are assumed loaded earlier in the notebook.
A = np.array(vectors).transpose()                       # shape: dimension x vocabulary
b = np.zeros((A.shape[0],))                             # we want a preimage of the zero vector
Aaug = np.concatenate((A, np.ones((1, A.shape[1]))))    # extra row of 1s ...
baug = np.concatenate((b, np.ones((1,))))               # ... and target 1: entries must sum to 1
p, residual = nnls(Aaug, baug)
print p.sum(), residual
ps = pd.Series(p, index=vocab)
ps.order(ascending=False)  # 99% of the mass is on </s>
</s>               0.990392
who                0.000749
health             0.000497
lacked             0.000493
pliny              0.000438
arcing             0.000425
did                0.000414
locomotives        0.000376
nucleic            0.000299
cú                 0.000275
ago                0.000237
jpg                0.000237
The result is the same over multiple runs; this is expected, since scipy's nnls implementation is deterministic, so the result does not depend on any random initialisation of the solver.
The vector for </s> does not change from its initialisation
Words are delimited by spaces, tabs and linefeeds (\n). While spaces and tabs are treated as delimiters and otherwise ignored, a linefeed is treated as a delimiter and is additionally transcribed as the symbol </s>:
// Reads a single word from a file, assuming space + tab + EOL to be word boundaries
void ReadWord(char *word, FILE *fin) {
  int a = 0, ch;
  while (!feof(fin)) {
    ch = fgetc(fin);
    if (ch == 13) continue; // this is the windows carriage return, just skip over it
    if ((ch == ' ') || (ch == '\t') || (ch == '\n')) {
      if (a > 0) {
        if (ch == '\n') ungetc(ch, fin);
        break;
      }
      if (ch == '\n') {
        strcpy(word, (char *)"</s>");
        return;
      } else continue;
    }
    word[a] = ch;
    a++;
    if (a >= MAX_STRING - 1) a--;   // Truncate too long words
  }
  word[a] = 0;
}
When the vocabulary is sorted, </s> is kept at position 0 (word2vec sorts only the entries from index 1 onwards).
When it comes time to delimit sentences (blocks of text), however, </s> is NOT included in the sentence:
    if (sentence_length == 0) {
      while (1) {
        word = ReadWordIndex(fi);
        if (feof(fi)) break;
        if (word == -1) continue;
        word_count++;
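        // word index 0 is </s>: the sentence ends here, without </s> being added to sen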
        if (word == 0) break;
        // The subsampling randomly discards frequent words while keeping the ranking same
        if (sample > 0) {
          real ran = (sqrt(vocab[word].cn / (sample * train_words)) + 1) * (sample * train_words) / vocab[word].cn;
          next_random = next_random * (unsigned long long)25214903917 + 11;
          if (ran < (next_random & 0xFFFF) / (real)65536) continue;
        }
        sen[sentence_length] = word;
        sentence_length++;
        if (sentence_length >= MAX_SENTENCE_LENGTH) break;
      }
      sentence_position = 0;
    }
Thus the word vector for </s> is never updated, and hence retains its initial value. Its closeness to zero, then, is just an artefact of the initialisation. See http://building-babylon.net/2015/07/13/word2vec-weight-initialisation/ for details (we worked here in dimension 100, where the vector for </s> has a norm of 0.0279).
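As a sanity check on that figure: word2vec initialises each coordinate of a word vector uniformly in [-0.5/dim, 0.5/dim), so in dimension 100 the expected norm is about sqrt(1 / (12 * 100)) ≈ 0.029, in line with the 0.0279 observed. A quick standalone check:
import numpy as np

# Each coordinate of syn0 is drawn uniformly from [-0.5/dim, 0.5/dim), so the
# expected squared norm is dim * (1/dim)**2 / 12 = 1 / (12 * dim).
dim = 100
print np.sqrt(1.0 / (12 * dim))                                # ~0.0289
samples = np.random.uniform(-0.5 / dim, 0.5 / dim, (10000, dim))
print np.linalg.norm(samples, axis=1).mean()                   # ~0.029 empirically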
Since it is never updated, it is not a meaningful word vector, and we should exclude it from the NNLS computation above.
Excluding </s>
A = np.array(vectors.iloc[1:]).transpose()              # drop the first row, i.e. the </s> vector
b = np.zeros((A.shape[0],))
Aaug = np.concatenate((A, np.ones((1, A.shape[1]))))    # sum-to-1 constraint, as before
baug = np.concatenate((b, np.ones((1,))))
p, residual = nnls(Aaug, baug)
print 'sum %f' % p.sum()
print 'error %f' % residual
ps = pd.Series(p, index=vocab[1:])
print ps.order(ascending=False)
sum 0.984734
error 0.123556
JETS_8             0.041570
FROM_13            0.041390
ebs                0.038148
faris              0.032459
perennially        0.031211
dearth             0.030563
tropic             0.026512
excepting          0.026035
focussing          0.024341
donut              0.020728
submerge           0.020249
multitudes         0.019515
predictably        0.019247
toppers            0.018824
cohorts            0.018350
...
We ran this twice and obtained the same results. Only 0.16% of the words receive non-zero weight, and the result is difficult to interpret.
In the case of negative sampling, it might be reasonable to expect the noise distribution used for sampling to map to the zero vector. But the meaningless vector is also not at zero when hierarchical softmax is used, so this can't be the full story.
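One way to probe this would be to push the noise distribution itself through the word vector matrix and measure how far from zero its image lands. A sketch, assuming a Series `counts` of vocabulary frequencies aligned with `vectors` (not defined in this section); word2vec's negative sampling draws from the unigram distribution raised to the power 0.75.
import numpy as np

# Sketch only: `counts` (word frequencies, indexed like `vectors`) is assumed,
# not defined above. The negative-sampling noise distribution is the unigram
# distribution raised to the 0.75 power, renormalised.
A = np.array(vectors.iloc[1:]).transpose()      # exclude </s>, as before
noise = counts.iloc[1:] ** 0.75
noise = noise / noise.sum()
print np.linalg.norm(A.dot(noise.values))       # distance of the noise distribution's image from 0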