word2vec

Context across sentences, by mistake?

Open joelb-git opened this issue 7 years ago • 3 comments

SortVocab removes the sentence end marker "</s>" from index 0 of the vocab. I think the intent of the original word2vec code is that each newline is replaced with the "</s>" token, which sits at index 0 in the vocab, so that context does not cross sentence boundaries. Because of this problem, looking up "</s>" instead returns -1 (an OOV word), and each "sentence" ends up filling the maximum 1000-word buffer.
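To make the consequence concrete, here is a minimal toy sketch of the sentence-splitting behaviour described above. It is not code from either repository: `count_sentences` is a hypothetical helper, and it assumes only what the report states, namely that index 0 stands for "</s>" and ends a sentence, that -1 marks an OOV word that is skipped, and that a sentence is cut off at 1000 words.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SENTENCE_LENGTH 1000 /* buffer limit mentioned in the report */

/* Hypothetical helper (not from word2vec.c): counts the "sentences" formed
 * from a stream of vocab indices. Index 0 is assumed to be "</s>" and ends
 * the current sentence; -1 is an OOV word and is skipped. */
size_t count_sentences(const int *ids, size_t n) {
  size_t sentences = 0, len = 0;
  assert(ids != NULL || n == 0);
  for (size_t i = 0; i < n; i++) {
    if (ids[i] == -1) continue;      /* OOV: dropped, creates no boundary */
    if (ids[i] == 0) {               /* "</s>": close a non-empty sentence */
      if (len > 0) sentences++;
      len = 0;
      continue;
    }
    len++;
    if (len >= MAX_SENTENCE_LENGTH) { /* buffer full: forced boundary */
      sentences++;
      len = 0;
    }
  }
  if (len > 0) sentences++;          /* trailing words form a last sentence */
  return sentences;
}
```

With the stream {1, 2, 0, 3, 4, 0} this counts two sentences. If the marker lookup returns -1 instead, the same text becomes {1, 2, -1, 3, 4, -1} and collapses into a single sentence, so context windows run across the original line break.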

To demonstrate, I added printf statements before and after the call to SortVocab and ran the binary on a trivial input.

[~/views/word2vec (master *)]
$ git log | head -1
commit 80be14a89b260df5cfca19a65cbfe52ba15db7ba

$ git diff
diff --git a/src/word2vec.c b/src/word2vec.c
index 2f892ea..7bd6392 100644
--- a/src/word2vec.c
+++ b/src/word2vec.c
@@ -309,7 +309,11 @@ void LearnVocabFromTrainFile() {
     } else vocab[i].cn++;
     if (vocab_size > vocab_hash_size * 0.7) ReduceVocab();
   }
+
+  printf("before: </s> index = %d\n", SearchVocab("</s>"));
   SortVocab();
+  printf("after:  </s> index = %d\n", SearchVocab("</s>"));
+
   if (debug_mode > 0) {
     printf("Vocab size: %lld\n", vocab_size);
     printf("Words in train file: %lld\n", train_words);

$ make -C src
...
$ echo foo bar baz >in.txt
$ bin/word2vec -train in.txt
Starting training using file in.txt
before: </s> index = 0
after:  </s> index = -1   <------- oops!
Vocab size: 1
Words in train file: 0

I also verified that the original word2vec code did not have this problem.
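For illustration, one way a sort can keep the marker pinned at index 0 is to sort only the entries after it. The sketch below is a self-contained toy under that assumption, not the SortVocab from this repository: `toy_word`, `sort_keep_marker`, and `toy_search` are hypothetical names, and the linear lookup merely stands in for a real hash search.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a vocab entry: a word and its count. */
struct toy_word { const char *word; long long cn; };

/* qsort comparator: higher counts first. */
static int by_count_desc(const void *a, const void *b) {
  long long ca = ((const struct toy_word *)a)->cn;
  long long cb = ((const struct toy_word *)b)->cn;
  return (cb > ca) - (cb < ca);
}

/* Sorts entries 1..n-1 by count and leaves slot 0 (assumed to hold
 * "</s>") untouched, so the marker keeps index 0. */
void sort_keep_marker(struct toy_word *v, size_t n) {
  assert(n > 0 && strcmp(v[0].word, "</s>") == 0);
  qsort(&v[1], n - 1, sizeof v[0], by_count_desc);
}

/* Linear lookup standing in for a hash search; returns -1 if absent. */
int toy_search(const struct toy_word *v, size_t n, const char *w) {
  for (size_t i = 0; i < n; i++)
    if (strcmp(v[i].word, w) == 0) return (int)i;
  return -1;
}
```

A check in the spirit of the printf statements above would be to call `toy_search(v, n, "</s>")` before and after `sort_keep_marker(v, n)` and confirm that it returns 0 both times.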

joelb-git avatar Jun 22 '17 21:06 joelb-git

Hey @joelb-git, I just ran into this as well. Did you find out anything more in the meantime?

Simsso avatar May 28 '19 14:05 Simsso

Hi @Simsso - no, I never got a response on this, and it was a while ago. I think I ended up just using the original code instead, at https://github.com/tmikolov/word2vec.git

joelb-git avatar May 28 '19 15:05 joelb-git

Thx, will do the same!

Simsso avatar May 28 '19 15:05 Simsso