
List boundary discards one token in the context window

Open jonnybluesman opened this issue 2 years ago • 4 comments

https://github.com/Andras7/word2vec-pytorch/blob/36b93a503e8b3b5448abbc0e18f2a6bd3e017fc9/word2vec/data_reader.py#L102

I think `i + boundary` should include a `+ 1` to make it inclusive; otherwise the right context takes one token less in the resulting skip-grams.
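For reference, here is a simplified sketch of the pair generation around that line (names taken from `data_reader.py`, but this is not the exact repository code) with the suggested `+ 1` applied:

```python
import numpy as np

def build_skipgram_pairs(word_ids, window_size):
    # Simplified sketch of the pair generation in data_reader.py (not the exact code).
    boundary = np.random.randint(1, window_size)
    pairs = []
    for i, u in enumerate(word_ids):
        # The original slice ends at i + boundary, i.e. at index i + boundary - 1,
        # so the right context is one token shorter than the left one.
        # Ending the slice at i + boundary + 1 makes the window symmetric.
        context = word_ids[max(i - boundary, 0):i + boundary + 1]
        pairs.extend((u, v) for v in context if u != v)
    return pairs
```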

jonnybluesman avatar Sep 09 '21 17:09 jonnybluesman

Why create a boundary like this: `boundary = np.random.randint(1, self.window_size)`

and not simply use the `window_size` value instead of `boundary`?

francesco-mollica avatar Nov 16 '21 12:11 francesco-mollica

> Why create a boundary like this: `boundary = np.random.randint(1, self.window_size)`
>
> and not simply use the `window_size` value instead of `boundary`?

Because with the random boundary you are implicitly giving "more importance" to the closest words in the neighbourhood, since more training pairs are created with those "close" tokens.
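A rough way to see this (illustrative numbers only, not repository code): with the boundary drawn uniformly, a context token at distance `d` from the centre word is only kept when `boundary >= d`, so nearer tokens end up in more training pairs.

```python
import numpy as np

# Illustration only (window_size of 5 is assumed, not taken from the repository):
# estimate how often a token at distance d from the centre survives the random boundary.
window_size = 5
trials = 100_000
counts = {d: 0 for d in range(1, window_size)}
for _ in range(trials):
    boundary = np.random.randint(1, window_size)  # same call as in data_reader.py
    for d in range(1, window_size):
        if d <= boundary:
            counts[d] += 1
for d, c in counts.items():
    print(f"distance {d}: kept ~{c / trials:.2f} of the time")
# Roughly: distance 1 -> 1.00, distance 2 -> 0.75, distance 3 -> 0.50, distance 4 -> 0.25
```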

jonnybluesman avatar Nov 16 '21 12:11 jonnybluesman

Do you mean explicitly? So why not just reduce the window size and keep it fixed, instead of using a boundary? Is this use of a boundary found in other implementations? Just to be clear, does this implementation use a random window size in the range (1, window_size)? And is it correct that the boundary changes with each new sentence? Thanks for the quick response!

francesco-mollica avatar Nov 16 '21 13:11 francesco-mollica

Can the concept of a boundary also be applied to CBOW-style training? I implemented it and I'm stuck, because the size of the context varies from sentence to sentence as the boundary changes, and putting it all into a single tensor creates big problems for me!

francesco-mollica avatar Nov 20 '21 12:11 francesco-mollica
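For what it's worth, one common way to handle variable-sized contexts in a CBOW setting is to pad them to a fixed length and mask the padding when averaging the context embeddings; a minimal sketch with assumed names (`pad_contexts` is hypothetical, not part of this repository or this thread):

```python
import torch

def pad_contexts(contexts, pad_id=0):
    # contexts: list of variable-length lists of context word ids
    # (hypothetical helper, not part of word2vec-pytorch).
    max_len = max(len(c) for c in contexts)
    ids = torch.full((len(contexts), max_len), pad_id, dtype=torch.long)
    mask = torch.zeros(len(contexts), max_len)
    for row, ctx in enumerate(contexts):
        ids[row, :len(ctx)] = torch.tensor(ctx, dtype=torch.long)
        mask[row, :len(ctx)] = 1.0
    return ids, mask

# Usage: embed the padded ids, zero out the padding, and average per row:
#   emb = embedding(ids)                                   # (batch, max_len, dim)
#   mean = (emb * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True)
```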