text2vec

tcm (by `create_tcm`) is not documented.

Open otoomet opened this issue 1 year ago • 1 comment

I am puzzled about what exactly a TCM (term co-occurrence matrix) is. The documentation of create_tcm says only that

This is a function for constructing a term-co-occurrence matrix(TCM). TCM matrix usually used with GloVe word embedding model.

and that its value is

dgTMatrix TCM matrix

Pennington, Socher and Manning, when introducing GloVe, define

matrix of word-word co-occurrence counts be denoted by $X$, whose entries $X_{ij}$ tabulate the number of times word $j$ occurs in the context of word $i$

My reading is that this matrix should be symmetric, i.e., $X_{ij} = X_{ji}$, if the context is symmetric and the weights are 1. However, consider a very simple example with a window of 1:

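library(text2vec)

# One short document; with window 1 each word's context is its direct neighbors
doc <- c("a b c b a")
it <- itoken(doc)
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)
# Symmetric context, window of 1, and unit weight for every co-occurrence
tcm <- create_tcm(it,
                  vectorizer,
                  skip_grams_window = 1,
                  skip_grams_window_context = "symmetric",
                  weights = 1)
tcm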

This results in

3 x 3 sparse Matrix of class "dgTMatrix"
  c a b
c . . 2
a . . 2
b . . .

This is clearly not symmetric: e.g., there is no context recorded for word "b". The rest of it makes sense: "c" has two "b"s as context, and "a" likewise has two "b"s.
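For comparison, here is a minimal base-R sketch of the counts one would expect under the definition quoted above (symmetric context, window 1, weight 1). It does not use text2vec; the names `tokens`, `terms`, and `counts` are made up for illustration, and the term order is copied from the printed output.

```r
# Tokens of the example document, split on spaces
tokens <- strsplit("a b c b a", " ")[[1]]

# Same term order as in the printed TCM
terms <- c("c", "a", "b")
counts <- matrix(0, nrow = 3, ncol = 3, dimnames = list(terms, terms))

# With window 1, every pair of adjacent tokens co-occurs once in each direction
for (k in seq_len(length(tokens) - 1)) {
  w <- tokens[k]
  v <- tokens[k + 1]
  counts[w, v] <- counts[w, v] + 1
  counts[v, w] <- counts[v, w] + 1
}

counts
isSymmetric(counts)
```

The resulting matrix is symmetric, and its upper triangle has the same non-zero entries as the output of create_tcm shown above.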

Does the returned TCM only fill out the upper triangle? The documentation for coherence seems to confirm this.

I am happy to contribute PRs and such, but would like to hear from you before I do so.

otoomet, Mar 14 '23 05:03

Hi, yes, the TCM matrix is symmetric, so we keep only the upper triangle to save memory. A PR to update the docs would be appreciated.

dselivanov, Mar 20 '23 07:03
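Given the upper-triangular storage described in this thread, one possible way to recover the full symmetric matrix is `forceSymmetric()` from the Matrix package. The sketch below is an illustration only, not text2vec's own method: it rebuilds the printed 3 x 3 example by hand with `sparseMatrix()` so that it runs without text2vec, and the names `tcm_upper` and `tcm_full` are hypothetical. It also assumes the diagonal should be kept as stored.

```r
library(Matrix)

# Hand-built copy of the upper-triangular TCM printed in the issue
# (an assumption for illustration; normally this would come from create_tcm)
terms <- c("c", "a", "b")
tcm_upper <- sparseMatrix(
  i = c(1, 2), j = c(3, 3), x = c(2, 2),
  dims = c(3, 3), dimnames = list(terms, terms)
)

# Mirror the upper triangle into the lower one; the diagonal is left unchanged
tcm_full <- forceSymmetric(tcm_upper, uplo = "U")

isSymmetric(tcm_full)
as.matrix(tcm_full)
```

In the full matrix, "b" now has "c" and "a" as context with count 2 each, which is what the symmetric definition leads one to expect.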