text2vec
tcm (by `create_tcm`) is not documented.
I am puzzled about what exactly the TCM (term co-occurrence matrix) is. The documentation of `create_tcm`
only says that
This is a function for constructing a term-co-occurrence matrix(TCM). TCM matrix usually used with GloVe word embedding model.
and that its value is
dgTMatrix TCM matrix
Pennington, Socher and Manning, when introducing GloVe, define
matrix of word-word co-occurrence counts be denoted by $X$, whose entries $X_{ij}$ tabulate the number of times word $j$ occurs in the context of word $i$
My reading is that this matrix should be symmetric, i.e. $X_{ij} = X_{ji}$, if the context is symmetric and the weights are 1. However, consider a very simple example with window 1:
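library(text2vec)

# One document, tokenized on whitespace by the default itoken() settings
doc <- c("a b c b a")
it <- itoken(doc)
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)
# Symmetric window of size 1, every co-occurrence weighted 1
tcm <- create_tcm(it,
                  vectorizer,
                  skip_grams_window = 1,
                  skip_grams_window_context = "symmetric",
                  weights = 1)
tcm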
This results in
3 x 3 sparse Matrix of class "dgTMatrix"
  c a b
c . . 2
a . . 2
b . . .
This is clearly not symmetric, e.g. there is no context for the word "b". The rest of it makes sense: "c" has two "b"s as context, and "a" has two "b"s in a similar fashion.
Does the returned TCM only fill out the upper triangle? This seems to be confirmed by the documentation for `coherence`.
I am happy to contribute PRs and such, but would like to hear from you before I do so.
Hi, yes, the TCM is symmetric, so we keep only the upper triangle to save memory. A PR to update the docs would be appreciated.
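To make the upper-triangle storage concrete, here is a minimal base R sketch that counts the co-occurrences for "a b c b a" by hand. It does not call text2vec and is not how `create_tcm` is implemented. The variable names (`full`, `upper`, `restored`) are made up for illustration, and the vocabulary order is simply copied from the printed output in the question.

```r
# Hand-rolled co-occurrence counts for the example document (base R only).
tokens <- strsplit("a b c b a", " ", fixed = TRUE)[[1]]
vocab  <- c("c", "a", "b")   # assumed: same term order as the printed TCM
window <- 1L

# Full matrix: full[i, j] = number of times word j occurs in the context of word i.
full <- matrix(0, nrow = length(vocab), ncol = length(vocab),
               dimnames = list(vocab, vocab))
for (i in seq_along(tokens)) {
  lo <- max(1L, i - window)
  hi <- min(length(tokens), i + window)
  for (j in setdiff(lo:hi, i)) {
    full[tokens[i], tokens[j]] <- full[tokens[i], tokens[j]] + 1
  }
}

# Keep only the upper triangle (including the diagonal).
upper <- full
upper[lower.tri(upper)] <- 0

# Mirror the upper triangle back; subtract the diagonal so it is not counted twice.
restored <- upper + t(upper) - diag(diag(upper))

print(full)
print(upper)
```

Here `full` is symmetric, while `upper` has its only non-zero entries at ("c", "b") and ("a", "b"), both equal to 2, which matches the matrix printed in the question. Whether the diagonal of a real text2vec TCM should be subtracted in the same way when mirroring is an assumption of this sketch and worth checking before relying on it.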