
Option of weighting words

Open guivivi opened this issue 4 years ago • 3 comments

Dear Jan,

Many thanks for this outstanding package.

I am working through the second example in the help file for ?embed_sentencespace and I have the following question:

When computing the sentence similarities, I am wondering whether there is a way to weight the words that make up the sentence. For example, in sentence <- "Wat zijn de cijfers qua doorstroming van 2016?", let's say that I would like to emphasize that 'cijfers' is the most important word for finding similar sentences.

Is it possible to assign a weight that tells the algorithm to favor sentences containing 'cijfers'?

Looking at the package manual, I see that there are some arguments related to weighting, namely, wordWeight and useWeight, but I do not know how they must be used.

Any help would be very much appreciated.

Kind regards,

Guillermo

guivivi, Jan 26 '21 09:01

The package always starts by building a model from a file. If you can construct a file in the format shown below (see the StarSpace README: https://github.com/facebookresearch/StarSpace/blob/master/README.md), you can build a model by setting useWeight = TRUE.

word_1:wt_1 word_2:wt_2 ... word_k:wt_k __label__1:lwt_1 ... __label__r:lwt_r
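
To make the format concrete, here is a minimal base R sketch that builds one such line from a vector of words and a vector of weights. The helper name starspace_weighted_line, the weights and the label "example_label" are made up for illustration; they are not part of ruimtehol or StarSpace.

```r
# Hypothetical helper (not part of ruimtehol): build one training line in the
# weighted StarSpace format "word:weight ... __label__x:weight"
starspace_weighted_line <- function(words, word_weights,
                                    labels = character(),
                                    label_weights = numeric()) {
  stopifnot(length(words) == length(word_weights),
            length(labels) == length(label_weights))
  features <- paste0(words, ":", word_weights)
  targets <- if (length(labels) > 0) {
    paste0("__label__", labels, ":", label_weights)
  } else {
    character()
  }
  paste(c(features, targets), collapse = " ")
}

# Illustrative weights: 'cijfers' counts five times as much as the other words
line <- starspace_weighted_line(
  words = c("Kunt", "u", "cijfers", "meedelen"),
  word_weights = c(1, 1, 5, 1),
  labels = "example_label",
  label_weights = 1
)
line
```

Lines built this way can be written to a text file with writeLines() and used as training data for a model built with useWeight = TRUE.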

It might as well be that you are looking for something called word mover's distance (http://proceedings.mlr.press/v37/kusnerb15.pdf)? While I was working on the R package doc2vec (https://www.bnosac.be/index.php/blog/103-doc2vec-in-r and https://github.com/bnosac/doc2vec), the C++ backend there allowed providing weights for certain words as well, but I removed that functionality last week in order to comply with CRAN policies. The R package text2vec from @dselivanov has a function called RelaxedWordMoversDistance, into which you can plug the embeddings coming from any of the R packages ruimtehol, text2vec, word2vec or doc2vec.

And nothing stops you from calculating a different embedding for each document by using whichever linear combination of the word vectors that is coming out of these different packages.
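
As a rough sketch of that last idea, the base R code below computes a weighted average of word vectors as the embedding of a sentence. The embedding matrix holds made-up numbers and the helper weighted_sentence_embedding is hypothetical; in practice the rows would be the word vectors obtained from one of the packages mentioned above.

```r
# Toy word embeddings (made-up numbers): one row per word, 3 dimensions
embeddings <- rbind(
  Kunt     = c(0.1,  0.3, -0.2),
  u        = c(0.0,  0.1,  0.1),
  cijfers  = c(0.9, -0.4,  0.5),
  meedelen = c(0.2,  0.2,  0.0)
)

# Hypothetical helper: weighted average of the word vectors of one sentence.
# Tokens without an embedding (e.g. punctuation) are dropped.
weighted_sentence_embedding <- function(tokens, weights, embeddings) {
  stopifnot(length(tokens) == length(weights))
  known   <- tokens %in% rownames(embeddings)
  tokens  <- tokens[known]
  weights <- weights[known]
  if (length(tokens) == 0 || sum(weights) == 0) {
    return(rep(0, ncol(embeddings)))
  }
  # Row i of the matrix is multiplied by weights[i] before summing the columns
  colSums(embeddings[tokens, , drop = FALSE] * weights) / sum(weights)
}

tokens <- c("Kunt", "u", "cijfers", "meedelen", "?")
emb_plain    <- weighted_sentence_embedding(tokens, c(1, 1, 1, 1, 1), embeddings)
emb_weighted <- weighted_sentence_embedding(tokens, c(1, 1, 5, 1, 1), embeddings)
emb_weighted
```

With the larger weight on 'cijfers', the sentence vector moves towards the vector of that word, so similarity searches based on it favor sentences that are close to 'cijfers'.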

jwijffels, Jan 26 '21 11:01

Hi Jan, many thanks for the insights.

Regarding creating the file with weights, I think I have been able to do it. Following the second example of embed_sentencespace, the idea is to add a column with the weights and paste it onto the tokens. This is an illustration for the case where I want to highlight the importance of the word 'cijfers':

library(udpipe)
library(dplyr)  # provides %>%, filter() and mutate() used below
data(dekamer, package = "ruimtehol")
x <- udpipe(dekamer$question, "dutch", tagger = "none", parser = "none", trace = 100)
x <- x[, c("doc_id", "sentence_id", "sentence", "token")]

# Keep one example sentence; give weight 1 to 'cijfers' and 0 to all other tokens
x <- x %>% 
  filter(doc_id == "doc115", sentence_id == "7") %>%
  mutate(weight = ifelse(token == "cijfers", 1, 0))
x
 doc_id sentence_id                 sentence    token weight
 doc115           7 Kunt u cijfers meedelen?     Kunt      0
 doc115           7 Kunt u cijfers meedelen?        u      0
 doc115           7 Kunt u cijfers meedelen?  cijfers      1
 doc115           7 Kunt u cijfers meedelen? meedelen      0
 doc115           7 Kunt u cijfers meedelen?        ?      0

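# Collapse each document into a single line: tokens become word:weight pairs
# separated by spaces, and sentences are separated by tabs
x <- split(x, f = x$doc_id)
x <- sapply(x, FUN = function(tokens) {
  sentences <- split(tokens, tokens$sentence_id)
  sentences <- sapply(sentences, FUN = function(x) paste(x$token, ":", x$weight, sep = "", 
                                                         collapse = " "))
  paste(sentences, collapse = "\t")
})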
x
"Kunt:0 u:0 cijfers:1 meedelen:0 ?:0"

For anyone interested, the extended function is available at: https://www.uv.es/vivigui/docs/embed_sentencespace_weighted.R

Basically, I have added the paste(x$token, ":", x$weight, sep = "", collapse = " ") call shown above and the condition stopifnot(all(c("doc_id", "sentence_id", "token", "weight") %in% colnames(x)))

I have run a couple of tests with embed_sentencespace_weighted(..., useWeight = TRUE) and it indeed seems to take the added weights into account.

Please correct me if I am wrong in my procedure.

I am now going to read up on word mover's distance, a concept that is new to me.

guivivi, Jan 27 '21 09:01

Looks correct to me

jwijffels, Jan 27 '21 09:01