spacyr
Incorporation of pre-trained word embeddings functionality
spaCy now has this: https://spacy.io/usage/vectors-similarity
Maybe we want to make this functionality available in spacyr.
Any feedback/suggestions from users are welcome.
I made some attempts at this. Install the issue-171 branch and try the following:
library(spacyr)
# spacy_download_langmodel("en_core_web_md")
spacy_initialize("en_core_web_md") # or spacy_initialize("en_core_web_lg")
txt <- "To make them compact and fast, spaCy’s small models (all packages that end in sm) don’t ship with word vectors, and only include context-sensitive tensors. "
out <- spacy_parse(txt, embedding = TRUE)
attr(out, "embedding")
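As a quick sanity check (assuming, as the branch suggests, that the embedding attribute is a token-by-dimension matrix, with one vector per token at this stage; the 300 below is just the en_core_web_md vector width):
emb <- attr(out, "embedding")
# with en_core_web_md we would expect one row per parsed token
# and 300 columns (the model's vector dimensionality)
dim(emb)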
Hi, just tested this feature; it works fine with my configuration. I think it will definitely be useful for users who need to leverage state-of-the-art NLP approaches while sticking to their favorite [R] language. Malick.
Just experimented with this. A few comments on the branch.
Since this is looking up the tokens from the language model using Token.vector(), we don't really need to do this at the parsing stage. Instead, we could create a set of functions such as wordvectors_lookup() that look up the word vectors for a spacy_parsed object, but store them with one vector per type. wordvectors_apply(x, wordvectors) could then apply those created by wordvectors_lookup() to a spacy_parsed object x, or to a quanteda::tokens() object. This means that we could make this lookup functionality available to any package that has tokens or words.
We could create similar functions to weight or replace tokens with their L2-normed vector scores, similar to Token.vector_norm().
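For illustration, the L2 norm here is just the Euclidean length of each type's vector. A minimal sketch in base R of what such helpers might compute (the function names are illustrative, not part of the branch):
# given a v x d matrix of word vectors (one row per type), compute the
# L2 norm of each row, analogous to spaCy's Token.vector_norm
vector_norms <- function(wordvectors) {
  sqrt(rowSums(wordvectors^2))
}

# replace each vector with its unit-length (L2-normalized) version;
# dividing a matrix by a length-nrow vector scales each row
normalize_rows <- function(wordvectors) {
  wordvectors / vector_norms(wordvectors)
}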
+1 for this. It would be great to access the embeddings and then use, for example, the similarity() function described on the page you first linked to (https://spacy.io/usage/vectors-similarity).
@kbenoit and all
I've implemented a first version of two functions (spacy_wordvectors_lookup and spacy_wordvectors_apply). Please test and give some feedback.
The following is one of the expected use cases: calculating the similarity of short texts.
## devtools::install_github("quanteda/spacyr", ref = "issue-171")
library(quanteda)
library(tidyverse)
library(spacyr)
library(DBI)
spacy_initialize(model = "en_core_web_md")
# data from here
# https://www.kaggle.com/crowdflower/twitter-airline-sentiment
db <- dbConnect(RSQLite::SQLite(), "~/Downloads/database.sqlite")
set.seed(20191024)
corpus_tw <- tbl(db, "Tweets") %>%
  as_tibble() %>%
  sample_n(1000) %>%
  distinct(text, .keep_all = TRUE) %>%
  corpus(docid_field = "tweet_id")
twitter_parsed <- spacy_parse(corpus_tw, additional_attributes = "is_stop")
wordvectors <- spacy_wordvectors_lookup(twitter_parsed)
wordvec_matrix <- spacy_wordvectors_apply(twitter_parsed, wordvectors)
# convert the matrix to a tibble for further manipulation
wordvec_tb <- wordvec_matrix %>%
  as_tibble(.name_repair = "universal") %>%
  rename_all(str_replace, "\\D+", "D") %>%
  bind_cols(twitter_parsed)
# calculate the average of the word vectors in each text
doc_vec_avg <- wordvec_tb %>%
  filter(!is_stop) %>%
  group_by(doc_id) %>%
  summarise_at(1:300, mean) %>%
  ungroup()
# convert it to a dfm for the similarity calculation (since the matrix is
# dense, other packages might be faster for this)
temp <- doc_vec_avg %>%
  select(-1) %>%
  as.matrix() %>%
  as.dfm()
rownames(temp) <- paste(doc_vec_avg$doc_id)
simil_stat <- textstat_simil(temp, method = "cosine") %>%
  as.data.frame() %>%
  sample_n(1000) %>%
  arrange(-cosine) %>%
  mutate_at(1:2, as.character)
# print the ten most similar pairs from the sample
for (i in seq(10)) {
  cat(paste0("similarity: ", simil_stat$cosine[i], "\n",
             "doc1: ", corpus_tw[simil_stat$document1[i]], "\n",
             "doc2: ", corpus_tw[simil_stat$document2[i]], "\n\n"))
}
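As an aside on the dense-matrix comment above: once the averaged document vectors are in a plain matrix, base R matrix algebra gives the same cosine similarities directly. A minimal sketch, assuming doc_vec_avg from the example above:
# cosine similarity via matrix algebra on the dense document vectors
m <- as.matrix(doc_vec_avg[, -1])   # drop the doc_id column
rownames(m) <- doc_vec_avg$doc_id
m <- m / sqrt(rowSums(m^2))         # L2-normalize each document vector
cos_sim <- m %*% t(m)               # dot products of unit vectors = cosines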
@amatsuo tested and working fine! Thank you for implementing and updating us :)
How about
# works on a spacyr parsed object
wordvectors_get.spacyr_parsed(x, model)
# works on a named list of characters, such as from spacy_tokenize()
wordvectors_get.list(x, model)
to return a v x d matrix, where v is the number of types (unique tokens) and d is the number of dimensions. This is a dense matrix.
# attaches a special wordvectors attribute to the object
wordvectors_put.spacyr_parsed(x, wordvectors)
We don't do this for a list, since we can do that instead in quanteda::as.tokens().
The important things here are:
- when we get the word vectors from a language model, the result is not ntoken x d but rather ntype x d, so it is more efficient (and can be linked later using the token label as a key); and
- we can "put" any word vectors from any source, not just those taken from a spaCy language model. So the infrastructure is more general than spaCy, allowing us to take pre-trained word vectors from other sources, such as fastText, BERT, ELMo, etc.