spacyr
Incorporation of pre-trained word embeddings functionality
spaCy now has this: https://spacy.io/usage/vectors-similarity
Maybe we want to make this functionality available in spacyr.
Any feedback/suggestions from users are welcome.
I made some attempts at this. Install the issue-171 branch and try the following:
library(spacyr)
# spacy_download_langmodel("en_core_web_md")
spacy_initialize("en_core_web_md") # or spacy_initialize("en_core_web_lg")
txt <- "To make them compact and fast, spaCy’s small models (all packages that end in sm) don’t ship with word vectors, and only include context-sensitive tensors. "
out <- spacy_parse(txt, embedding = TRUE)
attr(out, "embedding")
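As a quick sanity check (assuming, as the branch suggests, that the embedding attribute is a token-by-dimension matrix, with one vector per token at this stage; the 300 below is just the en_core_web_md vector width):
emb <- attr(out, "embedding")
# with en_core_web_md we would expect one row per parsed token
# and 300 columns (the model's vector dimensionality)
dim(emb)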
Hi, just tested this feature; it works fine with my configuration. I think it will definitely be useful for users who need to leverage state-of-the-art NLP approaches while sticking to their favorite [R] language. Malick.
Just experimented with this. A few comments on the branch.
Since this is looking up the tokens from the language model using Token.vector(), we don't really need to do this at the parsing stage. Instead, we could create a set of functions such as wordvectors_lookup() that look up the word vectors for a spacy_parsed object, but store them with one vector per type. wordvectors_apply(x, wordvectors) could then apply those created by wordvectors_lookup() to a spacy_parsed object x, or to a quanteda::tokens() object. This means that we could make this lookup functionality available to any package that has tokens or words.
We could create similar functions to weight or replace tokens with their L2-normed vector scores, similar to Token.vector_norm().
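For illustration, the L2 norm here is just the Euclidean length of each type's vector. A minimal sketch in base R of what such helpers might compute (the function names are illustrative, not part of the branch):
# given a v x d matrix of word vectors (one row per type), compute the
# L2 norm of each row, analogous to spaCy's Token.vector_norm
vector_norms <- function(wordvectors) {
  sqrt(rowSums(wordvectors^2))
}

# replace each vector with its unit-length (L2-normalized) version;
# dividing a matrix by a length-nrow vector scales each row
normalize_rows <- function(wordvectors) {
  wordvectors / vector_norms(wordvectors)
}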
+1 for this. It would be great to access the embeddings and then use, for example, the similarity() function described on the page you first linked to (https://spacy.io/usage/vectors-similarity).
@kbenoit and all
I've implemented a first version of two functions (spacy_wordvectors_lookup and spacy_wordvectors_apply). Please test and give some feedback.
The following is one of the expected use cases: calculating the similarity of short texts.
## devtools::install_github("quanteda/spacyr", ref = "issue-171")
library(quanteda)
library(tidyverse)
library(spacyr)
library(DBI)
spacy_initialize(model = "en_core_web_md")
# data from here
# https://www.kaggle.com/crowdflower/twitter-airline-sentiment
db <- dbConnect(RSQLite::SQLite(), "~/Downloads/database.sqlite")
set.seed(20191024)
corpus_tw <- tbl(db, "Tweets") %>%
  as_tibble() %>%
  sample_n(1000) %>%
  distinct(text, .keep_all = TRUE) %>%
  corpus(docid_field = "tweet_id")
twitter_parsed <- spacy_parse(corpus_tw, additional_attributes = "is_stop")
wordvectors <- spacy_wordvectors_lookup(twitter_parsed)
wordvec_matrix <- spacy_wordvectors_apply(twitter_parsed, wordvectors)
# convert the matrix to a tibble for further manipulation
wordvec_tb <- wordvec_matrix %>%
  as_tibble(.name_repair = "universal") %>%
  rename_all(str_replace, "\\D+", "D") %>%
  bind_cols(twitter_parsed)
# calculate the average of the word vectors in each text
doc_vec_avg <- wordvec_tb %>%
  filter(!is_stop) %>%
  group_by(doc_id) %>%
  summarise_at(1:300, mean) %>%
  ungroup()
# convert it to a dfm for the similarity calculation (since the matrix is
# dense, other packages might be faster for this)
temp <- doc_vec_avg %>%
  select(-1) %>%
  as.matrix() %>%
  as.dfm()
rownames(temp) <- paste(doc_vec_avg$doc_id)
simil_stat <- textstat_simil(temp, method = "cosine") %>%
  as.data.frame() %>%
  sample_n(1000) %>%
  arrange(-cosine) %>%
  mutate_at(1:2, as.character)
# print the ten most similar pairs from the sample
for (i in seq(10)) {
  cat(paste0("similarity: ", simil_stat$cosine[i], "\n",
             "doc1: ", corpus_tw[simil_stat$document1[i]], "\n",
             "doc2: ", corpus_tw[simil_stat$document2[i]], "\n\n"))
}
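As an aside on the dense-matrix comment above: once the averaged document vectors are in a plain matrix, base R matrix algebra gives the same cosine similarities directly. A minimal sketch, assuming doc_vec_avg from the example above:
# cosine similarity via matrix algebra on the dense document vectors
m <- as.matrix(doc_vec_avg[, -1])   # drop the doc_id column
rownames(m) <- doc_vec_avg$doc_id
m <- m / sqrt(rowSums(m^2))         # L2-normalize each document vector
cos_sim <- m %*% t(m)               # dot products of unit vectors = cosines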
@amatsuo tested and working fine! Thank you for implementing and updating us :)
How about
# works on a spacyr parsed object
wordvectors_get.spacyr_parsed(x, model)
# works on a named list of characters, such as from spacy_tokenize()
wordvectors_get.list(x, model)
to return a v x d matrix, where v is the number of types (unique tokens) and d is the number of dimensions. This is a dense matrix.
# attaches a special wordvectors attribute to the object
wordvectors_put.spacyr_parsed(x, wordvectors)
We don't do this for a list, since we can do that instead in quanteda::as.tokens().
The important things here are:
- when we get the word vectors from a language model, the result is not ntoken x d but rather ntype x d, so it is more efficient (and can be linked later using the token label as a key); and
- we can "put" any word vectors from any source, not just those taken from a spaCy language model. So the infrastructure is more general than spaCy, allowing us to take pre-trained word vectors from other sources, such as fastText, BERT, ELMo, etc.