A Simple but Tough-to-Beat Baseline for Sentence Embeddings
Paper: http://104.155.136.4:3000/pdf?id=SyK00v5xx Blog post: http://www.offconvex.org/2016/02/14/word-embeddings-2/
Looks like an interesting idea
Thx! I've been subscribed to the offconvex blog for quite some time :-) Another thing I want to try: http://www.offconvex.org/2016/07/10/embeddingspolysemy/. I even created the rksvd repo to port the k-SVD algorithm, but can't find time to finish it =(
Hi, thanks for creating this super fast package. I use it a lot. I am trying to use the GloVe embeddings to create sentence representations. My first attempt is to just average the word embeddings per sentence. I could figure it out using other packages like cleanNLP, whose tokenizer provides a sentence id, but I would prefer to stay within the text2vec-verse. Do you think it is possible to average the embeddings per sentence using the current functions in the package? Thanks for your help.
@good-marketing, that's easy with a little bit of linear algebra :-) (however, I will probably create a model for this).
Below I will assume you already have dtm (a document-term matrix with word counts) and word_vectors (a matrix of word embeddings).
# keep only the words that actually have an embedding
common_terms = intersect(colnames(dtm), rownames(word_vectors))
# "l1" normalization makes each row sum to 1, so the matrix product
# below is exactly the average of the word vectors per document
dtm_averaged = normalize(dtm[, common_terms], "l1")
# you can re-weight dtm above with tf-idf instead of the "l1" norm
sentence_vectors = dtm_averaged %*% word_vectors[common_terms, ]
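For example, the tf-idf re-weighting mentioned in the comment could look like this (a minimal sketch reusing dtm, common_terms and word_vectors from above; text2vec's TfIdf model uses the "l1" norm by default, so rows remain weighted averages):

# tf-idf down-weights very frequent words before averaging
tfidf = TfIdf$new()
dtm_tfidf = fit_transform(dtm[, common_terms], tfidf)
sentence_vectors_tfidf = dtm_tfidf %*% word_vectors[common_terms, ]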
Let me know if the code above is not clear.
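In case it helps, here is also a rough sketch of how dtm and word_vectors themselves can be produced with text2vec. Here docs is an assumed character vector of raw documents, all parameter values are placeholders to tune for your corpus, and the GloVe constructor is shown with the current rank argument (older versions call it word_vectors_size):

library(text2vec)
# docs: one raw document per element (assumed)
it = itoken(docs, preprocessor = tolower, tokenizer = word_tokenizer)
vocab = prune_vocabulary(create_vocabulary(it), term_count_min = 5)
vectorizer = vocab_vectorizer(vocab)
# document-term matrix with raw word counts
dtm = create_dtm(it, vectorizer)
# term-co-occurrence matrix to train GloVe on
tcm = create_tcm(it, vectorizer, skip_grams_window = 5)
glove = GlobalVectors$new(rank = 50, x_max = 10)
# summing the main and context vectors is the usual recommendation
word_vectors = glove$fit_transform(tcm, n_iter = 10) + t(glove$components)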
Thanks for the prompt answer. I am able to run the code, now I'll try to figure out what to make of it ;-)
Hi Dmitriy,
I was looking at the results of the method you mentioned. The resulting sentence_vectors object is a matrix of n documents x d embedding dimensions (averaged word vectors). The problem is that I'd like a sentence representation, not a document representation; or am I misinterpreting your solution?
One thought I had was to split the documents into sentences and then create a dtm. Essentially each sentence then becomes a document, and I can apply the algebra you posted. I guess the dtm will be a lot sparser; I'm not sure what the effect will be. Do you think this is a 'correct' approach? Thanks for your help.
@good-marketing splitting documents into sentences is the way to go. We just change the level of granularity of the analysis. I think this approach is 100% correct; I would go the same way myself.
stringi::stri_split_* or stringr::str_split_* with a proper boundary specification can help with splitting into sentences.
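For example (a minimal sketch, assuming docs is a character vector of documents):

library(stringi)
# split each document into sentences using ICU boundary rules
sentences = unlist(stri_split_boundaries(docs, type = "sentence"))
sentences = stri_trim_both(sentences)
sentences = sentences[sentences != ""]
# each sentence is now a "document": rebuild the dtm from here
it = itoken(sentences, preprocessor = tolower, tokenizer = word_tokenizer)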
Great, thanks for the superfast response. Would you recommend tokenize_sentences from tokenizers..... just wondering since you're also a package author there ;-)
Yes, sure, you can use it. tokenizers just wraps the stringi package and provides a bit more convenient interface for tokenization.
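For example (a minimal sketch with made-up input):

library(tokenizers)
# one character vector of sentences per input document
sentences = tokenize_sentences(c("First sentence. And a second one.",
                                 "Another document."))
# flatten if you want one sentence per element
unlist(sentences)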
I'll take a shot at it next month, will keep you posted!