A Simple but Tough-to-Beat Baseline for Sentence Embeddings
Paper: http://104.155.136.4:3000/pdf?id=SyK00v5xx Blog post: http://www.offconvex.org/2016/02/14/word-embeddings-2/
Looks like an interesting idea
Thx! I've been subscribed to the offconvex blog for quite some time :-) Another thing I want to try: http://www.offconvex.org/2016/07/10/embeddingspolysemy/. I even created the rksvd repo to port the k-SVD algorithm, but can't find time to finish it =(
Hi, thanks for creating this super fast package. I use it a lot. I am trying to use the GloVe embeddings to create sentence representations. My first attempt is to just average the word embeddings per sentence. I could figure it out using other packages like cleanNLP, whose tokenizer provides a sentence id, but I would prefer to stay within the text2vec-verse. Do you think it is possible to average the embeddings per sentence using the current functions in the package? Thanks for your help.
@good-marketing, that's easy with a little bit of linear algebra :-) (however, I will probably create a model for this).
Below I will assume you already have dtm (a document-term matrix with word counts) and word_vectors (a matrix of word embeddings).
# keep only the words that actually have an embedding
common_terms = intersect(colnames(dtm), rownames(word_vectors))
# "l1" normalization makes each row sum to 1, so the matrix product
# below is exactly the average of the word vectors per document
dtm_averaged = normalize(dtm[, common_terms], "l1")
# you can re-weight dtm above with tf-idf instead of the "l1" norm
sentence_vectors = dtm_averaged %*% word_vectors[common_terms, ]
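For example, the tf-idf re-weighting mentioned in the comment could look like this (a minimal sketch reusing dtm, common_terms and word_vectors from above; text2vec's TfIdf model uses the "l1" norm by default, so rows remain weighted averages):

# tf-idf down-weights very frequent words before averaging
tfidf = TfIdf$new()
dtm_tfidf = fit_transform(dtm[, common_terms], tfidf)
sentence_vectors_tfidf = dtm_tfidf %*% word_vectors[common_terms, ]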
Let me know if the code above is not clear.
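In case it helps, here is also a rough sketch of how dtm and word_vectors themselves can be produced with text2vec. Here docs is an assumed character vector of raw documents, all parameter values are placeholders to tune for your corpus, and the GloVe constructor is shown with the current rank argument (older versions call it word_vectors_size):

library(text2vec)
# docs: one raw document per element (assumed)
it = itoken(docs, preprocessor = tolower, tokenizer = word_tokenizer)
vocab = prune_vocabulary(create_vocabulary(it), term_count_min = 5)
vectorizer = vocab_vectorizer(vocab)
# document-term matrix with raw word counts
dtm = create_dtm(it, vectorizer)
# term-co-occurrence matrix to train GloVe on
tcm = create_tcm(it, vectorizer, skip_grams_window = 5)
glove = GlobalVectors$new(rank = 50, x_max = 10)
# summing the main and context vectors is the usual recommendation
word_vectors = glove$fit_transform(tcm, n_iter = 10) + t(glove$components)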
Thanks for the prompt answer. I am able to run the code, now I'll try to figure out what to make of it ;-)
Hi Dmitriy,
I was looking at the results of the method you mentioned. The resulting sentence_vectors object is a matrix of n documents x d embedding dimensions (averaged word vectors). The problem is that I'd like a sentence representation, not a document representation; or am I misinterpreting your solution?
One thought I had was to split the documents into sentences and then create a dtm. Essentially each sentence then becomes a document, and I can apply the algebra you posted. I guess the dtm will be a lot sparser; I'm not sure what the effect will be. Do you think this is a 'correct' approach? Thanks for your help.
@good-marketing splitting documents into sentences is the way to go. We just change the level of granularity of the analysis. I think this approach is 100% correct; I would go the same way myself.
stringi::stri_split_* or stringr::str_split_* with a proper boundary specification can help with splitting into sentences.
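For example (a minimal sketch, assuming docs is a character vector of documents):

library(stringi)
# split each document into sentences using ICU boundary rules
sentences = unlist(stri_split_boundaries(docs, type = "sentence"))
sentences = stri_trim_both(sentences)
sentences = sentences[sentences != ""]
# each sentence is now a "document": rebuild the dtm from here
it = itoken(sentences, preprocessor = tolower, tokenizer = word_tokenizer)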
Great, thanks for the superfast response. Would you recommend tokenize_sentences from tokenizers..... just wondering since you're also a package author there ;-)
Yes, sure, you can use it. tokenizers just wraps the stringi package and provides a bit more convenient interface for tokenization.
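For example (a minimal sketch with made-up input):

library(tokenizers)
# one character vector of sentences per input document
sentences = tokenize_sentences(c("First sentence. And a second one.",
                                 "Another document."))
# flatten if you want one sentence per element
unlist(sentences)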
I'll take a shot at it next month, will keep you posted!