SentenceRepresentation
How to retrieve the embeddings after training?
Thank you for releasing the code! I have trained the model using my own data from scratch. But I am unable to understand how to retrieve the embeddings for the training data, or for new data. Any help will be appreciated.
Has anyone been able to figure this out yet? I have the same question.
@ghod5 I haven't been able to.
I think the final sentence embedding is computed by averaging the word vectors. After training the model, you can do this yourself if you have access to the word vectors.
Yes, that's correct - let me know if you need further clarification!
Felix Hill, University of Cambridge
http://www.cl.cam.ac.uk/~fh295/
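For anyone who wants to do that averaging themselves, here is a minimal sketch. It assumes a gensim-style model that supports `w in model` membership tests and `model[w]` vector lookup (as FastSent does); the `toy` dict below is a stand-in for a trained model.

```python
import numpy as np

def sentence_embedding(model, sentence):
    """Average the word vectors of the in-vocabulary tokens of `sentence`."""
    vecs = [model[w] for w in sentence.lower().split() if w in model]
    if not vecs:
        raise ValueError("no in-vocabulary words in %r" % sentence)
    return np.mean(vecs, axis=0)

# Toy stand-in for a trained model: any word-to-vector mapping works.
toy = {"i": np.array([1.0, 0.0]), "have": np.array([0.0, 1.0]),
       "a": np.array([1.0, 1.0]), "cat": np.array([3.0, 1.0])}
sentence_embedding(toy, "i have a cat")  # → array([1.25, 0.75])
```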
I tried training the model as well (by running `bash train_model.sh`), but I ran into the problem that the vectors all consist of `nan`.
Command line:
export PYTHONPATH="${PYTHONPATH}:$PWD/gensim"
Inside Python:
>>> from gensim.models.fastsent import FastSent
No handlers could be found for logger "gensim.models.doc2vec"
>>> model = FastSent.load('FastSent_no_autoencoding_300_10_0')
>>> model.sentence_similarity('i have a cat', 'i have a dog')
nan
So I tried to get the actual vectors:
>>> model['apple']
array([ nan,  nan,  nan, ...,  nan,  nan,  nan], dtype=float32)
I'm assuming something went wrong with the training, but I don't know what. For completeness' sake, here are syn0 and syn1:
>>> import numpy
>>> x = numpy.load('FastSent_no_autoencoding_300_10_0.syn0.npy')
>>> x
array([[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
...,
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan]], dtype=float32)
>>> x = numpy.load('FastSent_no_autoencoding_300_10_0.syn1.npy')
>>> x
array([[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
...,
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
[ 0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
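(Not part of the repo, but a quick NaN audit along these lines makes it easy to see how much of `syn0`/`syn1` was corrupted, rather than eyeballing the array repr:)

```python
import numpy as np

def nan_rows(matrix):
    """Return (number of rows containing at least one NaN, total rows)."""
    bad = np.isnan(matrix).any(axis=1)
    return int(bad.sum()), matrix.shape[0]

# e.g. nan_rows(np.load('FastSent_no_autoencoding_300_10_0.syn0.npy'))
demo = np.array([[1.0, 2.0], [np.nan, 3.0], [0.0, 0.0]])
nan_rows(demo)  # → (1, 3)
```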
So I'm kind of hoping for either of two things:
- Could you tell me what caused the training to go wrong like this? (I did everything exactly as the README says.)
- Could you upload the pre-trained vectors somewhere?
I found a possible cause: I modified `trainFastSent.py` to replace `model.save` with `model.save_fastsent_format`, and then trained again on 2m sentences. The resulting file did have the vectors in it. Now I'll try the 70m sentences again. Fingers crossed!
Hi @evanmiltenburg, just wondering if you ever got this to work. I'm having the same problems as you. I just used `model.save_fastsent_format` (and `model.load_fastsent_format`) after training on the full 70m, but I'm still getting NaNs.
Not sure whether I ever got this to work.
I would suggest trying StarSpace if you want sentence embeddings, or simply averaging word embeddings from FastText/word2vec.
There are also several versions (pull requests) in the Gensim repo, though they never seem to have been merged.
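As a sketch of that averaging fallback, cosine similarity over averaged word vectors gives you a drop-in replacement for `sentence_similarity`. The `toy` dict below stands in for real FastText/word2vec vectors, which you would normally load via gensim's `KeyedVectors`.

```python
import numpy as np

def avg_embed(vectors, sentence):
    """Average the available word vectors for a whitespace-tokenised sentence."""
    vecs = [vectors[w] for w in sentence.lower().split() if w in vectors]
    return np.mean(vecs, axis=0)

def sentence_similarity(vectors, s1, s2):
    """Cosine similarity between two averaged sentence embeddings."""
    a, b = avg_embed(vectors, s1), avg_embed(vectors, s2)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

toy = {"i": np.array([1.0, 0.0]), "have": np.array([0.0, 1.0]),
       "a": np.array([1.0, 1.0]), "cat": np.array([3.0, 1.0]),
       "dog": np.array([2.0, 2.0])}
sentence_similarity(toy, "i have a cat", "i have a dog")  # ≈ 0.97
```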