SentenceRepresentation
How to retrieve the embeddings after training?
Thank you for releasing the code! I have trained the model using my own data from scratch. But I am unable to understand how to retrieve the embeddings for the training data, or for new data. Any help will be appreciated.
Has anyone been able to figure this out yet? I have the same question.
@ghod5 I haven't been able to.
I think the final sentence embedding is computed by averaging the word vectors. After training the model, you can do this yourself if you have access to the word vectors.
Yes, that's correct - let me know if you need further clarification!
Felix Hill, University of Cambridge
http://www.cl.cam.ac.uk/~fh295/
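For anyone who wants to do that averaging themselves, here is a minimal sketch. It assumes a gensim-style model that supports `w in model` membership tests and `model[w]` vector lookup (as FastSent does); the `toy` dict below is a stand-in for a trained model.

```python
import numpy as np

def sentence_embedding(model, sentence):
    """Average the word vectors of the in-vocabulary tokens of `sentence`."""
    vecs = [model[w] for w in sentence.lower().split() if w in model]
    if not vecs:
        raise ValueError("no in-vocabulary words in %r" % sentence)
    return np.mean(vecs, axis=0)

# Toy stand-in for a trained model: any word-to-vector mapping works.
toy = {"i": np.array([1.0, 0.0]), "have": np.array([0.0, 1.0]),
       "a": np.array([1.0, 1.0]), "cat": np.array([3.0, 1.0])}
sentence_embedding(toy, "i have a cat")  # → array([1.25, 0.75])
```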
I tried training the model as well (by running `bash train_model.sh`), but I ran into the problem that the vectors all consist of `nan`.
Command line:
export PYTHONPATH="${PYTHONPATH}:$PWD/gensim"
Inside Python:
>>> from gensim.models.fastsent import FastSent
No handlers could be found for logger "gensim.models.doc2vec"
>>> model = FastSent.load('FastSent_no_autoencoding_300_10_0')
>>> model.sentence_similarity('i have a cat', 'i have a dog')
nan
So I tried to get the actual vectors:
>>> model['apple']
array([ nan,  nan,  nan, ...,  nan,  nan,  nan], dtype=float32)
I'm assuming something went wrong with the training, but I don't know what. For completeness' sake, here are syn0 and syn1:
>>> import numpy
>>> x = numpy.load('FastSent_no_autoencoding_300_10_0.syn0.npy')
>>> x
array([[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
...,
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan]], dtype=float32)
>>> x = numpy.load('FastSent_no_autoencoding_300_10_0.syn1.npy')
>>> x
array([[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
...,
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
[ 0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
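(Not part of the repo, but a quick NaN audit along these lines makes it easy to see how much of `syn0`/`syn1` was corrupted, rather than eyeballing the array repr:)

```python
import numpy as np

def nan_rows(matrix):
    """Return (number of rows containing at least one NaN, total rows)."""
    bad = np.isnan(matrix).any(axis=1)
    return int(bad.sum()), matrix.shape[0]

# e.g. nan_rows(np.load('FastSent_no_autoencoding_300_10_0.syn0.npy'))
demo = np.array([[1.0, 2.0], [np.nan, 3.0], [0.0, 0.0]])
nan_rows(demo)  # → (1, 3)
```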
So I'm kind of hoping for either of two things:
- Could you tell me what caused the training to go wrong like this? (I did everything exactly as the README says.)
- Could you upload the pre-trained vectors somewhere?
I found a possible cause: I modified `trainFastSent.py` to replace `model.save` with `model.save_fastsent_format`, and then trained again on 2m sentences. The resulting file did have the vectors in it. Now I'll try the 70m sentences again. Fingers crossed!
Hi @evanmiltenburg, just wondering if you ever got this to work. I'm having the same problems as you. I just used `model.save_fastsent_format` (and `model.load_fastsent_format`) after training on the full 70m, but I'm still getting NaNs.
Not sure whether I ever got this to work.
I would suggest trying StarSpace if you want sentence embeddings, or simply averaging word embeddings from FastText/word2vec.
There are also several versions (pull requests) in the Gensim repo, though they never seem to have been merged.
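As a sketch of that averaging fallback, cosine similarity over averaged word vectors gives you a drop-in replacement for `sentence_similarity`. The `toy` dict below stands in for real FastText/word2vec vectors, which you would normally load via gensim's `KeyedVectors`.

```python
import numpy as np

def avg_embed(vectors, sentence):
    """Average the available word vectors for a whitespace-tokenised sentence."""
    vecs = [vectors[w] for w in sentence.lower().split() if w in vectors]
    return np.mean(vecs, axis=0)

def sentence_similarity(vectors, s1, s2):
    """Cosine similarity between two averaged sentence embeddings."""
    a, b = avg_embed(vectors, s1), avg_embed(vectors, s2)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

toy = {"i": np.array([1.0, 0.0]), "have": np.array([0.0, 1.0]),
       "a": np.array([1.0, 1.0]), "cat": np.array([3.0, 1.0]),
       "dog": np.array([2.0, 2.0])}
sentence_similarity(toy, "i have a cat", "i have a dog")  # ≈ 0.97
```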