sentence2vec

Deep sentence embedding using Sequence to Sequence learning

screenshot

Installing

  1. Install Torch.

  2. Install the following additional Lua libs:

    luarocks install nn
    luarocks install rnn
    luarocks install penlight
    

    To train with CUDA, install the latest CUDA drivers and toolkit, then run:

    luarocks install cutorch
    luarocks install cunn
    

    To train with OpenCL, install the latest OpenCL Torch libs:

    luarocks install cltorch
    luarocks install clnn
    
  3. Download the Cornell Movie-Dialogs Corpus and extract all the files into data/cornell_movie_dialogs.
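Before training, it can help to confirm that the corpus landed in the expected directory. The following is a minimal sketch, not part of the repository: `check_corpus` is a hypothetical helper, and the two file names are the main files shipped in the Cornell Movie-Dialogs Corpus archive. Whether train.lua reads exactly these two files is an assumption.

```shell
#!/bin/sh
# check_corpus: hypothetical helper, not part of this repository.
# Verifies that the corpus directory contains the two main corpus files.
# Assumption: these are the files the training script needs.
check_corpus() {
    dir="${1:-data/cornell_movie_dialogs}"
    missing=0
    for f in movie_lines.txt movie_conversations.txt; do
        if [ ! -f "$dir/$f" ]; then
            # Report every missing file on stderr instead of stopping at the first.
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    # Exit status 0 when everything is present, 1 otherwise.
    return "$missing"
}
```

Called with no argument, the helper checks data/cornell_movie_dialogs, so it can be chained in front of the training command, for example `check_corpus && th train.lua`.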

Training

th train.lua [-h / options]

Use the --dataset NUMBER option to control the size of the dataset. Training on the full dataset takes about 5 hours per epoch.

The model will be saved to data/model.t7 after each epoch if it has improved (error decreased).

Getting a pretrained model

Download:

  1. The pretrained model, model.t7
  2. The vocabulary, vocab.t7

Put them into the data directory.

Extracting embeddings from sentences

Run the following command:

th -i extract_embeddings.lua --model_file data/model.t7 --input_file data/test_sentences.txt --output_file data/embeddings.t7 --cuda
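The command above fails late if one of its inputs is missing, so a small pre-flight check can save a Torch startup. This is a sketch under stated assumptions: `preflight_extract` is a hypothetical helper, not part of the repository, and it assumes the extraction script needs data/vocab.t7 alongside data/model.t7, which the README implies but does not state. It only prints the command rather than running it.

```shell
#!/bin/sh
# preflight_extract: hypothetical helper, not part of this repository.
# Checks that the model, vocabulary, and input file exist, then prints
# (does not run) the extraction command with the flags shown in the README.
preflight_extract() {
    input="${1:-data/test_sentences.txt}"
    output="${2:-data/embeddings.t7}"
    # Assumption: vocab.t7 is required next to model.t7 in the data directory.
    for f in data/model.t7 data/vocab.t7 "$input"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2
            return 1
        fi
    done
    echo "th -i extract_embeddings.lua --model_file data/model.t7 --input_file $input --output_file $output --cuda"
}
```

Run it from the repository root. The optional first and second arguments override the input and output paths, for example `preflight_extract data/my_sentences.txt data/my_embeddings.t7`, where both file names are placeholders.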

To visualize 2D projections of the embeddings, refer to example.ipynb.

Acknowledgments

This implementation uses code from Marc-André Cournoyer's repo.

License

MIT License