
Code for <CGMH: Constrained Sentence Generation by Metropolis-Hastings Sampling>

Constrained Sentence Generation via Metropolis-Hastings Sampling

Introduction

CGMH is a sampling-based model for constrained sentence generation. It can be used for keyword-to-sentence generation, paraphrasing, sentence correction, and many other tasks.

Examples

  • Running example for paraphrase (all rejected proposals are omitted):
    what movie do you like most . ->
    which movie do you like most . (replace what with which) ->
    which movie do you like . (delete most) ->
    which movie do you like best . (insert best) ->
    which movie do you think best . (replace like with think) ->
    which movie do you think the best . (insert the) ->
    which movie do you think is the best . (insert is)

  • Running example for sentence correction:
    in the word oil price very high right now . ->
    in the word , oil price very high right now . (insert ,) ->
    in the word , oil prices very high right now . (replace price with prices) ->
    in the word , oil prices are very high right now . (insert are)

  • Extra Examples for sentence correction:
    origin: even if we are failed , we have to try to get a new things . ->
    generated: even if we are failing , we have to try to get some new things .

    origin: in the word oil price very high right now . ->
    generated: in the word , oil prices are very high right now .

    origin: the reason these problem occurs is also becayse of the exam . ->
    generated: the reason these problems occur is also because of the exam .
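
The running examples above move from one sentence to the next through word-level replace, insert, and delete proposals that are either accepted or rejected. The following is a minimal sketch of that Metropolis-Hastings loop, not the code in this repository. It assumes uniform proposals over a small hypothetical vocabulary and a toy scoring function (`toy_log_score`) that stands in for the language model and the task constraint. The names `propose`, `mh_step`, and `toy_log_score` are illustrative only.

```python
import math
import random


def propose(words, vocab, rng, max_len=12):
    """Propose one word-level edit.

    Returns (candidate, log_q_ratio), where log_q_ratio is
    log q(old | new) - log q(new | old), or (None, 0.0) if the
    chosen edit is not allowed (the chain then stays where it is).
    Assumes every word of `words` is in `vocab`.
    """
    op = rng.choice(["replace", "insert", "delete"])
    n = len(words)
    if op == "replace":
        i = rng.randrange(n)
        cand = words[:i] + [rng.choice(vocab)] + words[i + 1:]
        return cand, 0.0  # replace is symmetric under uniform word choice
    if op == "insert":
        if n >= max_len:
            return None, 0.0
        i = rng.randrange(n + 1)
        cand = words[:i] + [rng.choice(vocab)] + words[i:]
        # The reverse move is a delete, which does not pick a word.
        return cand, math.log(len(vocab))
    # delete
    if n <= 1:
        return None, 0.0
    i = rng.randrange(n)
    # The reverse move is an insert, which must pick the deleted word.
    return words[:i] + words[i + 1:], -math.log(len(vocab))


def toy_log_score(words, keyword="movie"):
    """Toy unnormalised log-probability (an assumption for this sketch)."""
    if keyword not in words:
        return float("-inf")  # hard constraint: the keyword must be kept
    repeats = sum(a == b for a, b in zip(words, words[1:]))
    return -0.5 * len(words) - 2.0 * repeats


def mh_step(words, log_score, vocab, rng):
    """One Metropolis-Hastings step; returns the next sentence."""
    cand, log_q_ratio = propose(words, vocab, rng)
    if cand is None:
        return words
    log_alpha = log_score(cand) - log_score(words) + log_q_ratio
    if rng.random() < math.exp(min(0.0, log_alpha)):
        return cand  # accepted proposal
    return words  # rejected proposal


rng = random.Random(0)
vocab = ["what", "which", "movie", "do", "you", "like",
         "most", "best", "think", "is", "the", "."]
sentence = "what movie do you like most .".split()
for _ in range(200):
    sentence = mh_step(sentence, toy_log_score, vocab, rng)
print(" ".join(sentence))
```

Because the toy score is minus infinity for any sentence without the keyword, such proposals are never accepted, so every sample keeps the keyword. A real scoring function would be far more informative than this toy one.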

Requirements

  • python

    • ==2.7
  • python packages

    • TensorFlow == 1.3.0 (other versions are not tested)
    • numpy
    • pickle
    • Rake (pip install python-rake)
    • zpar (pip install python-zpar, download model file from https://github.com/frcchang/zpar/releases/download/v0.7.5/english-models.zip and extract it to POS/english-models)
    • skipthoughts (needed only when config.sim=='skipthoughts')
    • en (get en from https://www.nodebox.net/code/index.php/Linguistics and put under liguistics/)
  • word embedding

    • To use word embeddings for paraphrase, first download or train a word embedding, place it at config.emb_path, and set config.sim='word_max'.

Language model download

  • For a pretrained language model, please download the following files and extract them under model/.
  • Correction and key-gen: https://drive.google.com/open?id=1L3q-xGD3lHNETfibERTIh-ciCXmzRs3i
  • Paraphrase: https://drive.google.com/open?id=1kTjnqO69CjwpBXwPtOPT6v7Ur7ro5nRR. Please put the .pkl file under data/quora

Word embedding download

  • For a pretrained word embedding, please download the following files.
  • Correction and key-gen: https://drive.google.com/open?id=1q79Dvrx3eapffHL4ApfrT0XpOgm3sKKF. Please put the .pkl file under data/1-billion
  • Paraphrase: https://drive.google.com/open?id=1ggEdFyLIrr9sjfG1SHxjyHgOYNKy3ySE. Please put the .pkl file under data/quora

Running

  • Training language models

    • For each task, first train a forward and a backward language model:
      set mode='forward' and then mode='backward' in config.py, one after the other.
      After each change, run python correction.py / paraphrase.py / key-gen.py to train the corresponding model.
  • Generation

    • To generate new samples for each task:
      set mode='use' and choose appropriate parameters in config.py.
      Put the inputs in 'input/input.txt', then run python correction.py / paraphrase.py / key-gen.py to generate.
      The outputs are placed in output.
  • Details

    • Make sure that paths for package and data are correctly set in 'config.py'.
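
Before running, it can help to check that the files and directories named in this README are in place. The sketch below is a hypothetical helper, not part of the repository. `missing_paths` and `EXPECTED_PATHS` are illustrative names, and the list assumes the default locations mentioned above. Not every task needs every path, so adjust the list to match your own config.py.

```python
import os

# Locations mentioned in this README; edit to match your config.py.
EXPECTED_PATHS = [
    "model",
    "data/quora",
    "data/1-billion",
    "POS/english-models",
    "input/input.txt",
]


def missing_paths(paths, root="."):
    """Return the entries of `paths` that do not exist under `root`."""
    return [p for p in paths if not os.path.exists(os.path.join(root, p))]


if __name__ == "__main__":
    for path in missing_paths(EXPECTED_PATHS):
        print("missing: " + path)
```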