
Unsupervised Text Style Transfer using Language Models as Discriminators


Metadata

  • Authors: Zichao Yang, Zhiting Hu, Chris Dyer, Eric P. Xing, Taylor Berg-Kirkpatrick
  • Organization: CMU and DeepMind
  • Conference: NIPS 2018
  • Paper: https://arxiv.org/pdf/1805.11749.pdf
  • Publish Date: 2018.05


Summary

  • This paper proposes to use a target-domain language model as a discriminator in GAN training.
  • The motivation: the error signal for the generator provided by a binary-classifier discriminator is often unstable and insufficient.
  • The empirical results show that it is possible to eliminate adversarial steps during training.
  • Provides a thorough review of related work, such as non-parallel transfer in NLP, GANs, style transfer in computer vision, and LMs for reranking.

Unsupervised Text Style Transfer

  • Reviews the current approaches of Hu et al. and Shen et al.
  • Input: two unpaired text datasets X = {x_1, ..., x_m}, Y = {y_1, ..., y_n} and their corresponding styles v_x, v_y (which can be label embeddings).
  • Use an encoder E to encode a sentence x (or y) into a content vector z_x = E(x, v_x) (resp. z_y = E(y, v_y)).
  • Use a decoder G to generate the style-transferred sentence G(z, v) (the x/y subscripts are omitted); a minimal sketch of this encoder/decoder setup follows the list.
  • To guarantee that z_x and z_y follow the same distribution, assume p(z) follows a prior distribution and add a KL-divergence regularization on z_x and z_y (this turns the model into a VAE).
  • However, the posterior distribution of z fails to capture the content of a sentence.
  • To capture the desired style in the generated sentence, Hu et al. additionally apply a style classifier to the generated samples, and the decoder G is trained to maximize the accuracy of that classifier.
  • Shen et al. instead use adversarial (GAN) training to align the distributions of z_x and z_y.
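
Below is a minimal PyTorch sketch (my own, not the authors' code) of the shared encoder/decoder setup: a GRU encoder E(x, v) and decoder G(z, v) conditioned on a learned style embedding. The module names, sizes, and the choice to prepend the style embedding are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StyleTransferAE(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256, n_styles=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.style_embed = nn.Embedding(n_styles, emb_dim)     # v_x / v_y
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.proj = nn.Linear(hid_dim, vocab_size)

    def encode(self, tokens, style):
        # z = E(x, v): condition the encoder by prepending the style embedding.
        emb = self.embed(tokens)                                # (B, T, E)
        v = self.style_embed(style).unsqueeze(1)                # (B, 1, E)
        _, z = self.encoder(torch.cat([v, emb], dim=1))         # z: (1, B, H)
        return z

    def decode(self, z, tokens, style):
        # G(z, v): decode from content z conditioned on the (possibly swapped) style.
        # Teacher forcing here for simplicity; actual transfer decodes step by step.
        emb = self.embed(tokens)
        v = self.style_embed(style).unsqueeze(1)
        out, _ = self.decoder(torch.cat([v, emb], dim=1), z)
        return self.proj(out[:, :-1])                           # logits for each position

# Usage: reconstruct x with its own style, or transfer by decoding with the other style.
model = StyleTransferAE(vocab_size=10000)
x = torch.randint(0, 10000, (4, 12))
sx, sy = torch.zeros(4, dtype=torch.long), torch.ones(4, dtype=torch.long)
z_x = model.encode(x, sx)
logits_transfer = model.decode(z_x, x, sy)   # x's content rendered in style v_y
```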

Language Models as Discriminators

Model Architectures

Objective

  • Equations (1) & (2) train the LMs with GAN-style training.
  • However, since an LM is a structured discriminator, the hope is that it naturally assigns high perplexity to negative (fake) sentences, so negative samples may not be necessary. To investigate this, a weight γ is added to the loss of negative samples; if γ = 0, the LM is simply trained on real sentences. (A sketch of this weighted objective follows the list.)
  • Experiments show that adding negative samples sometimes improves the results; however, empirically, using negative samples makes training very unstable and the model diverges easily.
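
A minimal sketch (my reading of the objective, not the paper's exact equations) of the LM-discriminator loss with the γ weight: the LM is trained to assign low perplexity to real target-domain sentences and, weighted by γ, high perplexity to generated ones; γ = 0 reduces to ordinary LM training. `TinyLM` and the function names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """A small GRU language model used as the structured discriminator."""
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tokens):                       # tokens: (B, T)
        h, _ = self.rnn(self.embed(tokens))
        return self.out(h)                           # next-token logits: (B, T, V)

def lm_nll(lm, tokens):
    """Token-level negative log-likelihood of `tokens` under `lm`."""
    logits = lm(tokens[:, :-1])                      # predict token t+1 from its prefix
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))

def discriminator_loss(lm, real, fake, gamma):
    # Low NLL (perplexity) on real sentences, high NLL on fakes, weighted by gamma.
    # gamma = 0: the LM is trained on real sentences only.
    return lm_nll(lm, real) - gamma * lm_nll(lm, fake)
```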

Training

  • Train the LMs according to equations (1) & (2).
  • Minimize the reconstruction loss. (A rough sketch of one training iteration follows the list.)
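
A rough sketch of one training iteration (my own simplification, not the authors' exact schedule), reusing the `StyleTransferAE`, `TinyLM`, `lm_nll`, and `discriminator_loss` sketches above. `soft_transfer_logits` is a hypothetical helper standing in for the continuous relaxation described in the next section, and `soft_lm_loss` is sketched there.

```python
import torch
import torch.nn.functional as F

def train_step(model, lm_x, lm_y, opt_d, opt_g, x, y, sx, sy, gamma):
    # 1) Update the LM discriminators (equations (1) & (2)).
    with torch.no_grad():                            # fakes are constants for the LM update
        fake_y = soft_transfer_logits(model, x, sx, sy).argmax(-1)  # hypothetical helper
        fake_x = soft_transfer_logits(model, y, sy, sx).argmax(-1)
    d_loss = (discriminator_loss(lm_y, y, fake_y, gamma) +
              discriminator_loss(lm_x, x, fake_x, gamma))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Update encoder/decoder: reconstruction loss plus LM loss on transferred sentences.
    recon_logits = model.decode(model.encode(x, sx), x, sx)
    recon_loss = F.cross_entropy(recon_logits.reshape(-1, recon_logits.size(-1)),
                                 x.reshape(-1))
    transfer_loss = soft_lm_loss(lm_y, soft_transfer_logits(model, x, sx, sy))  # see next section
    g_loss = recon_loss + transfer_loss
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```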

Continuous approximation (Figure 2)

  • Use Gumbel-softmax to approximate the output sentence from G, then compute the cross-entropy loss under the LM.
  • Feed a weighted average of word embeddings (weighted by the relaxed distribution) into the LM instead of discrete tokens (see the paper for details); a minimal sketch follows the list.
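
A minimal sketch of the continuous approximation, assuming the `TinyLM` sketch above: sample a relaxed (soft) one-hot vector per position with Gumbel-softmax, feed the resulting weighted average of word embeddings to the LM, and compute a soft cross-entropy. The function name and temperature value are my own illustrative choices.

```python
import torch
import torch.nn.functional as F

def soft_lm_loss(lm, decoder_logits, temperature=0.5):
    """decoder_logits: (B, T, V) logits produced by the decoder G.
    Returns the LM cross-entropy computed on the soft (relaxed) sentence."""
    # Relaxed one-hot samples; gradients flow back into the decoder.
    soft_tokens = F.gumbel_softmax(decoder_logits, tau=temperature, dim=-1)   # (B, T, V)
    # Weighted average of word embeddings instead of a discrete lookup.
    soft_emb = soft_tokens @ lm.embed.weight                                  # (B, T, E)
    h, _ = lm.rnn(soft_emb)
    logits = lm.out(h[:, :-1])                       # predict position t+1 from the prefix
    log_probs = F.log_softmax(logits, dim=-1)
    # Soft cross-entropy: treat the relaxed distribution at t+1 as the target.
    return -(soft_tokens[:, 1:] * log_probs).sum(dim=-1).mean()
```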

Question: Why not simply use policy gradient?

Overcoming mode collapse

  • Preliminary experiments show that the LM loss prefers short sentences, since the loss is summed over tokens.
  • Two tricks are applied (a small sketch follows the list):
    • Normalize the loss by sentence length.
    • Fix the length of the generated sentence to be the same as that of the input sentence.
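
A small sketch of the length-normalization trick, under my own assumptions about shapes: per-token LM losses are averaged over the true sentence length rather than summed, so shorter sentences no longer get trivially lower loss. The second trick is just a decoding-time constraint: unroll the decoder for exactly as many steps as the input sentence has tokens.

```python
import torch

def length_normalized_loss(token_nll, lengths):
    """token_nll: (B, T) per-token negative log-likelihoods (padded positions ignored).
    lengths: (B,) true sentence lengths."""
    mask = (torch.arange(token_nll.size(1))[None, :] < lengths[:, None]).float()
    per_sentence = (token_nll * mask).sum(dim=1) / lengths.clamp(min=1).float()
    return per_sentence.mean()
```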
