Unsupervised Text Style Transfer using Language Models as Discriminators
- Authors: Zichao Yang, Zhiting Hu, Chris Dyer, Eric P. Xing, Taylor Berg-Kirkpatrick
- Organization: CMU and DeepMind
- Conference: NIPS 2018
- Paper:
- Publish Date: 2018.05
- This paper proposes to use a target-domain language model as a discriminator in GAN training.
- Motivation: the error signal that a binary-classifier discriminator provides to the generator is usually unstable and insufficient.
- The empirical results show that it is possible to eliminate adversarial steps during training.
- The paper gives a thorough survey of related work, including non-parallel transfer in NLP, GANs, style transfer in CV, and LMs for reranking.
Unsupervised Text Style Transfer
- Review the current approaches from Hu et al. and Shen et al.
- Input: two unpaired text datasets X = {x_1, ..., x_m} and Y = {y_1, ..., y_n}, with their corresponding styles v_x, v_y (each can be a label embedding).
- Use an encoder E to encode a sentence x (or y) into a content vector z_x = E(x, v_x) (or z_y = E(y, v_y)).
- Use a decoder G to generate the style-transferred sentence G(z, v). (The x, y subscripts are omitted.)
- To guarantee that z_x and z_y follow the same distribution, assume p(z) follows a prior distribution and add a KL-divergence regularizer on z_x and z_y. (The model becomes a VAE.)
- However, the posterior distribution of z fails to capture the content of a sentence.
- To capture the desired style in the generated sentence, Hu et al. additionally apply a style classifier to the generated samples, and the decoder G is trained to maximize the accuracy of that classifier.
- Shen et al. use GAN training to align the distributions of z_x and z_y.
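The KL-divergence regularizer mentioned above can be sketched as follows. This assumes a diagonal Gaussian posterior and a standard normal prior, a common VAE choice that these notes do not specify; the function name `kl_to_standard_normal` is hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions.

    Assumption: Gaussian posterior with diagonal covariance and a standard
    normal prior. `mu` and `log_var` are equal-length lists of floats.
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )


# A posterior equal to the prior has zero divergence.
kl_same = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# Shifting the mean by 1 in one dimension costs 0.5 nats.
kl_shifted = kl_to_standard_normal([1.0], [0.0])
```

Adding this term for both z_x and z_y pulls the two posteriors toward the same prior, which is how the regularizer encourages a shared content space.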
Language Models as Discriminators
Model Architectures
- Equations (1) and (2) in the paper train the LMs with GAN-style (adversarial) training.
- However, because an LM is a structured discriminator, the hope is that it assigns high perplexity to negative (fake) sentences even without being trained on them, so negative samples may not be necessary. To test this, the authors weight the loss on negative samples by γ. If γ = 0, the LM is trained on real sentences only.
- Experiments show that adding negative samples sometimes improves the results, but in practice it makes training very unstable and the model diverges easily.
- Train LMs according to equation (1) & (2).
- Minimize reconstruction loss.
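The role of γ in the LM objective can be illustrated with a minimal sketch. Equations (1) and (2) are not reproduced in these notes, so the form below is an assumption: minimize the negative log-likelihood of real sentences and, with weight γ, push down the log-likelihood of generated ones. The function name and arguments are hypothetical.

```python
def lm_discriminator_loss(real_log_probs, fake_log_probs, gamma):
    """Sketch of an LM-as-discriminator loss (assumed form, not the paper's text).

    real_log_probs: sentence-level log p_LM(y) for real target-domain sentences.
    fake_log_probs: sentence-level log p_LM(y~) for generated sentences.
    gamma: weight on the negative-sample term; 0 disables it.
    """
    # Standard LM training: minimize the average NLL of real sentences.
    real_nll = -sum(real_log_probs) / len(real_log_probs)
    if gamma == 0 or not fake_log_probs:
        return real_nll
    # Adversarial term: minimizing +log p_LM(fake) raises fake perplexity.
    fake_term = sum(fake_log_probs) / len(fake_log_probs)
    return real_nll + gamma * fake_term


loss_no_negatives = lm_discriminator_loss([-2.0, -4.0], [-10.0], gamma=0.0)
loss_with_negatives = lm_discriminator_loss([-2.0, -4.0], [-10.0], gamma=0.5)
```

With γ = 0 the function reduces to ordinary maximum-likelihood LM training, which matches the no-adversarial-step setting described above.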
Continuous approximation (Figure 2)
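- Approximate the discrete output of G with Gumbel-softmax so that the generated sentence stays differentiable, then compute the cross-entropy loss under the LM.
- Feed the LM a weighted average of word embeddings, weighted by the Gumbel-softmax distribution, in place of a discrete token embedding. (See the paper for details.)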
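The continuous approximation can be sketched for a single time step in plain Python. This is an illustration under stated assumptions, not the paper's implementation: the vocabulary has three words, the embedding table is made up, and the helper names (`gumbel_softmax`, `soft_embedding`, `soft_cross_entropy`) are hypothetical.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Return a relaxed one-hot sample over the vocabulary."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)), u ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # avoid log(0)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax over the perturbed, scaled logits.
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def soft_embedding(probs, embedding_table):
    """Probability-weighted average of embedding rows (the LM's soft input)."""
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]


def soft_cross_entropy(probs, lm_log_probs):
    """Cross-entropy between the soft token and the LM's predicted distribution."""
    return -sum(p * lp for p, lp in zip(probs, lm_log_probs))


rng = random.Random(0)
table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 3-word, 2-d embeddings
probs = gumbel_softmax([2.0, 0.5, -1.0], temperature=0.5, rng=rng)
emb = soft_embedding(probs, table)
```

Lower temperatures push `probs` closer to a one-hot vector, so the soft embedding approaches the embedding of a single sampled word while gradients can still flow back to the generator's logits.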
Question: Why not simply use policy gradient?
Overcoming mode collapse
- Preliminary experiments show that LM prefers short sentences.
- Two tricks are applied:
  - Normalize the loss by sentence length.
  - Fix the length of the generated sentence to be the same as that of the input sentence.
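The first trick can be shown with a small sketch. It assumes the LM loss is a sum of per-token negative log-probabilities; the function name `length_normalized_nll` is hypothetical.

```python
def length_normalized_nll(token_log_probs):
    """Average per-token negative log-likelihood of one sentence."""
    if not token_log_probs:
        raise ValueError("sentence must contain at least one token")
    return -sum(token_log_probs) / len(token_log_probs)


short_sentence = [-1.0, -1.0]             # 2 tokens
long_sentence = [-1.0, -1.0, -1.0, -1.0]  # 4 tokens, same per-token quality

# Unnormalized sums favor the shorter sentence (2.0 vs 4.0).
short_total = -sum(short_sentence)
long_total = -sum(long_sentence)

# After normalization both sentences score the same.
short_norm = length_normalized_nll(short_sentence)
long_norm = length_normalized_nll(long_sentence)
```

Without normalization every extra token adds to the loss, so the generator can lower it simply by producing fewer tokens, which is consistent with the preference for short sentences noted above.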
- Toward controlled generation of text by Hu et al. ICML 2017.
- Style transfer from non-parallel text by cross-alignment by Shen et al. NIPS 2017.