
Unsupervised Text Style Transfer using Language Models as Discriminators


Metadata

  • Authors: Zichao Yang, Zhiting Hu, Chris Dyer, Eric P. Xing, Taylor Berg-Kirkpatrick
  • Organization: CMU and DeepMind
  • Conference: NIPS 2018
  • Paper: https://arxiv.org/pdf/1805.11749.pdf
  • Publish Date: 2018.05


Summary

  • This paper proposes to use a target-domain language model as a discriminator in GAN training.
  • The motivation: the error signal for the generator provided by a binary-classifier discriminator is often unstable and insufficient.
  • The empirical results show that it is possible to eliminate adversarial steps during training.
  • Provides a thorough review of related work, such as non-parallel transfer in NLP, GANs, style transfer in computer vision, and LMs for reranking.

Unsupervised Text Style Transfer

  • Reviews the current approaches of Hu et al. and Shen et al.
  • Input: two unpaired text datasets X = {x_1, ..., x_m}, Y = {y_1, ..., y_n} and their corresponding styles v_x, v_y (which can be label embeddings).
  • Use an encoder E to encode a sentence x (or y) into a content vector z_x = E(x, v_x) (resp. z_y = E(y, v_y)).
  • Use a decoder G to generate the style-transferred sentence G(z, v) (the x/y subscripts are omitted); a minimal sketch of this encoder/decoder setup follows the list.
  • To guarantee that z_x and z_y follow the same distribution, assume p(z) follows a prior distribution and add a KL-divergence regularization on z_x and z_y (this turns the model into a VAE).
  • However, the posterior distribution of z fails to capture the content of a sentence.
  • To capture the desired style in the generated sentence, Hu et al. additionally apply a style classifier to the generated samples, and the decoder G is trained to maximize the accuracy of that classifier.
  • Shen et al. instead use adversarial (GAN) training to align the distributions of z_x and z_y.
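
Below is a minimal PyTorch sketch (my own, not the authors' code) of the shared encoder/decoder setup: a GRU encoder E(x, v) and decoder G(z, v) conditioned on a learned style embedding. The module names, sizes, and the choice to prepend the style embedding are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StyleTransferAE(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256, n_styles=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.style_embed = nn.Embedding(n_styles, emb_dim)     # v_x / v_y
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.proj = nn.Linear(hid_dim, vocab_size)

    def encode(self, tokens, style):
        # z = E(x, v): condition the encoder by prepending the style embedding.
        emb = self.embed(tokens)                                # (B, T, E)
        v = self.style_embed(style).unsqueeze(1)                # (B, 1, E)
        _, z = self.encoder(torch.cat([v, emb], dim=1))         # z: (1, B, H)
        return z

    def decode(self, z, tokens, style):
        # G(z, v): decode from content z conditioned on the (possibly swapped) style.
        # Teacher forcing here for simplicity; actual transfer decodes step by step.
        emb = self.embed(tokens)
        v = self.style_embed(style).unsqueeze(1)
        out, _ = self.decoder(torch.cat([v, emb], dim=1), z)
        return self.proj(out[:, :-1])                           # logits for each position

# Usage: reconstruct x with its own style, or transfer by decoding with the other style.
model = StyleTransferAE(vocab_size=10000)
x = torch.randint(0, 10000, (4, 12))
sx, sy = torch.zeros(4, dtype=torch.long), torch.ones(4, dtype=torch.long)
z_x = model.encode(x, sx)
logits_transfer = model.decode(z_x, x, sy)   # x's content rendered in style v_y
```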

Language Models as Discriminators

Model Architectures

Objective

  • Equations (1) & (2) train the LMs with GAN-style training.
  • However, since an LM is a structured discriminator, the hope is that it naturally assigns high perplexity to negative (fake) sentences, so negative samples may not be necessary. To investigate this, a weight γ is added to the loss of negative samples; if γ = 0, the LM is simply trained on real sentences. (A sketch of this weighted objective follows the list.)
  • Experiments show that adding negative samples sometimes improves the results; however, empirically, using negative samples makes training very unstable and the model diverges easily.
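
A minimal sketch (my reading of the objective, not the paper's exact equations) of the LM-discriminator loss with the γ weight: the LM is trained to assign low perplexity to real target-domain sentences and, weighted by γ, high perplexity to generated ones; γ = 0 reduces to ordinary LM training. `TinyLM` and the function names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """A small GRU language model used as the structured discriminator."""
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tokens):                       # tokens: (B, T)
        h, _ = self.rnn(self.embed(tokens))
        return self.out(h)                           # next-token logits: (B, T, V)

def lm_nll(lm, tokens):
    """Token-level negative log-likelihood of `tokens` under `lm`."""
    logits = lm(tokens[:, :-1])                      # predict token t+1 from its prefix
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))

def discriminator_loss(lm, real, fake, gamma):
    # Low NLL (perplexity) on real sentences, high NLL on fakes, weighted by gamma.
    # gamma = 0: the LM is trained on real sentences only.
    return lm_nll(lm, real) - gamma * lm_nll(lm, fake)
```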

Training

  • Train the LMs according to equations (1) & (2).
  • Minimize the reconstruction loss. (A rough sketch of one training iteration follows the list.)
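
A rough sketch of one training iteration (my own simplification, not the authors' exact schedule), reusing the `StyleTransferAE`, `TinyLM`, `lm_nll`, and `discriminator_loss` sketches above. `soft_transfer_logits` is a hypothetical helper standing in for the continuous relaxation described in the next section, and `soft_lm_loss` is sketched there.

```python
import torch
import torch.nn.functional as F

def train_step(model, lm_x, lm_y, opt_d, opt_g, x, y, sx, sy, gamma):
    # 1) Update the LM discriminators (equations (1) & (2)).
    with torch.no_grad():                            # fakes are constants for the LM update
        fake_y = soft_transfer_logits(model, x, sx, sy).argmax(-1)  # hypothetical helper
        fake_x = soft_transfer_logits(model, y, sy, sx).argmax(-1)
    d_loss = (discriminator_loss(lm_y, y, fake_y, gamma) +
              discriminator_loss(lm_x, x, fake_x, gamma))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Update encoder/decoder: reconstruction loss plus LM loss on transferred sentences.
    recon_logits = model.decode(model.encode(x, sx), x, sx)
    recon_loss = F.cross_entropy(recon_logits.reshape(-1, recon_logits.size(-1)),
                                 x.reshape(-1))
    transfer_loss = soft_lm_loss(lm_y, soft_transfer_logits(model, x, sx, sy))  # see next section
    g_loss = recon_loss + transfer_loss
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```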

Continuous approximation (Figure 2)

  • Use Gumbel-softmax to approximate the output sentence from G, then compute the cross-entropy loss under the LM.
  • Feed a weighted average of word embeddings (weighted by the relaxed distribution) into the LM instead of discrete tokens (see the paper for details); a minimal sketch follows the list.
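
A minimal sketch of the continuous approximation, assuming the `TinyLM` sketch above: sample a relaxed (soft) one-hot vector per position with Gumbel-softmax, feed the resulting weighted average of word embeddings to the LM, and compute a soft cross-entropy. The function name and temperature value are my own illustrative choices.

```python
import torch
import torch.nn.functional as F

def soft_lm_loss(lm, decoder_logits, temperature=0.5):
    """decoder_logits: (B, T, V) logits produced by the decoder G.
    Returns the LM cross-entropy computed on the soft (relaxed) sentence."""
    # Relaxed one-hot samples; gradients flow back into the decoder.
    soft_tokens = F.gumbel_softmax(decoder_logits, tau=temperature, dim=-1)   # (B, T, V)
    # Weighted average of word embeddings instead of a discrete lookup.
    soft_emb = soft_tokens @ lm.embed.weight                                  # (B, T, E)
    h, _ = lm.rnn(soft_emb)
    logits = lm.out(h[:, :-1])                       # predict position t+1 from the prefix
    log_probs = F.log_softmax(logits, dim=-1)
    # Soft cross-entropy: treat the relaxed distribution at t+1 as the target.
    return -(soft_tokens[:, 1:] * log_probs).sum(dim=-1).mean()
```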

Question: Why not simply use policy gradient?

Overcoming mode collapse

  • Preliminary experiments show that the LM loss prefers short sentences, since the loss is summed over tokens.
  • Two tricks are applied (a small sketch follows the list):
    • Normalize the loss by sentence length.
    • Fix the length of the generated sentence to be the same as that of the input sentence.
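
A small sketch of the length-normalization trick, under my own assumptions about shapes: per-token LM losses are averaged over the true sentence length rather than summed, so shorter sentences no longer get trivially lower loss. The second trick is just a decoding-time constraint: unroll the decoder for exactly as many steps as the input sentence has tokens.

```python
import torch

def length_normalized_loss(token_nll, lengths):
    """token_nll: (B, T) per-token negative log-likelihoods (padded positions ignored).
    lengths: (B,) true sentence lengths."""
    mask = (torch.arange(token_nll.size(1))[None, :] < lengths[:, None]).float()
    per_sentence = (token_nll * mask).sum(dim=1) / lengths.clamp(min=1).float()
    return per_sentence.mean()
```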
