
Unsupervised Data Augmentation for Consistency Training


Metadata

  • Authors: Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, Quoc V. Le
  • Organization: Google Brain & CMU
  • Paper: https://arxiv.org/abs/1904.12848
  • Conference: It seems like a submission to NeurIPS 2019.
  • Code: https://github.com/google-research/uda (They didn't release the algorithm...)


TL;DR

  • Supervised data augmentation: existing data augmentation methods for labeled data give a steady but limited performance boost, because the labeled set is usually small.
  • Unsupervised data augmentation (UDA):
    • Designs data augmentation for unlabeled data, which is usually far more abundant than labeled data.
    • Consistency loss: minimize the KL divergence between the predicted distributions on an unlabeled example and on an augmented version of it.
    • Consistency/smoothness enforcing: UDA smooths the input/hidden space so that the model becomes more robust.
    • Total loss: supervised loss + consistency loss (a minimal sketch follows this list).
    • Allows label information to propagate from labeled data to unlabeled data.
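
A minimal sketch of the combined objective, assuming a PyTorch-style classifier. This is a reconstruction from the description above, not the authors' released code; `model`, `augment`, and the weight `lambda_u` are hypothetical names.

```python
import torch
import torch.nn.functional as F

def uda_loss(model, x_l, y_l, x_u, augment, lambda_u=1.0):
    # Supervised cross-entropy on the labeled batch.
    sup_loss = F.cross_entropy(model(x_l), y_l)

    # Prediction on the original unlabeled example, treated as a fixed target.
    with torch.no_grad():
        p_orig = F.softmax(model(x_u), dim=-1)

    # Prediction on an augmented version of the same example.
    log_p_aug = F.log_softmax(model(augment(x_u)), dim=-1)

    # Consistency loss: KL(p(y|x) || p(y|x_aug)), averaged over the batch.
    consistency = F.kl_div(log_p_aug, p_orig, reduction="batchmean")

    return sup_loss + lambda_u * consistency
```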

Training Techniques

  • Propose Training Signal Annealing (TSA) to prevent overfitting the supervised loss on small labeled data: gradually release the supervised training signal during training with a log/linear/exp threshold schedule (exp is recommended for very limited labeled data); see the threshold sketch after this list.
  • Using targeted data augmentation (e.g., AutoAugment) gives a significant improvement over untargeted augmentations.
  • Diverse and valid augmentations that inject targeted inductive biases are key, but there is a tradeoff when generating text, e.g., a more diverse paraphrase may no longer be a valid sentence.
  • Propose (1) confidence-based masking, (2) entropy minimization, and (3) softmax temperature control to sharpen the predictions on unlabeled data (over-flat predictions would make the consistency loss useless). (1)+(3) is the most effective combination; a sketch follows this list.
  • Propose domain-relevance data filtering to address the class-distribution mismatch of out-of-domain unlabeled data: train an in-domain baseline model, predict on the unlabeled data, and keep the examples the model is most confident about, distributed equally among classes; see the filtering sketch after this list.

    How to apply it in a regression problem?
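
A sketch of the TSA threshold schedules and the masked supervised loss. This is my reconstruction; the scale constant 5 in the log/exp schedules is taken from my reading of the paper, and `num_classes` is the number of classes K.

```python
import math
import torch
import torch.nn.functional as F

def tsa_threshold(step, total_steps, num_classes, schedule="exp"):
    # alpha grows from ~0 to 1 over training; the constant 5 reflects my
    # reading of the paper's log/exp schedules.
    t = step / total_steps
    if schedule == "linear":
        alpha = t
    elif schedule == "log":
        alpha = 1.0 - math.exp(-t * 5)
    else:  # "exp", recommended when labeled data is very limited
        alpha = math.exp((t - 1.0) * 5)
    # Threshold eta_t moves from 1/K (chance level) up to 1.
    return alpha * (1.0 - 1.0 / num_classes) + 1.0 / num_classes

def tsa_cross_entropy(logits, labels, eta):
    # Drop labeled examples the model already predicts with probability > eta,
    # so easy examples stop contributing to the supervised gradient.
    probs = F.softmax(logits, dim=-1)
    correct_prob = probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    mask = (correct_prob < eta).float()
    losses = F.cross_entropy(logits, labels, reduction="none")
    return (losses * mask).sum() / mask.sum().clamp(min=1.0)
```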
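
A sketch of how confidence-based masking and a softmax temperature could be combined to sharpen the unlabeled-data targets. Again a reconstruction, not the released code; the threshold `beta` and temperature `tau` are assumed hyperparameters.

```python
import torch
import torch.nn.functional as F

def sharpened_consistency_loss(model, x_u, x_aug, beta=0.8, tau=0.4):
    # Predict on the original unlabeled example without back-propagating
    # through it; sharpen the target with a low softmax temperature.
    with torch.no_grad():
        logits = model(x_u)
        target = F.softmax(logits / tau, dim=-1)
        # Confidence-based masking: keep only examples whose (un-sharpened)
        # max class probability exceeds beta.
        confident = F.softmax(logits, dim=-1).max(dim=-1).values > beta

    log_p_aug = F.log_softmax(model(x_aug), dim=-1)
    per_example_kl = F.kl_div(log_p_aug, target, reduction="none").sum(dim=-1)
    mask = confident.float()
    return (per_example_kl * mask).sum() / mask.sum().clamp(min=1.0)
```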
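
A sketch of the domain-relevance data filtering step (also a reconstruction; `per_class_budget` is an assumed knob for how many examples to keep per class).

```python
import numpy as np

def filter_by_domain_relevance(probs, per_class_budget):
    """Given softmax probabilities from an in-domain baseline model on
    out-of-domain unlabeled data (shape [N, K]), keep the examples the
    model is most confident about, split equally among the K classes."""
    preds = probs.argmax(axis=1)
    confidence = probs.max(axis=1)
    keep = []
    for c in range(probs.shape[1]):
        idx = np.where(preds == c)[0]
        # Most confident examples predicted as class c.
        top = idx[np.argsort(-confidence[idx])][:per_class_budget]
        keep.extend(top.tolist())
    return sorted(keep)
```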

Results

  • 2.7% error rate (w/ 4,000 labeled examples) on CIFAR-10, nearly matching full-dataset performance.
  • 2.85% error rate (w/ 250 labeled examples) on SVHN, nearly matching full-dataset performance.
  • 4.2% error rate (w/ 20 labeled examples) on IMDb text classification, outperforming the previous SotA model trained on 25,000 labeled examples.
  • Improves ImageNet top-1/top-5 accuracy from 55.1%/77.3% to 68.7%/88.5% (w/ 10% of the labeled data).
  • Improves ImageNet top-1/top-5 accuracy from 78.3%/94.4% to 79.0%/94.5% (w/ the full labeled data + 1.3M extra unlabeled examples).

Notable Related Work
