Unsupervised Data Augmentation for Consistency Training
Metadata
- Authors: Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, Quoc V. Le
- Organization: Google Brain & CMU
- Paper: https://arxiv.org/abs/1904.12848
- Conference: Appears to be a NeurIPS 2019 submission.
- Code: https://github.com/google-research/uda (They didn't release the algorithm...)
TL;DR

- Supervised data augmentation: existing data augmentation methods apply only to labeled data, providing a steady but limited performance boost since labeled data is usually scarce.
- Unsupervised data augmentation (UDA):
  - Designs data augmentation for unlabeled data, which is usually far more plentiful than labeled data.
  - Consistency loss: minimize the KL divergence between the predicted distributions on an unlabeled example and on an augmented version of it.
  - Consistency/smoothness enforcing: UDA smooths the input/hidden space so that the model becomes more robust.
  - Total loss: supervised loss + consistency loss.
  - Allows label information to propagate from labeled data to unlabeled data.
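The combined loss above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's released implementation; the function names, the `lam` weight, and the use of plain logit arrays are my assumptions.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) per example, summed over classes."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def uda_loss(sup_logits, sup_labels, unsup_logits, aug_logits, lam=1.0):
    """Total loss = supervised cross-entropy + lam * consistency loss.

    The consistency term is the KL divergence between the prediction on
    the original unlabeled example (the target) and the prediction on
    its augmented version. In a real implementation no gradient flows
    through the target distribution.
    """
    # Supervised cross-entropy on labeled examples.
    sup_probs = softmax(sup_logits)
    n = len(sup_labels)
    ce = -np.log(sup_probs[np.arange(n), sup_labels] + 1e-12).mean()
    # Consistency loss on unlabeled examples.
    target = softmax(unsup_logits)
    pred = softmax(aug_logits)
    consistency = kl_divergence(target, pred).mean()
    return ce + lam * consistency
```

When the augmented prediction matches the original, the consistency term vanishes and only the supervised loss remains; the further the augmented prediction drifts, the larger the total loss.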
Training Techniques
- Propose Training Signal Annealing (TSA) to prevent overfitting on small labeled sets: gradually release the supervised training signal during training using a log/linear/exp schedule (exp is recommended when labeled data is very limited).
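The three TSA schedules can be sketched as threshold functions: a labeled example whose correct-class probability already exceeds the threshold is dropped from the supervised loss at that step. The scaling constant 5 in the log/exp schedules follows the paper; the function name is my own.

```python
import math

def tsa_threshold(step, total_steps, num_classes, schedule="linear"):
    """Training Signal Annealing threshold eta_t.

    The threshold grows from 1/K to 1 over training. The exp schedule
    releases the supervised signal slowest early on, which suits very
    small labeled sets; the log schedule releases it fastest.
    """
    t = step / total_steps
    if schedule == "linear":
        alpha = t
    elif schedule == "log":
        alpha = 1.0 - math.exp(-t * 5.0)
    elif schedule == "exp":
        alpha = math.exp((t - 1.0) * 5.0)
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    # Map alpha in [0, 1] onto the threshold range [1/K, 1].
    return alpha * (1.0 - 1.0 / num_classes) + 1.0 / num_classes
```

At mid-training the exp threshold sits below the linear one, which in turn sits below the log one, so the exp schedule keeps masking confident examples the longest.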
- Using targeted data augmentation (e.g. AutoAugment) gives a significant improvement over untargeted augmentations.
  - Diverse and valid augmentations that inject targeted inductive biases are key, but there are trade-offs when generating text, e.g., more diverse text may no longer be a valid sentence.
- Propose three ways to sharpen the predictions on unlabeled data, preventing them from becoming over-flat and rendering the consistency loss useless: (1) confidence-based masking; (2) entropy minimization; (3) softmax temperature control. The combination (1)+(3) is the most effective.
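The most effective combination, (1) confidence-based masking plus (3) temperature control, can be sketched as below. The specific temperature and threshold values, and the function name, are illustrative assumptions.

```python
import numpy as np

def sharpen_and_mask(unsup_logits, temperature=0.4, conf_threshold=0.8):
    """Sharpen unlabeled-example targets and mask low-confidence ones.

    A softmax temperature below 1 makes the target distribution peakier;
    confidence-based masking drops examples whose maximum predicted
    probability falls under the threshold, so only confident predictions
    contribute to the consistency loss.
    """
    # Temperature-scaled softmax (T < 1 sharpens the distribution).
    z = unsup_logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    sharpened = e / e.sum(axis=-1, keepdims=True)
    # Confidence mask computed from the unscaled predictions.
    z0 = unsup_logits - unsup_logits.max(axis=-1, keepdims=True)
    e0 = np.exp(z0)
    probs = e0 / e0.sum(axis=-1, keepdims=True)
    mask = probs.max(axis=-1) >= conf_threshold
    return sharpened, mask
```

A confident prediction both survives the mask and becomes an even sharper target, while a near-uniform (over-flat) prediction is simply excluded from the consistency loss.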
- Propose Domain-relevance Data Filtering to address the class-distribution mismatch of out-of-domain unlabeled data: train an in-domain baseline model, run it on the unlabeled data, and keep the examples it is most confident about, equally distributed among classes.
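The filtering step can be sketched as a top-k selection per predicted class, assuming the baseline model's class probabilities are already computed; the function name and array layout are my own.

```python
import numpy as np

def filter_by_domain_relevance(probs, per_class_count):
    """Domain-relevance data filtering (sketch).

    `probs` holds an in-domain baseline model's predicted class
    probabilities for out-of-domain unlabeled examples, one row per
    example. For each class, keep the `per_class_count` examples the
    model is most confident about, so the kept set is equally
    distributed among classes.
    """
    predicted = probs.argmax(axis=-1)
    confidence = probs.max(axis=-1)
    kept = []
    for c in range(probs.shape[1]):
        idx = np.where(predicted == c)[0]
        # Most confident examples of class c first.
        idx = idx[np.argsort(-confidence[idx])][:per_class_count]
        kept.extend(idx.tolist())
    return sorted(kept)
```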
Open question: how to apply this to regression problems?
Results
- 2.7% error rate (w/ 4,000 labeled examples) on CIFAR-10, nearly matching full-dataset performance.
- 2.85% error rate (w/ 250 labeled examples) on SVHN, nearly matching full-dataset performance.
- 4.2% error rate (w/ 20 labeled examples) on IMDb text classification, outperforming the SoTA model trained on 25,000 labeled examples.
- Improves ImageNet top-1/top-5 accuracy from 55.1%/77.3% to 68.7%/88.5% (w/ 10% of the labeled data).
- Improves ImageNet top-1/top-5 accuracy from 78.3%/94.4% to 79.0%/94.5% (w/ full labeled data + 1.3M extra unlabeled examples).
Notable Related Work
- mixup: Beyond Empirical Risk Minimization by MIT & FAIR (ICLR 2018): augments data by linearly interpolating pairs of examples and their labels.
- MixMatch: A Holistic Approach to Semi-Supervised Learning by Google Research (2019/05): A concurrent work that unifies several prior works on semi-supervised learning.