Few-Shot Unsupervised Image-to-Image Translation
Metadata
- Authors: Ming-Yu Liu, Xun Huang, Arun Mallya, Tero Karras, Timo Aila, Jaakko Lehtinen, Jan Kautz
- Organization: NVIDIA, Cornell University, Aalto University
- Paper: https://arxiv.org/pdf/1905.01723.pdf
- Code: https://github.com/NVlabs/FUNIT
- Video: https://youtu.be/kgPAqsC8PLM
- Demo: https://nvlabs.github.io/FUNIT/petswap.html
- Demo video: https://youtu.be/JTu-U0C4xEU
Abstract
- Current unsupervised/unpaired image-to-image translation (UIT) methods (see the method list below) typically require many images of both the source and target classes, which greatly limits their applicability.
- This paper proposes a novel framework that needs only a few target-class examples (few-shot) and works even on target classes unseen during training.
- The proposed framework can also be applied to few-shot image classification, where it outperforms a SoTA method based on feature hallucination.
Method Overview

- Motivation: Humans can imagine unseen target classes (e.g., seeing a standing tiger for the first time and imagining it lying down) by drawing on past visual experience (e.g., having seen other animals standing and lying down).
- Past visual experience: Learn from images of many different classes.
- Imagining unseen classes: Translate images from a source class to a target class given only a few examples of the target class.
- Data: Source-class images: many source classes, each containing many images (e.g., different animal species).
- Training: Use the source-class images to train a multi-class UIT model (during training, target classes are still drawn from the source classes); see the episode-sampling sketch after this list.
- Inference: A few images of the (seen or unseen) target class become accessible only at inference time.
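A minimal sketch of how a training episode might be sampled under the setup above (a hypothetical helper; the names and the distinct-class assumption for the translation branch are mine, not the paper's):

```python
import random

def sample_episode(images_by_class, K):
    """Sample one training episode from the source classes only:
    a content image from class c_x and K class (style) images from c_y.
    `images_by_class` maps a class id to a list of images (assumed)."""
    c_x, c_y = random.sample(list(images_by_class), 2)  # two distinct source classes
    x = random.choice(images_by_class[c_x])             # content image
    ys = random.sample(images_by_class[c_y], K)         # K-shot class images
    return x, ys, c_x, c_y
```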
Model
- x̄ = G(x, {y_1, ..., y_K}): A conditional few-shot image generator (translator) that takes a content image x and a 1-way (one class), K-shot set of class images {y_1, ..., y_K} as input, and generates the output image x̄.
- z_x = E_x(x): A content encoder maps the content image x to a content latent code z_x.
- z_y = E_y({y_1, ..., y_K}): A class (style) encoder maps each of {y_1, ..., y_K} to a latent vector individually and averages them into a single class latent code z_y.
- x̄ = F_x(z_x, z_y): A decoder consisting of several adaptive instance normalization (AdaIN) residual blocks followed by several upscaling convolutional layers.
- By feeding z_y to the decoder via the AdaIN layers, the class images control the global look (style) while the local structure (content) of the content image is maintained. A generator sketch follows this list.
- The generalization capability depends on the number of source classes seen during training (more is better).
- D: A multi-task adversarial discriminator (see Training below).
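A minimal PyTorch-style sketch of the generator structure above (the encoder/decoder internals, layer counts, and dimensions are assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance normalization: normalize the content feature's
    per-channel statistics, then scale/shift with parameters predicted
    from the class latent code z_y."""
    def __init__(self, num_features, style_dim):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_features, affine=False)
        self.affine = nn.Linear(style_dim, 2 * num_features)

    def forward(self, x, z_y):
        gamma, beta = self.affine(z_y).chunk(2, dim=1)
        return (1 + gamma[:, :, None, None]) * self.norm(x) + beta[:, :, None, None]

class FewShotGenerator(nn.Module):
    """x_bar = F_x(E_x(x), mean_k E_y(y_k)), as described above."""
    def __init__(self, content_encoder, class_encoder, decoder):
        super().__init__()
        self.E_x = content_encoder  # image -> spatial content code z_x
        self.E_y = class_encoder    # image -> class code vector
        self.F_x = decoder          # AdaIN residual blocks + upscale convs

    def forward(self, x, ys):
        z_x = self.E_x(x)
        # Encode each of the K class images, then average into a single z_y.
        z_y = torch.stack([self.E_y(y) for y in ys], dim=0).mean(dim=0)
        return self.F_x(z_x, z_y)
```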
Training
- |S|: Number of source classes.
- For D, each task is deciding whether an input image is a real or fake image of one particular source class. As there are |S| source classes, D has |S| binary outputs.
- Given a real image x of source class c_x, penalize D if its c_x-th output says fake; there is no penalty for saying fake on the other |S|-1 outputs.
- Given a fake (translated) image x̄ with target class c_y (also a source class during training), penalize D if its c_y-th output says real; conversely, penalize G if that output says fake. The loss sketch after this list illustrates this masking.
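A sketch of the class-masked adversarial losses, assuming D returns a (batch, |S|) logit matrix and using a hinge formulation (an assumption; the paper's exact objective and D's output shape may differ):

```python
import torch
import torch.nn.functional as F

def class_logit(d_out: torch.Tensor, cls: torch.Tensor) -> torch.Tensor:
    """Pick each sample's own class column from D's (B, |S|) output;
    the remaining |S|-1 class outputs receive no penalty."""
    return d_out.gather(1, cls[:, None]).squeeze(1)

def d_loss(d_real, d_fake, c_x, c_y):
    # Real image of class c_x should score high on output c_x;
    # translated image of class c_y should score low on output c_y.
    real_term = F.relu(1.0 - class_logit(d_real, c_x)).mean()
    fake_term = F.relu(1.0 + class_logit(d_fake, c_y)).mean()
    return real_term + fake_term

def g_loss(d_fake, c_y):
    # G is penalized when D's c_y-th output says the translation is fake.
    return -class_logit(d_fake, c_y).mean()
```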
Losses
- Overall loss: the GAN loss plus weighted reconstruction and feature matching terms (a LaTeX reconstruction follows this list).
- GAN loss: the multi-task adversarial loss described above.
- Reconstruction loss: encourages the output to stay similar in content to the source-class image; G should reproduce x when x itself serves as the class image.
- Feature matching loss: encourages the style of the output to be similar to the images of the target class.
- D_f is the feature extractor of the discriminator D, i.e., D without its last layer.
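The equations were embedded as images in the original note; the following is a reconstruction from the descriptions above (the exact signs, norms, and weights λ_R, λ_FM are assumptions to be checked against the paper):

```latex
% Overall objective: GAN loss plus weighted reconstruction and feature matching terms
\min_{D}\max_{G}\;
  \mathcal{L}_{\mathrm{GAN}}(D,G)
  + \lambda_{R}\,\mathcal{L}_{R}(G)
  + \lambda_{FM}\,\mathcal{L}_{FM}(G)

% Reconstruction: G should reproduce x when x serves as its own class image
\mathcal{L}_{R}(G) =
  \mathbb{E}_{x}\!\left[ \left\lVert x - G(x,\{x\}) \right\rVert_{1} \right]

% Feature matching: discriminator features of the translation should match the
% averaged features of the K class images (D_f = D without its last layer)
\mathcal{L}_{FM}(G) =
  \mathbb{E}\!\left[ \left\lVert D_{f}(\bar{x})
    - \tfrac{1}{K}\sum_{k} D_{f}(y_{k}) \right\rVert_{1} \right]
```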
UIT Methods with Different Constraints (each enforces the translation to preserve certain properties)
- Pixel values
- Pixel gradients
- Semantic features
- Unsupervised cross-domain image generation. ICLR 2017.
- Class labels
- Pairwise sample distances
- One-sided unsupervised domain mapping. NIPS 2017.
- Cycle consistency
- DualGAN: Unsupervised dual learning for image-to-image translation. ICCV 2017.
- Unpaired image-to-image translation using cycle-consistent adversarial networks. ICCV 2017.
- Learning to discover cross-domain relations with generative adversarial networks. ICML 2017.
- Augmented CycleGAN: Learning many-to-many mappings from unpaired data. ICML 2018.
- Toward multimodal image-to-image translation. NIPS 2017.
- Shared latent space assumption
- Coupled generative adversarial networks. NIPS 2016.
- Unsupervised image-to-image translation networks. NIPS 2017.
- Partially-shared latent space assumption
- Multimodal unsupervised image-to-image translation (MUNIT). ECCV 2018.
- Diverse image-to-image translation via disentangled representation. ECCV 2018.
- This work.
Related Work
- One-shot unsupervised cross-domain translation. NIPS 2018: Assumes one source-class image but many target-class images.