
Interpolation between 2 faces in dlatent space not as meaningful as it is in qlatent space

Open stas-sl opened this issue 6 years ago • 26 comments

Hi!

First, thanks for your work!

I tried to interpolate between 2 faces in the dlatent space (18, 512) and the result seems to be not as meaningful as when interpolating between 2 vectors in the qlatent space (512). It kind of works, but some transient images contain strange artifacts or do not look like very valid faces. Did you notice this effect? It seems like not all points along the linear path in the dlatent space correspond to real faces, though in the qlatent space they do.

Just wondering if it is possible to somehow get latent representations in the original qlatent space to compare interpolation quality.

stas-sl avatar Feb 21 '19 11:02 stas-sl

I tried to interpolate between 2 faces in the dlatent space (18, 512) and the result seems to be not as meaningful as when interpolating between 2 vectors in the qlatent space (512). It kind of works, but some transient images contain strange artifacts or do not look like very valid faces. Did you notice this effect? It seems like not all points along the linear path in the dlatent space correspond to real faces, though in the qlatent space they do.

Hi @stas-sl ! Actually I was able to interpolate.

import numpy as np

person_a = ...  # dlatent for person A, shape (18, 512)
person_b = ...  # dlatent for person B, shape (18, 512)
for c in np.linspace(0, 1, 50):
    generate_image(c * person_a + (1 - c) * person_b)

Result: https://giphy.com/gifs/trump-hillary-stylegan-oNPDt7n8KkBlct1SA0

Just wondering if it is possible to somehow get latent representations in the original qlatent space to compare interpolation quality.

Yep, that's possible and it works but a lot of details are lost in this case.

For now I'm working on a better approach for learning more meaningful latent vectors by using some regularization tricks, which are somewhat related to the truncation trick. I'm going to commit it this weekend.

Puzer avatar Feb 21 '19 13:02 Puzer

I did a couple of experiments to compare interpolation in different spaces.

First, I used random qlatent vectors and the dlatent vectors corresponding to them, obtained via the mapping network.

qlatent1 = np.random.randn(512)[None, :]
qlatent2 = np.random.randn(512)[None, :]
dlatent1 = Gs.components.mapping.run(qlatent1, None)
dlatent2 = Gs.components.mapping.run(qlatent2, None)

qlatents = np.vstack([(1 - i) * qlatent1 + i * qlatent2 for i in np.linspace(0, 1, 50)])
dlatents = np.vstack([(1 - i) * dlatent1 + i * dlatent2 for i in np.linspace(0, 1, 50)])
dqlatents = Gs.components.mapping.run(qlatents, None)

dimages = Gs.components.synthesis.run(dlatents)
dqimages = Gs.components.synthesis.run(dqlatents)
qimages = Gs.run(qlatents, None)
  1. the first (left) image is dimages, obtained via interpolation in the dlatent space (18, 512)
  2. the second (middle) image is dqimages - they are obtained via interpolation in the qlatent space (512), then calculating the corresponding dlatent matrix for each vector via the mapping network, and then passing it to the synthesis network
  3. the third (right) image is qimages - they are obtained via a single run of the whole network, interpolating in the qlatent space

Example 1

Example 2

Obviously there is a difference, especially between image 1 and images 2/3. In the first image (interpolating in the dlatent space) the transition seems to be more straightforward, while in images 2/3 you can sometimes get some other person in the middle of the interpolation. I tried different random vectors and it looks like both ways (interpolating in the qlatent or dlatent space) produce quite meaningful faces along the way, though the path may differ.

Another experiment that I did was interpolating between dlatents obtained from images via optimization:

dlatent1 = ...  # (18, 512) matrix obtained via optimization from an image
dlatent2 = ...  # (18, 512) matrix obtained via optimization from another image

dlatents = np.array([(1 - i) * dlatent1 + i * dlatent2 for i in np.linspace(0, 1, 50)])
images = Gs.components.synthesis.run(dlatents)

The results:

Example 3

Example 4

Example 5

Of course it is rather subjective and depends on the concrete source and target images, and this often produces quite reasonable interpolations, but the examples above seem to me a bit artificial in the middle of the interpolation. Actually it is hard to say whether the reason is interpolation in the dlatent space rather than the qlatent space, or the way those dlatents were obtained, or maybe I'm just nitpicking :)

stas-sl avatar Feb 23 '19 07:02 stas-sl

Hi stas-sl, would you like to share the code for obtaining the matrix via optimization from an image?

Thanks

JunaidAsghar avatar Feb 28 '19 11:02 JunaidAsghar

@JunaidAsghar, I actually used the encode_images.py script as described in the readme

stas-sl avatar Feb 28 '19 12:02 stas-sl

@stas-sl thanks for the quick response. Do you have any idea how to train the perceptual model once, rather than every time for each image?

JunaidAsghar avatar Feb 28 '19 12:02 JunaidAsghar

Only what is written here https://www.reddit.com/r/MachineLearning/comments/anzi1t/d_stylegan_but_in_reverse_is_it_possible/

Some say you might try to train an encoder, while others say that it will not work very well.

stas-sl avatar Feb 28 '19 13:02 stas-sl

@stas-sl

Inspired by this, I trained a model (a slightly modified resnet50) to infer high-scale latent space features from a portrait photo, training the model on thousands of universally unique image-dlatent pairs. This approach may work on the mid and low scale features as well, but I haven't tested it yet. It doesn't yield the same detail as @Puzer's awesome input optimization trick, but the model outputs vectors that land safely in the dense parts of the latent space, making interpolations more stable. It performs very well for me in transferring face position from a video in real-time. The detection and alignment bit is actually the performance bottleneck that I'm working on now. Here's a video: https://twitter.com/calamardh/status/1102441840752713729

Maybe this approach could be used alongside input optimization for faster results.

  • while I'm here, I'd like to share some other cool results I've found thanks to this repo, rotating old photos and artwork, for instance: https://twitter.com/calamardh/status/1100651381034233862

gradient-dissenter avatar Mar 04 '19 05:03 gradient-dissenter

@gradient-dissenter @stas-sl @tals @sam598 Thanks for your meaningful comments!

My current status:

  1. I'm actually playing with training an actual encoder which can predict the dlatent (without the optimization trick) - I have two models for now, ResNet50 and MobileNetV2, which perform relatively similarly.
  2. Further improvement of the dlatent optimization - first of all, we can initialize the dlatent using the prediction from the model in 1). Moreover, we can do a more clever trick and use L2 regularization to keep the optimized dlatent vector close to the dlatent predicted in 1). It acts like the truncation trick, but gives more meaningful results (see the sketch after this list).
  3. The optimization process itself was also improved. I've changed the optimizer to Adam and use LR schedules. Good-looking results can now be obtained after ~3 sec of optimization (2080 Ti).
  4. Useful comment from @tals that the dlatents produced by the mapping network are actually the same for the different layers. Now I'm trying to train the encoder from 1) using mixed dlatents - I suppose it can give even better results.
  5. I also fixed the memory leak issue which @sam598 pointed out, thanks!
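A rough sketch of the idea in 2), not Puzer's actual implementation: start the optimization from an encoder prediction and add an L2 term that keeps the optimized dlatent close to that prediction. The helpers encoder_predict, synthesize and perceptual_loss are hypothetical placeholders for whatever encoder, synthesis and loss functions are used; the weight and iteration count are illustrative only.

import tensorflow as tf

dlatent_init = encoder_predict(target_image)        # hypothetical encoder prediction, shape (18, 512)
dlatent_var = tf.Variable(dlatent_init, dtype=tf.float32)

generated = synthesize(dlatent_var)                 # hypothetical differentiable synthesis pass
loss = perceptual_loss(generated, target_image)     # hypothetical perceptual/pixel loss
loss += 1e-3 * tf.reduce_mean(tf.square(dlatent_var - dlatent_init))  # pull toward the prediction

step = tf.train.AdamOptimizer(learning_rate=0.01).minimize(loss, var_list=[dlatent_var])
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # standalone sketch; a real script would only init the new variables
    for _ in range(200):
        sess.run(step)
    dlatent_result = sess.run(dlatent_var)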

Unfortunately I don't have much time for now, but I expect to polish everything up and publish everything this week.

What can really help, but I don't have capacity for now to do so:

  1. Somehow obtain generated images from lower-resolution lods (256/512) - I expect that it could significantly reduce optimization time.
  2. Disentangled latent directions, based on the great TL-GAN research
  3. More meaningful interpolations, based on the "Latent space oddity: on the curvature of deep generative models" research

Puzer avatar Mar 04 '19 11:03 Puzer

  1. I'm actually playing with training an actual encoder which can predict the dlatent (without the optimization trick) - I have two models for now, ResNet50 and MobileNetV2, which perform relatively similarly.

Does this work similarly to the feed-forward style transfer nets? I've been thinking of trying this out, since the optimization-based approach worked well and the problems are similar.

Disentangled latent directions, based on the great TL-GAN research

Are you looking at this through the prism of finding the latent of a given picture, or finding "interesting" latent directions (facial hair, gender etc)?

Their general approach is so similar to yours! The disentanglement technique would help with the first use case, but not sure how it would help with the latter.
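As an aside, one simple way to find such "interesting" directions (not necessarily how TL-GAN does it) is to fit a linear classifier on dlatents labelled with an attribute and use its weight vector as the direction; a minimal sketch, where dlatents and labels are assumed to come from a set of encoded, annotated faces:

from sklearn.linear_model import LogisticRegression

# dlatents: (n, 18, 512) array of encoded faces; labels: (n,) binary attribute (e.g. smiling or not)
clf = LogisticRegression()
clf.fit(dlatents.reshape((len(dlatents), -1)), labels)
direction = clf.coef_.reshape((18, 512))  # add/subtract multiples of this to edit the attribute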

tals avatar Mar 06 '19 08:03 tals

@Puzer any chance that push is still coming?

jcpeterson avatar Mar 18 '19 03:03 jcpeterson

@stas-sl

Inspired by this, I trained a model (a slightly modified resnet50) to infer high-scale latent space features from a portrait photo, training the model on thousands of universally unique image-dlatent pairs. This approach may work on the mid and low scale features as well.

Can you share the modified ResNet50 model? I am not able to generate the image-dlatent pairs with a Gaussian distribution.

kohatkk avatar Mar 20 '19 12:03 kohatkk

Hi @Puzer, thank you for this great repo! Do you plan to publish the work you mentioned in this thread soon?

@kohatkk here is some code to finetune a ResNet

import os
import numpy as np
import pickle
import cv2

import dnnlib
import config
import dnnlib.tflib as tflib

from keras.applications.resnet50 import ResNet50
from keras.applications.imagenet_utils import preprocess_input
from keras.layers import Dense
from keras.models import Sequential, load_model


def load_Gs():
    tflib.init_tf()
    with dnnlib.util.open_url(config.url_ffhq, cache_dir=config.cache_dir) as f:
        _, _, Gs = pickle.load(f)
    return Gs

def finetune_resnet(save_path, image_size=224, batch_size=10000, test_size=1000, n_epochs=10, max_patience=5, seed=0):
    """
    Finetunes a resnet to predict W from X
    Generates batches (X, W) of size 'batch_size', iterates 'n_epochs' on each, and repeats until 'max_patience' is reached
    on the test set. The model is saved every time a new best test loss is reached.
    :param save_path: str, path to save the model. If already exists, the model will be finetuned.
    :param image_size: int
    :param batch_size: int
    :param test_size: int
    :param n_epochs: int
    :param max_patience: int
    :param seed: int
    :return: None
    """
    assert image_size >= 224

    # Create a test set
    print('Creating test set')
    np.random.seed(seed)
    W_test, X_test = generate_dataset(n=test_size, image_size=image_size)
    X_test = preprocess_input(X_test.astype('float'))

    # Build model
    if os.path.exists(save_path):
        print('Loading existing model')
        model = load_model(save_path)
    else:
        print('Building model')
        resnet = ResNet50(include_top=False, pooling='avg', input_shape=(image_size, image_size, 3))
        model = Sequential()
        model.add(resnet)
        model.add(Dense(512))
        model.compile(loss='mse', metrics=[], optimizer='adam')

    # Iterate on batches of size batch_size
    print('Training model')
    patience = 0
    best_loss = np.inf
    while patience <= max_patience:
        W_train, X_train = generate_dataset(n=batch_size, image_size=image_size)  # not optimal as we reload Gs every time
        X_train = preprocess_input(X_train.astype('float'))
        model.fit(X_train, W_train, epochs=n_epochs, verbose=True)
        loss = model.evaluate(X_test, W_test)
        if loss < best_loss:
            print('New best test loss : {:.5f}'.format(loss))
            model.save(save_path)
            patience = 0
            best_loss = loss
        else:
            patience += 1


if __name__ == '__main__':
    # Finetune the resnet
    finetune_resnet('data/finetuned_resnet.h5', batch_size=10000, test_size=1000, max_patience=3, n_epochs=10)

SimJeg avatar May 08 '19 12:05 SimJeg

@SimJeg This looks really interesting; could you also post the code for your generate_dataset() function?

pbaylies avatar May 08 '19 13:05 pbaylies

Here it is !

It's quite quick and dirty as I reload Gs every time I generate a new batch. But time does not really matter here as it converges after a few batches (= a few hours). While it works perfectly for generated images, it does not really work for real-world images; the faces it generates are only somewhat similar, but they are a good starting point for optimization.


def generate_dataset(n=10000, save_path=None, seed=None, image_size=224, minibatch_size=8):
    """
    Generates a dataset of 'n' images of shape ('image_size', 'image_size', 3) with random seed 'seed'
    along with their dlatent vectors W of shape ('n', 512)

    These datasets can serve to train an inverse mapping from X to W as well as explore the latent space

    :param n: int
    :param image_size:  int
    :param seed: int
    :param save_path: str
    :return: numpy arrays of shape (n, 512) and shape (n, image_size, image_size, 3)
    """

    Gs = load_Gs()

    if seed is not None:
        Z = np.random.RandomState(seed).randn(n, Gs.input_shape[1])
    else:
        Z = np.random.randn(n, Gs.input_shape[1])
    W = Gs.components.mapping.run(Z, None, minibatch_size=minibatch_size)
    X = Gs.components.synthesis.run(W, randomize_noise=False, minibatch_size=minibatch_size, print_progress=True,
                                    output_transform=dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True))
    X = np.array([cv2.resize(x, (image_size, image_size)) for x in X])

    if save_path is not None:
        prefix = '_{}_{}'.format(seed, n)
        np.save(os.path.join(os.path.join(save_path, 'W' + prefix)), W[:, 0])
        np.save(os.path.join(os.path.join(save_path, 'X' + prefix)), X)

    return W[:, 0], X

SimJeg avatar May 08 '19 13:05 SimJeg

@SimJeg Thank you very much! Doesn't that take up a lot of memory, generating that many images at once?

pbaylies avatar May 08 '19 13:05 pbaylies

@SimJeg

it does not really work for real world images

Why do you think that is? Perhaps some random translations of the image by 5-10 pixels before cropping and resizing would help here?

Also, how did you use it as a starting point for optimization? Did you just run the generator.set_dlatents(d_latent) line before optimizing in the encode_images.py script? Can you post the change?

I'm starting to think we should start a fork or new repo at this point so we can all work on improvements at a faster pace. This repo is 3 months old.

@pbaylies I can only fit about 1,250 images into memory at once. A way around this is to load one meta-batch at a time of say 1000 images or so for training, using model.fit(X_train, W_train, epochs=1) in a loop, and evaluating every 10 meta-batches or so.
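A minimal sketch of that meta-batch loop, reusing the generate_dataset, model, X_test and W_test names from SimJeg's code above; the meta-batch count, size and evaluation interval are only illustrative:

for i in range(100):  # e.g. 100 meta-batches of 1000 images each
    W_train, X_train = generate_dataset(n=1000, image_size=224)
    X_train = preprocess_input(X_train.astype('float'))
    model.fit(X_train, W_train, epochs=1, verbose=True)
    if (i + 1) % 10 == 0:  # evaluate every 10 meta-batches
        loss = model.evaluate(X_test, W_test)
        print('Test loss after {} meta-batches: {:.5f}'.format(i + 1, loss))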

jcpeterson avatar May 08 '19 13:05 jcpeterson

Ok @SimJeg et al., playing with this over in Google Colab, here's what I've come up with so far -- https://drive.google.com/open?id=1bVk6AKchrNr3u9tv3SxsgttXNCspvF01

pbaylies avatar May 08 '19 22:05 pbaylies

Update: To answer my questions above, calling generator.set_dlatents(d_latent) before optimizing indeed works, and pixel shifting isn't needed as the approximate encodes work fine with out-of-sample images. Using this method and Adam I can get a decent encode in about 12 seconds.

jcpeterson avatar May 09 '19 02:05 jcpeterson

Ok, I'm happy with the performance of the encoding that I'm getting; it quickly converges to get the basics right, and then incrementally improves after that. Code follows.

ResNet StyleGAN Encoder

Much love to @Puzer and @SimJeg on GitHub for all their hard work on this; see:

https://github.com/Puzer/stylegan-encoder/

https://github.com/Puzer/stylegan-encoder/issues/1#issuecomment-490489772

EDIT: there were a few mistakes in this code, better just to go to my repo at this point, now that I have one: https://github.com/pbaylies/stylegan-encoder

pbaylies avatar May 13 '19 16:05 pbaylies

@SimJeg Have you considered using the perceptual loss function of the encoder for your feed-forward network instead of MSE? I expect it to be much slower to train, but it might result in significantly higher image quality.

I'd love to try it myself, but I don't see myself having the time to experiment with it in the near future. That's why I thought I'd share my idea here in case someone else might want to give it a shot.

Edit: "Perceptual Losses for Real-Time Style Transfer and Super-Resolution" (https://arxiv.org/abs/1603.08155) explains how this method can be used to create a feed-forward version of Gatys et al.'s famous Neural Style Transfer, which is also basically an optimization problem trying to minimize perceptual loss.

Vinno97 avatar May 14 '19 07:05 Vinno97

I've been playing with improving the encoder by updating the loss function, as well as using a pre-trained ResNet to provide a starting point for the dlatents; I'll see about forking / making a repo soon with my findings. Contributions welcome! One thing I noticed: adding an L1 loss on the dlatents themselves helps a lot, to keep them in roughly the same range as normal faces in the rest of the model.
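A minimal sketch of what such a regularization term could look like, assuming a TF1-style optimization where dlatent_var is the variable being optimized and dlatent_ref is some reference point (e.g. the model's average dlatent); the weight is an arbitrary illustration value:

import tensorflow as tf

def add_l1_dlatent_penalty(loss, dlatent_var, dlatent_ref, weight=0.1):
    # Pulls the optimized dlatents toward the reference to keep them in a 'normal' range
    return loss + weight * tf.reduce_mean(tf.abs(dlatent_var - dlatent_ref))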

pbaylies avatar May 15 '19 16:05 pbaylies

Don't have much time to work on this project but it's great to know you had some progress!

To answer a previous question: I noticed that faces recovered using gradient descent have dlatents w of size (18, 512) where the 18 vectors are not that correlated. It makes sense because, as shown in the paper, you can mix these 18 vectors to mix styles.

It would make sense to train a resnet to predict not only one vector of size 512 but all 18. I made a first try without success...

Changing the loss from mse(w_true, w_pred) to perceptual_loss(stylegan(w_true), stylegan(w_pred)) seems heavy but could be interesting, as perceptual loss has proved to be quite effective!

Good point for the L1 loss too! I don't know if you had a look at the dlatents distribution, but it looks like density(x) = distribution1 if x < 0 else distribution2, so we could indeed add some prior to match such distributions
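For the 18-vector idea, one possible change to the Keras model from the finetuning code above would be to predict the full (18, 512) matrix instead of a single 512-vector; this is only an illustration of the output-shape change, not a tested recipe, and the training targets would then need to be the full W (i.e. keep W instead of W[:, 0] in generate_dataset):

from keras.applications.resnet50 import ResNet50
from keras.layers import Dense, Reshape
from keras.models import Sequential

image_size = 224
resnet = ResNet50(include_top=False, pooling='avg', input_shape=(image_size, image_size, 3))
model = Sequential()
model.add(resnet)
model.add(Dense(18 * 512))     # one 512-vector per synthesis layer
model.add(Reshape((18, 512)))  # matches the full dlatent matrix
model.compile(loss='mse', optimizer='adam')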


SimJeg avatar May 16 '19 00:05 SimJeg

Hi @SimJeg -- I've started a fork, see here for the resnet training code! Currently I'm mixing up latent values and also using negative truncation for more balance and variation. Thanks for getting me started down this path!

EDIT: The repo is ready to go now, and I've added a link to a pre-trained resnet model as well: https://github.com/pbaylies/stylegan-encoder

pbaylies avatar May 16 '19 00:05 pbaylies

Hi @pbaylies, have you ever tried to generate higher resolution images such as 512x512 or 1024x1024? Can I adjust the image size in train_resnet.py from 256 to 512? I tried but failed; this may be caused by restoring the checkpoint from your shared pre-trained model.

I want to edit specific human faces at higher resolution, but the face generated by StyleGAN is mostly not the same as the original face. So I wonder if this is caused by the image encoder.

shartoo avatar Aug 06 '19 03:08 shartoo

Hi @shartoo feel free to raise issues on my repo as well; on the pretrained FFHQ model, images are always generated at 1024 anyhow; you can try training a ResNet from scratch with a different input dimension, that should be fine. In my experience, you can get both quicker and better results by sticking to 256 in the encoder; to do better, you might need a smarter loss function, or you might be running up against the limits of that model in StyleGAN.

pbaylies avatar Aug 06 '19 09:08 pbaylies

@Puzer

  3. The optimization process itself was also improved. I've changed the optimizer to Adam and use LR schedules. Good-looking results can now be obtained after ~3 sec of optimization (2080 Ti).

How do you use the LR schedules? Thanks.

jiesonshan avatar Mar 05 '20 09:03 jiesonshan