
Details in EG3D Inversion

Open oneThousand1000 opened this issue 2 years ago • 48 comments

I have released my EG3D inversion code for your reference; you can find it here: EG3D-projector.


Thanks for the impressive work!

As you mentioned in the paper, you use Pivotal Tuning Inversion (PTI) to invert test images. PTI fine-tunes the EG3D parameters around a pivot latent code obtained by optimization. The pivot latent code is a "w" or "w+" code; however, it is correlated with the camera parameters fed to the mapping network. Will novel-view synthesis be affected by this camera-fixed latent code?

I also noticed that you set a hyper-parameter entangle = 'camera' in gen_videos.py, so it seems you have already considered this issue when rendering different views for a specific latent code. I tried 'condition' and 'both'; in those modes the camera parameters fed to the mapping network only control some unrelated semantic attributes (expression, clothes, ...). I think the [zs, c] fed to the mapping network can be regarded as a latent code of shape [1, 512+25]. Does c influence the camera view of the subsequent synthesis?

I have now reproduced the PTI inversion of EG3D; please see the video below. I input the re-aligned 00000.png and its camera parameters (from dataset.json), optimize the latent code 'w', and use it as the pivot to fine-tune EG3D.
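For anyone comparing implementations, here is a minimal sketch of the second PTI stage as I understand it for EG3D: the pivot stays fixed and only the generator weights are tuned (the w projection itself follows the standard PTI projector, see further below). The loss weights, step count, and the lpips_fn helper are assumptions for illustration, not the official settings.

```python
import copy
import torch
import torch.nn.functional as F

# Assumed inputs: G (pretrained TriPlaneGenerator), w_pivot [1, num_ws, 512] from the
# projection step, cam [1, 25] from dataset.json, target [1, 3, 512, 512] in [-1, 1],
# and lpips_fn, an LPIPS distance module.
G_tuned = copy.deepcopy(G).train().requires_grad_(True)
opt = torch.optim.Adam(G_tuned.parameters(), lr=3e-4)

for step in range(350):                                   # a few hundred steps, as in PTI
    synth = G_tuned.synthesis(w_pivot, cam)['image']      # render the fixed pivot
    loss = lpips_fn(synth, target).mean() + F.mse_loss(synth, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```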

The result looks a little strange. I want to know if my implementation is consistent with yours!


I think I've figured out why the camera parameters that are input to the mapping network can't control the camera view; please refer to Generator Pose Conditioning.
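In other words, the pose passed to the mapping network is only a conditioning signal (generator pose conditioning); the viewpoint of the rendered image comes from the camera passed to synthesis. A minimal sketch of that split as I read gen_videos.py (c_cond and c_render are my names, not the official ones):

```python
# c_cond: camera used only to condition the mapping network (kept fixed, e.g. a roughly
# frontal pose); c_render: camera of the view you actually want to render.
ws = G.mapping(z, c_cond, truncation_psi=0.7)   # the "camera-fixed" latent code
img = G.synthesis(ws, c_render)['image']        # the viewpoint follows c_render only
```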

https://user-images.githubusercontent.com/32099648/173868809-4cd3fc8b-774b-4068-865e-358f60e19411.mp4

oneThousand1000 avatar Jun 15 '22 15:06 oneThousand1000

Is it possible to share the code for the PTI inversion?

cantonioupao avatar Jun 16 '22 10:06 cantonioupao

What about your results for the first step (inversion)? Do you just use the LPIPS loss, following PTI?

zhangqianhui avatar Jun 18 '22 16:06 zhangqianhui

What about your results for the first step (inversion)? Do you just use the LPIPS loss, following PTI?

Yes, my code is based on the w projector of PTI. It seems that the inversion works best on portraits that look straight ahead. I think I achieved the same performance as the authors'.
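For completeness, a minimal sketch of that first stage: only a single w is optimized, the camera parameters c from dataset.json stay fixed, and the loss is LPIPS as in the PTI w projector. The learning rate, step count, initialization, and the lpips_fn helper are placeholders rather than the exact PTI settings.

```python
import torch

# Assumed inputs: G (frozen TriPlaneGenerator), target [1, 3, 512, 512] in [-1, 1],
# c [1, 25] from dataset.json, lpips_fn = an LPIPS distance module.
with torch.no_grad():
    z_samples = torch.randn(10000, G.z_dim, device=c.device)
    ws_samples = G.mapping(z_samples, c.repeat(10000, 1))   # [10000, num_ws, 512]
    w_avg = ws_samples.mean(dim=0, keepdim=True)[:, :1]     # [1, 1, 512] mean-w init

w_opt = w_avg.clone().requires_grad_(True)                  # the pivot-to-be
opt = torch.optim.Adam([w_opt], lr=0.01)

for step in range(500):
    ws = w_opt.repeat(1, ws_samples.shape[1], 1)            # broadcast w over all layers
    synth = G.synthesis(ws, c)['image']
    loss = lpips_fn(synth, target).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```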

https://user-images.githubusercontent.com/32099648/174448309-4196294d-5263-4abf-a7ed-4a1e32ade48d.MOV

oneThousand1000 avatar Jun 18 '22 16:06 oneThousand1000

Ok, great!

zhangqianhui avatar Jun 19 '22 06:06 zhangqianhui

What about your results for the first step (inversion)? Do you just use the LPIPS loss, following PTI?

Yes, my code is based on the w projector of PTI. It seems that the inversion works best on portraits that look straight ahead. I think I achieved the same performance as the authors'.

IMG_0604.MOV

Hi,

I tried to invert this portrait, but it seems that the optimization can't recover a correct eyeglasses shape, even though it produces a reasonable result in the input view. I want to ask whether it is because you get the eyeglasses after the pivot optimization (which I can't achieve), so that this shape is preserved during the generator optimization? Or do you add other regularization?

Thank you for your time!

https://user-images.githubusercontent.com/46376580/174758112-52a91f58-3c74-4a2c-8d5a-e4a7c94aedfe.mp4

jiaxinxie97 avatar Jun 21 '22 08:06 jiaxinxie97

@jiaxinxie97 Hi jiaxinxie97, I used the original EG3D checkpoint to generate a video for the latent code, and it seems that the eyeglasses are reconstructed successfully, which indicates that I got the eyeglasses before the pivot optimization.

I think you can check your projector code; I used this one (both w and w_plus are OK). The zip file I uploaded contains the input re-aligned image and the input camera parameters, so you can check whether they are consistent with yours.

Video generated by the original EG3D checkpoint: https://user-images.githubusercontent.com/32099648/174772757-d316bc1d-de52-49a4-a863-6de166000450.mp4

input re-aligned image and the input camera parameters: 01457.zip

oneThousand1000 avatar Jun 21 '22 09:06 oneThousand1000

Thanks! I also use the PTI repo, but it is strange that I can't reconstruct the eyeglasses using w or w+ space optimization; I will check! Since the original EG3D checkpoint does not have named_buffers(), I removed the reg_loss. Will that affect the results?

jiaxinxie97 avatar Jun 21 '22 10:06 jiaxinxie97

Thanks! I also use the PTI repo, but it is strange that I can't reconstruct the eyeglasses using w or w+ space optimization; I will check! Since the original EG3D checkpoint does not have named_buffers(), I removed the reg_loss. Will that affect the results?

Hi, the noise buffers that the regularization needs live in the StyleGAN2 synthesis network, which you can find inside the StyleGAN2Backbone (self.backbone) of the TriPlaneGenerator.

Try to use G.backbone.synthesis.named_buffers() instead of G.named_buffers(), and add the reg_loss.
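For reference, a sketch of how the noise regularization from the StyleGAN2/PTI projector can be hooked up to EG3D's backbone as suggested above; this mirrors the original projector's multi-scale penalty rather than adding anything new.

```python
import torch
import torch.nn.functional as F

# The per-layer noise buffers live in the StyleGAN2 backbone of the TriPlaneGenerator.
noise_bufs = {name: buf for (name, buf) in G.backbone.synthesis.named_buffers()
              if 'noise_const' in name}

def noise_reg_loss(noise_bufs):
    # Multi-scale autocorrelation penalty, as in the StyleGAN2/PTI projectors.
    reg_loss = 0.0
    for v in noise_bufs.values():
        noise = v[None, None, :, :]                 # [1, 1, H, W]
        while True:
            reg_loss += (noise * torch.roll(noise, shifts=1, dims=3)).mean() ** 2
            reg_loss += (noise * torch.roll(noise, shifts=1, dims=2)).mean() ** 2
            if noise.shape[2] <= 8:
                break
            noise = F.avg_pool2d(noise, kernel_size=2)
    return reg_loss

# total_loss = reconstruction_loss + 1e5 * noise_reg_loss(noise_bufs)  # weight as in the projector
```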

oneThousand1000 avatar Jun 21 '22 10:06 oneThousand1000

Hi, thank you! Using G.backbone.synthesis.named_buffers(), I got a reasonable result for the eyeglasses.

jiaxinxie97 avatar Jun 21 '22 22:06 jiaxinxie97

Hi, @oneThousand1000

Did you set both the z and c as the trainable parameters during the GAN inversion? I guess fixing the c (which can be obtained from the dataset.json) and only inverting the z is more reasonable. What do you think?

e4s2022 avatar Jun 22 '22 06:06 e4s2022

Hi, @oneThousand1000

Did you set both the z and c as the trainable parameters during the GAN inversion? I guess fixing the c (which can be obtained from the dataset.json) and only inverting the z is more reasonable. What do you think?

I set w or w_plus as trainable parameters and fix the c.

oneThousand1000 avatar Jun 22 '22 06:06 oneThousand1000

Got it, thanks for your reply.

BTW, did you follow the FFHQ preprocessing steps in EG3D (i.e., re-align the in-the-wild images to 1500 and then resize to 512), or did you directly use the well-aligned 1024 FFHQ images and just resize them to 512?

e4s2022 avatar Jun 22 '22 06:06 e4s2022

Hi @oneThousand1000,

Do you have any out-of-domain results? I tried PTI myself with the FFHQ checkpoint; it works well on the joker image but fails on the CelebA-HQ dataset. (Attached images: celeba_out, joker_out.)

mlnyang avatar Jun 23 '22 03:06 mlnyang


Got it, thanks for your reply.

BTW, did you follow the FFHQ preprocessing steps in EG3D (i.e., re-align the in-the-wild images to 1500 and then resize to 512), or did you directly use the well-aligned 1024 FFHQ images and just resize them to 512?

I followed the FFHQ preprocessing steps in EG3D.

oneThousand1000 avatar Jun 23 '22 03:06 oneThousand1000

@mlnyang, I got similar results to yours.

I use the well-aligned & cropped FFHQ images (at 1024 resolution) and then resize them to 512 for the subsequent PTI inversion. To be more specific, say I choose "00999.png" as the input. Since the camera parameters (25 = 4x4 + 3x3) are provided in dataset.json, I use them directly. The camera parameters are fixed while the w latent code is trainable. The following are my results: image
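For anyone puzzled by the 25 numbers: as I understand the label format, they are the flattened 4x4 cam2world extrinsic matrix followed by the flattened 3x3 normalized intrinsics. A small sketch of unpacking one entry (camera_params is a hypothetical variable holding the 25-number list from dataset.json):

```python
import numpy as np

c = np.asarray(camera_params, dtype=np.float32)   # the 25 values from dataset.json
extrinsics = c[:16].reshape(4, 4)                 # cam2world pose matrix
intrinsics = c[16:].reshape(3, 3)                 # normalized intrinsics (focal ~4.26, principal point 0.5)
```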

However, when I follow the FFHQ preprocessing steps in EG3D, which basically consist of (1) aligning & cropping the in-the-wild image to size 1500, (2) re-aligning to 1024 & center-cropping to 700, and (3) resizing to 512, the results look good: image
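To make steps (2) and (3) concrete, here is a minimal sketch of the center crop and resize; step (1), the alignment itself, should be done with the repo's own preprocessing scripts, and the file names below are hypothetical.

```python
from PIL import Image

img = Image.open('realigned_1024.png')               # re-aligned 1024x1024 image from step (2)
left = (img.width - 700) // 2
top = (img.height - 700) // 2
img = img.crop((left, top, left + 700, top + 700))   # center crop to 700x700
img = img.resize((512, 512), Image.LANCZOS)          # final 512x512 EG3D input
img.save('eg3d_input_512.png')
```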

I guess the difference in the underlying preprocessing might be the reason. When you tried PTI on the joker image, how did you preprocess it?

e4s2022 avatar Jun 23 '22 03:06 e4s2022

Hi @bd20222, thanks for sharing your work.

I think that's the main reason. Actually, the joker image originally came from the PTI repo, so maybe it was already preprocessed with the FFHQ alignment.

mlnyang avatar Jun 23 '22 04:06 mlnyang

I took a look at the PTI alignment script; it seems to be the same as the original FFHQ one.

I inspected the EG3D preprocessing and compared it with the original FFHQ. AFAIK, there is no center-cropping step in the original FFHQ preprocessing, so you will find that the faces used in EG3D show some vertical translation. I guess the well-trained EG3D model has captured this pattern, which results in the blurry PTI inversion, and the subsequently synthesized novel views look like a mixture of two faces.

It's interesting, though, that the joker example works well.

e4s2022 avatar Jun 23 '22 04:06 e4s2022

Hi, please follow "Preparing datasets" in the README to get re-aligned images. According to https://github.com/NVlabs/eg3d/issues/16#issuecomment-1151563364, the original FFHQ images do not work with the camera parameters in dataset.json; you should predict the camera parameters for the original FFHQ images yourself.

oneThousand1000 avatar Jun 23 '22 04:06 oneThousand1000

@oneThousand1000

Yeah, I agree. For those who want to directly use the well-aligned 1024 FFHQ images, you have to predict the camera parameters yourself with Deep3DFace_pytorch. But I haven't tested this on the EG3D pre-trained model.

e4s2022 avatar Jun 23 '22 04:06 e4s2022

@oneThousand1000

Yeah, I agree. For those who want to directly use the well-aligned 1024 FFHQ images, you have to predict the camera parameters yourself with Deep3DFace_pytorch.

You can email the author and ask for the pose extraction code. Or refer to https://github.com/NVlabs/eg3d/issues/18

oneThousand1000 avatar Jun 23 '22 04:06 oneThousand1000

Oh I see... center cropping was the problem. I just tried other examples from PTI and they didn't work. It is strange that the joker image works well. Thanks for your help!! :)

mlnyang avatar Jun 23 '22 04:06 mlnyang

@oneThousand1000 Do you use the noise regularization loss in the first GAN inversion step?

zhangqianhui avatar Jun 23 '22 08:06 zhangqianhui

@oneThousand1000 Do you use the noise regularization loss in the first GAN inversion step?

See https://github.com/NVlabs/eg3d/issues/28#issuecomment-1161560077

oneThousand1000 avatar Jun 23 '22 08:06 oneThousand1000

Thanks

zhangqianhui avatar Jun 23 '22 10:06 zhangqianhui

@oneThousand1000 I still have one question: will the parameters of the tri-plane decoder also be tuned? I used another 3D GAN model (StyleSDF), which doesn't have the tri-plane generator, and I found that fine-tuning the MLP parameters harmed the geometry.

zhangqianhui avatar Jun 24 '22 05:06 zhangqianhui

Hi, @oneThousand1000,

I tried to use PTI to get the pivot of an image; then, in gen_videos.py, I used the pivot to set zs, which is originally set from random seeds: "zs = torch.from_numpy(np.stack([np.random.RandomState(seed).randn(G.z_dim) for seed in all_seeds])).to(device)". I got the video, but the image is totally different from the previous one. Did I miss anything about the connection between PTI and EG3D? I noticed you said "I optimize the latent code 'w' and use it as the pivot to finetune eg3d." Do you mean we need to generate a dataset and call "train.py" to fine-tune? If we fine-tune anyway, why do we need PTI? I thought PTI was meant to get the latent code as conditioning for EG3D. Thanks for your help.

BiboGao avatar Jun 29 '22 05:06 BiboGao

Hi, @oneThousand1000,

I tried to use PTI to get the pivot of an image; then, in gen_videos.py, I used the pivot to set zs, which is originally set from random seeds: "zs = torch.from_numpy(np.stack([np.random.RandomState(seed).randn(G.z_dim) for seed in all_seeds])).to(device)". I got the video, but the image is totally different from the previous one. Did I miss anything about the connection between PTI and EG3D? I noticed you said "I optimize the latent code 'w' and use it as the pivot to finetune eg3d." Do you mean we need to generate a dataset and call "train.py" to fine-tune? If we fine-tune anyway, why do we need PTI? I thought PTI was meant to get the latent code as conditioning for EG3D. Thanks for your help.

Hi, you need to feed zs into the mapping network of EG3D to get the w or ws latent code, and then optimize that w or ws. Please refer to https://github.com/danielroich/PTI/tree/main/training/projectors or the StyleGAN paper for the definition of the w/ws latent codes.
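A minimal sketch of that step (c and device are assumed to be the target's camera label and the torch device; the optimization loop itself is the same as in the PTI projectors linked above):

```python
import torch

z = torch.randn(1, G.z_dim, device=device)    # or from a fixed seed, as in gen_videos.py
ws = G.mapping(z, c)                          # [1, num_ws, 512]: the w+ latent code
ws = ws.detach().requires_grad_(True)         # the pivot to optimize is ws, not z
img = G.synthesis(ws, c)['image']             # rendered image for the reconstruction loss
```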

oneThousand1000 avatar Jun 29 '22 06:06 oneThousand1000

I got it, thanks.

BiboGao avatar Jun 29 '22 20:06 BiboGao

Thanks for your help, I have also obtained realistic results.

zhangqianhui avatar Jun 30 '22 08:06 zhangqianhui

@bd20222, hello! Which dataset.json do you use? I use ffhq-dataset-v2.json, but it contains no camera parameters.

lyx0208 avatar Jul 04 '22 04:07 lyx0208

@lyx0208, hi, you have to preprocess the dataset in advance. The details can be found here. For your question, the camera parameters provided by the author can be downloaded from https://github.com/NVlabs/eg3d/blob/71ef469df0095c609b2b151127774ea74a1bf17c/dataset_preprocessing/ffhq/runme.py#L48-L50

e4s2022 avatar Jul 04 '22 04:07 e4s2022

@bd20222, got it, thanks!

lyx0208 avatar Jul 04 '22 09:07 lyx0208

FYI, we added additional scripts that preprocess in-the-wild images to be compatible with the FFHQ checkpoints. Hope that is useful. https://github.com/NVlabs/eg3d/issues/18#issuecomment-1200366872

luminohope avatar Jul 31 '22 07:07 luminohope

FYI, we added additional scripts that preprocess in-the-wild images to be compatible with the FFHQ checkpoints. Hope that is useful. #18 (comment)

Hi! I found that all the faces in the FFHQ Processed Data (downloaded from the Google Drive link you provided) are rotated so that the two eyes lie on a horizontal line. But the uploaded scripts seem to do no rotation. Does this matter?

The first image I uploaded is the one from the FFHQ Processed Data; I processed the raw 00000 image using your uploaded scripts and got the second image.

I also found that the uploaded scripts output camera parameters that differ from those in dataset.json. Maybe this is caused by the missing rotation?

The camera parameters predicted by the uploaded scripts:
[ 0.944381833076477, -0.011193417012691498, 0.32866042852401733, -0.828210463398311, -0.010220649652183056, -0.9999367594718933, -0.004687247332185507, 0.005099154064645238, 0.32869213819503784, 0.0010674281511455774, -0.9444364905357361, 2.5698329570120664, 0.0, 0.0, 0.0, 1.0, 4.2647, 0.0, 0.5, 0.0, 4.2647, 0.5, 0.0, 0.0, 1.0 ]

The camera parameters in dataset.json:
[ 0.9422833919525146, 0.034289587289094925, 0.3330560326576233, -0.8367999667889383, 0.03984849900007248, -0.9991570711135864, -0.009871904738247395, 0.017018394869192363, 0.33243677020072937, 0.022573914378881454, -0.9428553581237793, 2.566997504832856, 0.0, 0.0, 0.0, 1.0, 4.2647, 0.0, 0.5, 0.0, 4.2647, 0.5, 0.0, 0.0, 1.0 ]

(Attached images: img00000000, 00000)

oneThousand1000 avatar Jul 31 '22 08:07 oneThousand1000

Hi! I think the [image align code](https://github.com/Puzer/stylegan-encoder/blob/master/align_images.py) provided by [stylegan-encoder](https://github.com/Puzer/stylegan-encoder) may be useful for rotating the image, if you want to get a re-aligned image similar to the one in the FFHQ Processed Data.

Maybe for in-the-wild images, the rotation is an unnecessary step.

I modified the code and used it to rotate and crop the image; after rotation the resulting image seems to be consistent with the one in the FFHQ Processed Data.

oneThousand1000 avatar Jul 31 '22 09:07 oneThousand1000