
Does not support high-resolution images

ghost opened this issue 4 years ago • 57 comments

Is there a way to support high-resolution images?

ghost avatar Mar 15 '20 02:03 ghost

  1. The only reliable method is to retrain on high-resolution videos.
  2. You can also try an off-the-shelf video super-resolution method.
  3. Since all the networks are fully convolutional, you can actually try to use the pretrained checkpoints trained on 256 images. To do this, change the size in https://github.com/AliaksandrSiarohin/first-order-model/blob/2ed57e0e7825717a966ea9eca95e7abd61edd78f/demo.py#L121 to the size that you want. It may also be beneficial to change the scale_factor parameter in the config at https://github.com/AliaksandrSiarohin/first-order-model/blob/2ed57e0e7825717a966ea9eca95e7abd61edd78f/config/vox-256.yaml#L26 and https://github.com/AliaksandrSiarohin/first-order-model/blob/2ed57e0e7825717a966ea9eca95e7abd61edd78f/config/vox-256.yaml#L38. For example, if you want 512-resolution images, change it to 0.125 so that the input resolution for these networks is always 64.

If you have any luck with these, please share your findings.

AliaksandrSiarohin avatar Mar 15 '20 04:03 AliaksandrSiarohin
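A minimal sketch of point 3, using 512 as the target. The yaml keys follow the structure of config/vox-256.yaml (also visible in the diff posted later in this thread), the file names are placeholders, and the resize calls mirror the ones already in demo.py:

import imageio
import yaml
from skimage.transform import resize

TARGET = 512                       # desired output resolution
scale_factor = 64 / TARGET         # 0.25 for 256, 0.125 for 512, 0.0625 for 1024

with open("config/vox-256.yaml") as f:
    config = yaml.safe_load(f)

# both downsampling networks should still see 64x64 inputs
config["model_params"]["kp_detector_params"]["scale_factor"] = scale_factor
config["model_params"]["generator_params"]["dense_motion_params"]["scale_factor"] = scale_factor

# in demo.py, resize the source image and every driving frame to the target size
source_image = resize(imageio.imread("source.png"), (TARGET, TARGET))[..., :3]
driving_video = [resize(frame, (TARGET, TARGET))[..., :3]
                 for frame in imageio.mimread("driving.mp4", memtest=False)]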

@AliaksandrSiarohin thanks for the feedback.

Note, however, that point 3 doesn't work out of the box: if I change the scale factors as you suggest, I get an error about incompatible shapes.

Also, as I'm planning to try out some super-resolution methods for this, I'm curious what you mean by "off-the-shelf video super-resolution method"?

5agado avatar Mar 18 '20 15:03 5agado

Can you post the error message you got? By that I mean any video super-resolution method, like the ones listed at https://paperswithcode.com/task/video-super-resolution

AliaksandrSiarohin avatar Mar 18 '20 16:03 AliaksandrSiarohin
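For point 2, the idea is to generate at 256 as usual and then upscale each output frame afterwards. A rough sketch of that pipeline; the plain skimage resize below is only a stand-in where a learned video super-resolution model would go, and the file names and fps are assumptions:

import imageio
from skimage import img_as_ubyte
from skimage.transform import resize

def upscale_frame(frame, factor=2):
    # stand-in upscaler: plain resize; replace with an actual (video)
    # super-resolution model from the link above for real detail gains
    h, w = frame.shape[:2]
    return img_as_ubyte(resize(frame, (h * factor, w * factor)))

frames = imageio.mimread("result.mp4", memtest=False)   # 256x256 output of demo.py
imageio.mimsave("result_upscaled.mp4", [upscale_frame(f) for f in frames], fps=25)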

@AliaksandrSiarohin

Error(s) in loading state_dict for OcclusionAwareGenerator:
	size mismatch for dense_motion_network.down.weight: copying a param with shape torch.Size([3, 1, 13, 13]) from checkpoint, the shape in current model is torch.Size([3, 1, 29, 29]).

5agado avatar Mar 18 '20 18:03 5agado

Ah yes, you are right. Can you try hard-setting sigma=1.5 in https://github.com/AliaksandrSiarohin/first-order-model/blob/2ed57e0e7825717a966ea9eca95e7abd61edd78f/modules/util.py#L205?

AliaksandrSiarohin avatar Mar 18 '20 18:03 AliaksandrSiarohin
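For reference, the suggested change boils down to replacing the scale-derived sigma with a constant inside AntiAliasInterpolation2d; the surrounding lines here follow the diff posted later in the thread:

# modules/util.py, inside AntiAliasInterpolation2d.__init__
# original: sigma = (1 / scale - 1) / 2
sigma = 1.5                              # matches the kernel the 256x256 checkpoint was trained with
kernel_size = 2 * round(sigma * 4) + 1   # -> 13, i.e. the [3, 1, 13, 13] shape expected by the checkpoint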

Cool, that worked! Could it be generalized to other resolutions? I'll do some tests and comparisons using super-resolution.

5agado avatar Mar 18 '20 18:03 5agado

What do you mean? Generalized?

AliaksandrSiarohin avatar Mar 18 '20 19:03 AliaksandrSiarohin

Is the scale factor proportional to image size? Like if I wanted to try with 1024x1024 I should use scale_factor = 0.0625?

Also is the fixed sigma (1.5) valid only for size 512? What about for size 1024?

I was interested in generalizing my setup such that these values can be derived automatically from the given image size.

5agado avatar Mar 18 '20 19:03 5agado

Yes, you should use scale_factor = 0.0625. In other words, kp_detector and dense_motion should always operate on the same 64x64 resolution. This sigma is a parameter of the anti-aliasing used for downsampling; in principle any value could be used, and I selected the one used by default in scikit-image, so sigma=1.5 is the default for 256x256. But I don't think it affects results that much. So you can leave it equal to 1.5, or you can avoid loading the dense_motion_network.down.weight parameter by removing it from the state_dict.

AliaksandrSiarohin avatar Mar 18 '20 19:03 AliaksandrSiarohin
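Putting those two points together: scale_factor follows from the target size (so the downsampled input stays 64x64), and the sigma-dependent kernel can simply be skipped when loading. A sketch of the loading part, assuming the checkpoint stores the generator weights under the 'generator' key as demo.py's load_checkpoints does, and that generator is the OcclusionAwareGenerator built from the modified config:

import torch

TARGET = 1024
scale_factor = 64 / TARGET            # 0.125 for 512, 0.0625 for 1024

checkpoint = torch.load("vox-cpk.pth.tar", map_location="cpu")
generator_state = checkpoint["generator"]

# drop the anti-alias kernel whose size depends on sigma/scale, so the freshly
# built generator keeps the kernel it computed for the new resolution
generator_state.pop("dense_motion_network.down.weight", None)
generator.load_state_dict(generator_state, strict=False)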

Thanks so much for the support, really valuable info here!

5agado avatar Mar 19 '20 15:03 5agado

Hi, have you retrained on high-resolution videos? If I do not retrain on new datasets and instead just do what point 3 mentions, can I get a good result?

CarolinGao avatar Mar 24 '20 09:03 CarolinGao

See https://github.com/tg-bomze/Face-Image-Motion-Model for point 2.

AliaksandrSiarohin avatar Apr 02 '20 06:04 AliaksandrSiarohin

@AliaksandrSiarohin @5agado I have run some tests using the method detailed in point 2.

Generally the result looks like this:

[animated GIF: ezgif-1-3f05db10770d]

It would be good to get your thoughts on whether this is an issue of using a checkpoint trained on 256 x 256 images, or if I am doing something wrong...

Many thanks for your excellent work.

LopsidedJoaw avatar Apr 08 '20 15:04 LopsidedJoaw

@AliaksandrSiarohin

sigma=1.5 does not work for 1024x1024 source images (with scale factor of 0.0625). I get the following error:

  File "C:\Users\admin\git\first-order-model\modules\util.py", line 180, in forward
    out = torch.cat([out, skip], dim=1)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 1 and 2 in dimension 2 at c:\a\w\1\s\tmp_conda_3.6_061433\conda\conda-bld\pytorch_1544163532679\work\aten\src\thc\generic/THCTensorMath.cu:83

But I can confirm that hard coding sigma=1.5 works only for 512x512 images (with scale factor of 0.125).

Can you please let us know the correct setting for 1024x1024 images? Thank you for your wonderful work.

pidginred avatar Apr 11 '20 16:04 pidginred

@pidginred can you provide the full stack trace and your configs?

AliaksandrSiarohin avatar Apr 11 '20 22:04 AliaksandrSiarohin

@AliaksandrSiarohin Certainly! Here are the changes I made (for 1024x1024 / 0.0625) & the full error stack:

Diffs

diff --git a/config/vox-256.yaml b/config/vox-256.yaml
index abfe9a2..10fce42 100644
--- a/config/vox-256.yaml
+++ b/config/vox-256.yaml
@@ -23,7 +23,7 @@ model_params:
      temperature: 0.1
      block_expansion: 32
      max_features: 1024
-     scale_factor: 0.25
+     scale_factor: 0.0625
      num_blocks: 5
   generator_params:
     block_expansion: 64
@@ -35,7 +35,7 @@ model_params:
       block_expansion: 64
       max_features: 1024
       num_blocks: 5
-      scale_factor: 0.25
+      scale_factor: 0.0625
   discriminator_params:
     scales: [1]
     block_expansion: 32
diff --git a/demo.py b/demo.py
index 848b3df..28bea70 100644
--- a/demo.py
+++ b/demo.py
@@ -134,7 +134,7 @@ if __name__ == "__main__":
     reader.close()
     driving_video = imageio.mimread(opt.driving_video, memtest=False)
 
-    source_image = resize(source_image, (256, 256))[..., :3]
+    source_image = resize(source_image, (1024, 1024))[..., :3]
     driving_video = [resize(frame, (256, 256))[..., :3] for frame in driving_video]
     generator, kp_detector = load_checkpoints(config_path=opt.config, checkpoint_path=opt.checkpoint, cpu=opt.cpu)
 
diff --git a/modules/util.py b/modules/util.py
index 8ec1d25..cb8b149 100644
--- a/modules/util.py
+++ b/modules/util.py
@@ -202,7 +202,7 @@ class AntiAliasInterpolation2d(nn.Module):
     """
     def __init__(self, channels, scale):
         super(AntiAliasInterpolation2d, self).__init__()
-        sigma = (1 / scale - 1) / 2
+        sigma = 1.5 # Hard coded as per issues/20#issuecomment-600784060
         kernel_size = 2 * round(sigma * 4) + 1
         self.ka = kernel_size // 2
         self.kb = self.ka - 1 if kernel_size % 2 == 0 else self.ka

Full Errors

(base) C:\Users\admin\git\first-order-model-1024>python demo.py  --config config/vox-256.yaml --driving_video driving.mp4 --source_image source.jpg --checkpoint "C:\Users\admin\Downloads\vox-cpk.pth.tar" --relative --adapt_scale
demo.py:27: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(f)
Traceback (most recent call last):
  File "demo.py", line 150, in <module>
    predictions = make_animation(source_image, driving_video, generator, kp_detector, relative=opt.relative, adapt_movement_scale=opt.adapt_scale, cpu=opt.cpu)
  File "demo.py", line 65, in make_animation
    kp_driving_initial = kp_detector(driving[:, :, 0])
  File "C:\Users\admin\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\admin\Anaconda3\lib\site-packages\torch\nn\parallel\data_parallel.py", line 141, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "C:\Users\admin\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\admin\git\first-order-model-1024\modules\keypoint_detector.py", line 53, in forward
    feature_map = self.predictor(x)
  File "C:\Users\admin\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\admin\git\first-order-model-1024\modules\util.py", line 196, in forward
    return self.decoder(self.encoder(x))
  File "C:\Users\admin\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\admin\git\first-order-model-1024\modules\util.py", line 180, in forward
    out = torch.cat([out, skip], dim=1)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 1 and 2 in dimension 2 at c:\a\w\1\s\tmp_conda_3.6_061433\conda\conda-bld\pytorch_1544163532679\work\aten\src\thc\generic/THCTensorMath.cu:83

pidginred avatar Apr 12 '20 13:04 pidginred

@pidginred fixed sigma worked on my side for any resolution, including 1024x1024. it's not the cause of your problem.

eps696 avatar Apr 12 '20 15:04 eps696

@eps696 What was your scale factor for 1024x1024? And did you get a proper output?

pidginred avatar Apr 12 '20 15:04 pidginred

@pidginred same as yours, 0.0625. but i also resize driving_video, not only source_image (which i see you don't).

eps696 avatar Apr 12 '20 15:04 eps696
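In other words, the diff above resizes only source_image; for a 1024x1024 run both inputs have to match, along the lines of the following sketch, which mirrors the existing resize calls in demo.py:

from skimage.transform import resize

# demo.py (sketch): resize the source image and every driving frame to 1024
source_image = resize(source_image, (1024, 1024))[..., :3]
driving_video = [resize(frame, (1024, 1024))[..., :3] for frame in driving_video]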

@eps696 Confirmed that worked. However, I lost almost all eye & mouth tracking (compared to 256x256), and it results in lots of weird artifacts and very poor-quality output.

Are you getting good quality results (in terms of animation) using 1024x1024 compared to 256x256?

pidginred avatar Apr 12 '20 16:04 pidginred

@pidginred i've used it for rather artistic purposes (applying it to face-alike imagery), so i cannot confirm 100%. it definitely behaved very similarly at 1024 and 256 resolutions, though. speaking of animation quality, quite a lot has been said here about the need for similar poses (or facial expressions) between the source image and the starting video frame. i think you may want to check that first.

eps696 avatar Apr 12 '20 16:04 eps696

@AliaksandrSiarohin @5agado I have run some tests using the method detailed in point 2.

Generally the result looks like this:

[animated GIF: ezgif-1-3f05db10770d]

It would be good to get your thoughts on whether this is an issue of using a checkpoint trained on 256 x 256 images, or if I am doing something wrong...

Many thanks for your excellent work.

I had the same problem

zpeiguo avatar Apr 16 '20 07:04 zpeiguo

@eps696 Can you share the revised file? After I followed the above steps, the facial movements were normal, but the mouth could not open.

zpeiguo avatar Apr 16 '20 07:04 zpeiguo

@zpeiguo that project is not released yet, sorry. and this topic is about high-res images; check other issues for 'normality' of movements.

eps696 avatar Apr 16 '20 07:04 eps696

@eps696 Can you share the revised file? After I followed the above steps, the facial movements were normal, but the mouth could not open.

Same here. Mouth won't open. I believe the best option is to retrain everything at 512 resolution.

shillerz avatar Apr 16 '20 14:04 shillerz

@eps696 Confirmed that worked. However, I lost almost all eye & mouth tracking (compared to 256x256), and it results in lots of weird artifacts and very poor-quality output.

Are you getting good quality results (in terms of animation) using 1024x1024 compared to 256x256?

I have also tested the third method with 512; the animation quality is lower than at 256. I have no judgement as to why; I expected the quality to be the same given the same 64x64 keypoint input.

boraturant avatar Jun 20 '20 20:06 boraturant

I got method 3 working on Windows 10 following the steps above and successfully output a 512 version. However, the results are of much lower quality animation wise. Hoping we can get a 512 or higher checkpoint trained soon.

BloodBlackNothingness avatar Jun 21 '20 00:06 BloodBlackNothingness

I got method 3 working on Windows 10 following the steps above and successfully output a 512 version. However, the results are of much lower quality animation wise. Hoping we can get a 512 or higher checkpoint trained soon.

I also followed method 3 and the animation is not acceptable :-( Mouth does not open at all and the face is distorted all the time. Maybe have to use AI to upscale 256 to 512 video :-)

bigboss97 avatar Jun 25 '20 05:06 bigboss97

I got method 3 working on Windows 10 following the steps above and successfully output a 512 version. However, the results are of much lower quality animation wise. Hoping we can get a 512 or higher checkpoint trained soon.

I also followed method 3 and the animation is not acceptable :-( Mouth does not open at all and the face is distorted all the time. Maybe have to use AI to upscale 256 to 512 video :-)

Yes in theory. It depends on the video output quality I suppose. I have tried with Topaz Labs software and it also enhances distortions.

BloodBlackNothingness avatar Jun 26 '20 00:06 BloodBlackNothingness

@AliaksandrSiarohin @5agado I have run some tests using the method detailed in point 2.

Generally the result looks like this:

[animated GIF: ezgif-1-3f05db10770d]

It would be good to get your thoughts on whether this is an issue of using a checkpoint trained on 256 x 256 images, or if I am doing something wrong...

Many thanks for your excellent work.

Which super resolution network did you end up using? :)

lschaupp avatar Oct 17 '20 22:10 lschaupp