
A few basic questions on implementation details.

Open mazzzystar opened this issue 4 years ago • 58 comments

Thanks for your work! Here are some questions:

  • About formula 4 and your implementation. (screenshot of the color-annotated formula) Does each color of variable here match the original formula correctly? In my understanding, the reference frame R is the origin of the K-dim space in virtual coordinates, and z denotes "the spatial coordinate of D", but in your code it's a standard meshgrid, the identity_grid.

  • About Hourglass: I don't see the Hourglass module in your paper, but it appears in your code: https://github.com/AliaksandrSiarohin/first-order-model/blob/902f83a4217c75c842e1f536b3331c5032b703a2/modules/dense_motion.py#L15

What does it stand for? We already have (a) the warp module and (b) the occlusion module.

I would appreciate your answer ~

mazzzystar avatar Apr 16 '20 13:04 mazzzystar

  1. Colors are right.
  2. I need to compute the source coordinate for each driving coordinate (z). So the identity grid just contains all the driving coordinates in a grid, e.g. [-1,1]x[-1,1].
  3. Hourglass is just a type of architecture. In the paper it is called a U-Net.
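Schematically, such an identity grid can be built like this (simplified; the actual helper is make_coordinate_grid in modules/util.py, which may differ in details):

```python
import torch

def identity_grid(h, w):
    """Build an identity sampling grid with coordinates in [-1, 1] x [-1, 1].

    Each entry (i, j) holds the (x, y) location of that pixel in
    normalized coordinates, i.e. warping with this grid is a no-op.
    """
    y = torch.linspace(-1, 1, h)          # row coordinates
    x = torch.linspace(-1, 1, w)          # column coordinates
    yy = y.view(-1, 1).repeat(1, w)       # (h, w) grid of y values
    xx = x.view(1, -1).repeat(h, 1)       # (h, w) grid of x values
    return torch.stack([xx, yy], dim=-1)  # (h, w, 2), ordered (x, y)

grid = identity_grid(64, 64)  # every driving coordinate z, in one tensor
```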

AliaksandrSiarohin avatar Apr 16 '20 13:04 AliaksandrSiarohin

Thanks for the quick response, I will check the code for the z part.

After carefully reading your paper, I still cannot understand why the reference frame R is used. What's the benefit of using S<-R<-D rather than computing S<-D directly? Sorry for the naive question.

mazzzystar avatar Apr 16 '20 14:04 mazzzystar

The lack of paired data makes this approach not meaningful. Imagine you train a network to predict a transformation S<-D from [S || D], i.e. S concatenated with D. At training time you can only use frames from the same video, while at test time you will need to use frames from different videos. The network will likely never generalize to frames from different videos, since it never saw any.

To this end we try to learn a network that makes independent predictions for S and D. And to define this properly we introduce R.
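Concretely, going through R and Taylor-expanding around each keypoint p_k gives the following (this should match Eq. 4 in the paper, up to notation):

```latex
% Composition through the reference frame R:
T_{S \leftarrow D} = T_{S \leftarrow R} \circ T_{R \leftarrow D}
                   = T_{S \leftarrow R} \circ T_{D \leftarrow R}^{-1}

% First-order expansion near each keypoint p_k:
T_{S \leftarrow D}(z) \approx T_{S \leftarrow R}(p_k)
    + J_k \bigl( z - T_{D \leftarrow R}(p_k) \bigr),
\qquad
J_k = \Bigl( \tfrac{d}{dp} T_{S \leftarrow R}(p) \big|_{p = p_k} \Bigr)
      \Bigl( \tfrac{d}{dp} T_{D \leftarrow R}(p) \big|_{p = p_k} \Bigr)^{-1}
```

So each side only ever needs its own transformation with respect to R; nothing is predicted from the pair.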

AliaksandrSiarohin avatar Apr 16 '20 18:04 AliaksandrSiarohin

If the purpose of introducing R is to prevent "train on the same video, test on different ones", can we just use random [S, D] pairs from different videos to train the network, without using R?

Also, your model currently still trains with D and S from the same video (if I'm not wrong), so how do you prevent the situation you mentioned? https://github.com/AliaksandrSiarohin/first-order-model/blob/902f83a4217c75c842e1f536b3331c5032b703a2/frames_dataset.py#L110-L114

mazzzystar avatar Apr 17 '20 01:04 mazzzystar

No, this is not the purpose. The purpose is to make independent motion predictions for S and D. If the motion predictions depend on each other, e.g. if the keypoint predictor used a concatenation of S and D, it won't generalize.
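Schematically, the detector only ever sees one frame at a time; a toy stand-in just to show the data flow (the real detector is KPDetector in modules/keypoint_detector.py, which is of course more elaborate):

```python
import torch
import torch.nn as nn

# Toy stand-in for the keypoint detector: any per-frame network works here.
# The point is the interface: it takes ONE frame, never the (S, D) pair.
kp_detector = nn.Sequential(
    nn.Conv2d(3, 10, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Unflatten(1, (5, 2)),  # 5 keypoints, (x, y) each
)

source = torch.randn(1, 3, 64, 64)
driving = torch.randn(1, 3, 64, 64)

kp_source = kp_detector(source)    # motion of S w.r.t. R, independent of D
kp_driving = kp_detector(driving)  # motion of D w.r.t. R, independent of S
```

Only after these independent predictions are the two combined into T_{S<-D}.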

AliaksandrSiarohin avatar Apr 17 '20 03:04 AliaksandrSiarohin

"I need to compute the source coordinate for each driving coordinate (z). So identity grid is just contain all the driving coordinates in a grid, e.g [-1,1]x[-1,1]."

Here I still can't understand the difference between the driving coordinate (z) and kp_driving. I think kp_driving means the relative coordinates of the K keypoints from R->D, and the driving coordinate means the local pixels around each z_k.

If so, why can we represent z with an identity_grid?

mazzzystar avatar Apr 17 '20 03:04 mazzzystar

Yes, true, z is a local pixel around z_k. However, at the point where I produce these sparse motions, I don't know what the neighborhoods will be. So I compute the transformation for all the coordinates in the driving frame, and select the neighborhoods afterwards. Note that all possible coordinates in the driving frame that we may potentially need for warping can be produced by identity_grid.
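Simplified, this is what the sparse-motion step does (illustrative shapes and names; see create_sparse_motions in modules/dense_motion.py for the real, batched version):

```python
import torch

def sparse_motion(identity_grid, kp_source, kp_driving, jac_source, jac_driving):
    """Map every driving coordinate z to a source coordinate, once per keypoint.

    identity_grid:           (h, w, 2)    all driving coordinates z in [-1, 1]^2
    kp_source, kp_driving:   (K, 2)       T_{S<-R}(p_k) and T_{D<-R}(p_k)
    jac_source, jac_driving: (K, 2, 2)    Jacobians of T_{S<-R}, T_{D<-R} at p_k
    returns:                 (K, h, w, 2) T_{S<-D}(z) under each keypoint's model
    """
    # z - T_{D<-R}(p_k), for every coordinate z and every keypoint k
    coords = identity_grid[None] - kp_driving[:, None, None]     # (K, h, w, 2)
    # J_k = J_{S<-R} @ J_{D<-R}^{-1}
    jac = jac_source @ torch.inverse(jac_driving)                # (K, 2, 2)
    coords = torch.einsum('kij,khwj->khwi', jac, coords)         # apply J_k to each z
    # + T_{S<-R}(p_k)
    return coords + kp_source[:, None, None]
```

The neighborhood selection then happens downstream, by weighting these K candidate motions with the predicted heatmaps.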

AliaksandrSiarohin avatar Apr 17 '20 03:04 AliaksandrSiarohin

And, if the purpose is to make independent motion predictions for S and D, then I think we should optimize D and S independently, e.g. optimize (R, S) and (R, D) separately.

But in your code, R is always used as an intermediate variable for S<-D. We do not add any restriction on R, so how can we make sure D and S are trained independently?

mazzzystar avatar Apr 17 '20 03:04 mazzzystar

> And, if the purpose is to make independent motion predictions for S and D, then I think we should optimize D and S independently, e.g. optimize (R, S) and (R, D) separately.
>
> But in your code, R is always used as an intermediate variable for S<-D. We do not add any restriction on R, so how can we make sure D and S are trained independently?

Later only motion information from the driving frame is used, so the prediction will be independent of the driving appearance.

I guess it will be easier if you say how you would implement it, and I will say why that would not work.

AliaksandrSiarohin avatar Apr 17 '20 03:04 AliaksandrSiarohin

OK, for example:

1. Select random frames D0 and S0, which are similar, from different videos.
2. For each frame Di of the driving video, compute keypoints and heatmaps, and the sparse motion between (D0, Di).
3. Predict a new S from (S0, sparse motion), i.e. with Di's motion and S0's appearance.

I haven't totally understood your paper yet, so maybe I've gone in a wrong direction.

mazzzystar avatar Apr 17 '20 04:04 mazzzystar

Yes, this would be the ideal training scheme. Note that for step 3 you need a ground truth for S. So the required training data is pairs of videos (S and D) where two objects perform exactly the same movements, and it is not possible to find this in in-the-wild videos.

AliaksandrSiarohin avatar Apr 17 '20 04:04 AliaksandrSiarohin

But why do I need the full S video? I only use S0 in the whole training phase.

You mean the ground-truth S for the reconstruction loss, which has the same appearance as S0 and the same motion as Di?

mazzzystar avatar Apr 17 '20 04:04 mazzzystar

> But why do I need the full S video? I only use S0 in the whole training phase.

The network is trained with a reconstruction loss, so you will need to compute

|| S - \hat{S} ||

where \hat{S} is the ground-truth video.
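Schematically (the real repo loss is a multi-scale VGG perceptual loss; plain L1 here just to make the formula concrete):

```python
import torch

def reconstruction_loss(generated, ground_truth):
    # || S - \hat{S} || as a plain per-pixel L1 distance.
    # generated, ground_truth: (B, 3, H, W) frame batches.
    return torch.abs(generated - ground_truth).mean()
```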

AliaksandrSiarohin avatar Apr 17 '20 04:04 AliaksandrSiarohin

Now I understand. But what if I warp S0 with the sparse motion between D0 and Di from step 2? Then the warped new S0 can be the ground truth.
It's a similar idea to yours, but without the use of R.

mazzzystar avatar Apr 17 '20 04:04 mazzzystar

What do you mean by align?

AliaksandrSiarohin avatar Apr 17 '20 04:04 AliaksandrSiarohin

sorry, it should be "warp"

mazzzystar avatar Apr 17 '20 04:04 mazzzystar

In that case, what would be your training signal for obtaining the sparse motions between D0 and Di in the first place?

AliaksandrSiarohin avatar Apr 17 '20 04:04 AliaksandrSiarohin

I think the sparse motion can be directly computed from (kp_Di, kp_D0, D0)?

mazzzystar avatar Apr 17 '20 04:04 mazzzystar

Yes, but how do you obtain kp_Di and kp_D0?

AliaksandrSiarohin avatar Apr 17 '20 04:04 AliaksandrSiarohin

With a pretrained keypoint detector?

mazzzystar avatar Apr 17 '20 04:04 mazzzystar

Yes, but the whole purpose of the paper is to avoid that, to be able to train on arbitrary objects.

AliaksandrSiarohin avatar Apr 17 '20 04:04 AliaksandrSiarohin

Until now I didn't realize that keypoint training is coupled with the motion module; I thought keypoint training was an independent part.

mazzzystar avatar Apr 17 '20 04:04 mazzzystar

So here is the purpose of R: because we do not have a labeled keypoint dataset, we need to "assume" there exists a standard frame R for each image pair (D, S), which is the origin of a K-dim space. Then we can train the Keypoint Detector and Motion Module in an unsupervised manner.

Currently I have a well-labeled keypoint dataset and a pretrained model; can I replace the Keypoint Detector with this pretrained model? What's the advantage of keypoints learned in an unsupervised way?

mazzzystar avatar Apr 17 '20 04:04 mazzzystar

  1. More or less. Not sure why it is K-dim; it is more like K two-dimensional coordinate systems.

  2. Sure, you can replace it. Most of the time supervised keypoints work better. Unsupervised keypoints, however, can describe movements that you may forget to annotate. Plus, unsupervised keypoints can encode more stuff per keypoint: for faces people usually use 68 supervised keypoints, while here I use 10.

AliaksandrSiarohin avatar Apr 17 '20 04:04 AliaksandrSiarohin

Thanks a lot. I've finished reading your paper, but with a lot of questions. I will explore the code for a more precise understanding.

Sorry for taking so much of your time; I really appreciate your kind explanations.

mazzzystar avatar Apr 17 '20 05:04 mazzzystar

I think there is a mistake in your video link: the pictures for training and testing are reversed.

This should be training: (image)

and this should be testing: (image)

Am I right?

mazzzystar avatar Apr 17 '20 06:04 mazzzystar

Yes.

AliaksandrSiarohin avatar Apr 17 '20 06:04 AliaksandrSiarohin

Hi, why do you learn the Jacobian rather than compute it directly? Is this because the transformations D<-R and S<-R are themselves unknown? https://github.com/AliaksandrSiarohin/first-order-model/blob/6e18130da4e7931bc423afc85b4ea9f985a938c2/modules/keypoint_detector.py#L25 https://github.com/AliaksandrSiarohin/first-order-model/blob/6e18130da4e7931bc423afc85b4ea9f985a938c2/modules/keypoint_detector.py#L64

mazzzystar avatar Apr 21 '20 08:04 mazzzystar

How would you compute it directly?

AliaksandrSiarohin avatar Apr 21 '20 08:04 AliaksandrSiarohin

No, I'm not saying we "can" compute it; it was just my intuition that the Jacobian would be computed rather than learned. In this model, we don't know the exact function f(x) for the transformations D<-R and S<-R, because you "learn" them with a network, so you then have to learn the Jacobian too, right?
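Looking at the linked code, that seems to be exactly it: the Jacobian entries are regressed as extra conv channels and pooled with the keypoint heatmaps. Roughly (heavily simplified; names and shapes here are illustrative, not the repo's exact code):

```python
import torch

def keypoints_and_jacobians(heatmap_logits, jacobian_map, grid):
    """Soft-argmax keypoints plus a learned 2x2 Jacobian per keypoint.

    heatmap_logits: (B, K, H, W)   one confidence map per keypoint
    jacobian_map:   (B, 4*K, H, W) four regressed Jacobian entries per keypoint
    grid:           (H, W, 2)      identity grid of coordinates in [-1, 1]^2
    """
    B, K, H, W = heatmap_logits.shape
    heatmap = torch.softmax(heatmap_logits.view(B, K, -1), dim=2).view(B, K, H, W)

    # Keypoint location = heatmap-weighted average coordinate (soft-argmax).
    kp = (heatmap.unsqueeze(-1) * grid).sum(dim=(2, 3))          # (B, K, 2)

    # Jacobian = heatmap-weighted average of the regressed entries:
    # learned end-to-end, since T_{D<-R} is only known through the network.
    jac = jacobian_map.view(B, K, 4, H, W)
    jac = (heatmap.unsqueeze(2) * jac).sum(dim=(3, 4))           # (B, K, 4)
    return kp, jac.view(B, K, 2, 2)
```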

mazzzystar avatar Apr 21 '20 08:04 mazzzystar