An interesting result in v2v mode
When using v2v for expression driving with the same video as both the source and the driving input, the results show 'exaggerated expressions' (the mouth opens wider or closes less than in the original). Shouldn't the result be exactly the same as the driving video?
https://github.com/user-attachments/assets/4366169f-fd45-4a69-ab2e-d81a93ee55d3
Thanks for your feedback @iloveOREO
We updated the main branch to fix a small bug; the last frame no longer has pursed lips.
We will post more details about this phenomenon in this issue tomorrow.
Thanks for your feedback. @iloveOREO
If you want the source video and driving video to be the same video, and the animated video to be as similar as possible to the source video, you can use `python inference.py --no_flag_relative_motion --no_flag_do_crop`. In this way, you can achieve the following result:
https://github.com/user-attachments/assets/c1e73ee1-e151-41f8-833f-ffcbb2fa3ef8
Here, we are not using relative driving, but absolute driving. The difference between the two is that `--flag_relative_motion` means that the motion offset of the current driving frame relative to the first driving frame will be added to the motion of the source frame as the final driving motion, while `--no_flag_relative_motion` means that the motion of the current driving frame will be used directly as the final driving motion.
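A minimal sketch of the two modes (illustrative only, not the exact repo code; the `x_s_info` / `x_d_0_info` / `x_d_i_info` dicts and the `'exp'` key follow the naming used in the inference code):

```python
def compute_driving_exp(x_s_info, x_d_0_info, x_d_i_info, flag_relative_motion=True):
    """Expression term used to drive frame i (sketch, not the exact repo code)."""
    if flag_relative_motion:
        # Relative: the offset of driving frame i w.r.t. driving frame 0
        # is added on top of the source frame's expression.
        return x_s_info['exp'] + (x_d_i_info['exp'] - x_d_0_info['exp'])
    # Absolute: the current driving frame's expression is used directly.
    return x_d_i_info['exp']
```

When source and driving are the same frame-aligned video, relative mode yields roughly `2 * exp_i - exp_0` at frame `i` (the amplification described next), while absolute mode reproduces `exp_i` exactly.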
If you use the default `--flag_relative_motion`, then when the source frame is a smile and the driving frame has an expression deformation relative to the first driving frame, the expression of the animated frame will be a smile added to a smile, so the expression is amplified. The animated video in this setting is as follows:
https://github.com/user-attachments/assets/7d36f137-cca5-4945-935b-242f240e3f56
Thank you for your reply. 'Absolute driving' performs well in this case.
However, I also tried generating with different videos/IDs and found that there is always some jitter when using `--no_flag_relative_motion` with no relative head rotation (v2v).
https://github.com/user-attachments/assets/b91c2c69-70bb-443f-bfb5-f1b0b0f1ac1e
Initially, I thought this was caused by `t_new = x_d_i_info['t']`, so I tried changing it to `t_new = x_s_info['t']` (since `R_new = R_s`, shouldn't this be the case?), but the results didn't change significantly. Finally, I tried setting `t_new = torch.zeros(x_d_i_info['t'].size()).to(device)` and found no visible difference in the generated results. So, is the main source of head jitter from `x_d_i_info['exp']`?
https://github.com/user-attachments/assets/a53271f8-9eac-44a0-bb3d-764291562a2b
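For reference, this is how I understand the per-frame composition in this setting (a sketch only; key names like `'scale'` and the exact formula are my assumptions from reading the code):

```python
import torch

def compose_keypoints(x_c_s, x_s_info, x_d_i_info, R_s):
    # With R_new = R_s fixed and t_new zeroed out, every term below is
    # constant across frames except x_d_i_info['exp'], so any frame-to-frame
    # jitter in the output keypoints has to enter through 'exp'.
    t_new = torch.zeros_like(x_d_i_info['t'])  # zeroing t made no visible difference
    exp_new = x_d_i_info['exp']                # the only time-varying term
    return x_s_info['scale'] * (x_c_s @ R_s + exp_new) + t_new
```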
How can real 'absolute driving' be achieved, where only the expression is edited and the original head movement is retained?
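What I have in mind is something like the following (a hypothetical sketch; the per-frame `x_s_i_info` dict, `R_s_i`, and the key names are my assumptions): take the pose from each source frame and only the expression from the driving frame.

```python
def expression_only_edit(x_c_s, x_s_i_info, R_s_i, x_d_i_info):
    # Hypothetical "true absolute driving": scale, rotation, and translation
    # all come from source frame i; only the expression comes from driving
    # frame i. R_s_i is the rotation matrix of source frame i's pose.
    exp_new = x_d_i_info['exp']
    return x_s_i_info['scale'] * (x_c_s @ R_s_i + exp_new) + x_s_i_info['t']
```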
Additionally, I noticed that the paper specifically mentions: "Note that the transformation differs from the scale orthographic projection, which is formulated as $x = s \cdot (x_c + \delta)R + t$." Could the current representation be causing instability in the generated results under the driving video, due to an inability to fully decouple `exp` from `R`?
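For reference, here are the two forms side by side; the first is my transcription of the paper's own transformation (from my reading, so it may be off), the second is the scale orthographic projection quoted above:

$$x = s\,(x_c R + \delta) + t \qquad \text{vs.} \qquad x = s\,(x_c + \delta)\,R + t$$

In the first form, $\delta$ lives in posed (rotated) coordinates, while in the orthographic form it lives in canonical coordinates and is rotated together with the head.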
@iloveOREO Sorry for bothering you, but I have the same question about that equation, $x = s(x_c + \delta)R + t$.
Did you gain any insight about this equation after your questioning? I believe there is not enough disentanglement between the pose and expression (or deformation) parameters due to the nature of the equation proposed in the paper. There also aren't enough tools for separating them (e.g., a loss term).
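To make this concrete (a rough sketch, again assuming the paper's transformation is $x = s(x_c R + \delta) + t$): if the underlying geometry actually followed the orthographic form with a canonical deformation $\delta_c$, then equating the two forces the learned deformation to absorb the rotation:

$$s\,(x_c + \delta_c)\,R + t = s\,(x_c R + \delta) + t \;\Longrightarrow\; \delta = \delta_c R$$

So the same canonical expression produces a different $\delta$ at every head pose, which is exactly the pose/expression entanglement I mean; a loss term on $\delta$ alone cannot undo it.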
All I know about this is the paragraph in the paper ("Note that the transformation differs from the scale orthographic projection..."), but I cannot find a good ablation study or any supplementary material about it. Is there any experiment explaining this problem?
@samsara-ku Sorry, I didn’t make further attempts to achieve a more accurate expression. As you know, there currently doesn’t seem to be a tool that can perfectly separate pose and expression. This part of the work might involve re-modeling head movements, which is beyond my capabilities. If you come across a useful tool someday, please feel free to share it with me!
@iloveOREO Thank you for the kind replying :smile: