GeneFacePlusPlus

Hello, I have a few questions I would like to ask you

Open lymGitHub0123 opened this issue 1 year ago • 11 comments

Hello, I have a few questions I would like to ask you:

1. The preset audio2motion_vae is trained on an English dataset. If I need a model for a Chinese dataset, can you provide one? If not, how should I go about training audio2motion with Chinese data?
2. The trained motion2video_nerf contains both head and torso models. The head is easy to understand, since it can be driven by audio to produce head/mouth movements. Where does the torso model come in for practical scenarios?
3. I ran the command "python inference/genefacepp_infer.py --a2m_ckpt=checkpoints/audio2motion_vae --head_ckpt= --torso_ckpt=checkpoints/motion2video_nerf/may_torso --drv_aud=data/raw/val_wavs/May.wav --out_name=may_demo.mp4". Why does this demo use the torso model rather than the head model?

lymGitHub0123 avatar Feb 20 '24 08:02 lymGitHub0123

Hi, I've gone through all the issues. On question 1, the author said the audio2motion training code will be released before June. On question 2, I'd also like to know the reason; the paper spends some space explaining it, but I haven't fully understood it. On question 3, my understanding is that the torso model is trained on top of the head model, so the two are not independent; the torso model is effectively the final trained model.

I'm fairly new to this project; if you have a deeper understanding, feel free to discuss.

Net-Maker avatar Feb 22 '24 06:02 Net-Maker

Thanks for your reply. While studying over the past two days, I learned that for question 1, this English model can actually be used for Chinese to some extent, just with somewhat worse results. For questions 2 and 3, I look forward to the author's reply.

lymGitHub0123 avatar Feb 22 '24 07:02 lymGitHub0123

[image] Hi, I think this passage explains questions 2 and 3: the rigid deformation of the torso can mitigate the distortion problem to some extent.

Net-Maker avatar Feb 22 '24 08:02 Net-Maker

Judging from this paper's conclusions, directly using the torso-NeRF should in practice work better than using the head-NeRF (after all, the torso model is trained on top of the head model). Do you happen to know what happens if both models are specified at once with --head_ckpt= --torso_ckpt=?

lymGitHub0123 avatar Feb 22 '24 12:02 lymGitHub0123

The head model's parameters will be ignored.

Net-Maker avatar Feb 23 '24 03:02 Net-Maker
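The precedence described above can be illustrated with a tiny shell sketch. This is hypothetical, not the project's actual code; it only shows the "torso wins when both checkpoints are set" behaviour, using the checkpoint paths from the command earlier in the thread:

```shell
# Hypothetical illustration of checkpoint precedence (not the repo's actual code):
HEAD_CKPT="checkpoints/motion2video_nerf/may_head"
TORSO_CKPT="checkpoints/motion2video_nerf/may_torso"
# When both are given, the torso checkpoint is the one that takes effect:
CKPT="${TORSO_CKPT:-$HEAD_CKPT}"
echo "$CKPT"
```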

Hi, I have some questions about environment installation. Would you have time to help?

You can submit an issue, and lots of people (including me) will help.

Net-Maker avatar Feb 23 '24 04:02 Net-Maker

@Net-Maker A question: for training in this project, a 3min+ video is recommended. 1. Are there any special requirements for this footage? Should the speaker say as many different words and show as many different expressions as possible, or just speak casually for three minutes? 2. Should the source video be shot directly at 512x512 resolution (including the face)?

lymGitHub0123 avatar Mar 04 '24 08:03 lymGitHub0123

For question 1, the footage should definitely contain plenty of speech, and the head should face the camera the whole time without turning away too much; just have the person speak naturally. For question 2, 512x512 is the resolution for the head and shoulders; as I understand it, the head takes up a fairly small portion of the whole frame, so the original recording resolution depends on your own setup.

Net-Maker avatar Mar 05 '24 02:03 Net-Maker
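As a concrete sketch of getting footage into the 512x512 form discussed above, a center square crop plus resize with ffmpeg could look like the following. The file names are assumptions, and the 25 fps rate matches the project's sample data but should be checked against the repo's data-processing docs; the command is echoed here rather than executed:

```shell
# Build a preprocessing command (echoed, so the sketch runs without ffmpeg or input files):
SRC="data/raw/videos/may_raw.mp4"   # assumed source recording
DST="data/raw/videos/may.mp4"       # assumed training input path
# Center square crop (ffmpeg's crop filter centers by default), then resize to 512x512:
VF="crop='min(iw,ih)':'min(iw,ih)',scale=512:512"
CMD="ffmpeg -i $SRC -vf \"$VF\" -r 25 -c:a copy $DST"
echo "$CMD"
```

Run the printed command once the paths point at your own footage; keeping the audio stream (`-c:a copy`) matters because the preprocessing pipeline extracts audio features from the same file.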

@Net-Maker I've encountered several issues during the training process and would like some advice:

1. At the beginning of training, the log shows every single step, for example: "1000 step [02:36, 14.61 steps/s, ......]", "1001 step ......". But after a while, it only outputs logs like these: "04/13 09:24:57 PM Delete ckpt: model_ckpt_steps_2400.ckpt", "04/13 09:24:57 PM Delete ckpt: model_ckpt_steps_2200.ckpt". What could be the reason for this? Does it have any impact?
2. After how many steps does the training of may_head start training may_torso? How many steps do both eventually need to be trained for?
3. How can I process the inferred video to keep only the leftmost part and remove the two sections at the bottom, as illustrated in the image below: [image]

4. My GPU has only 8GB of VRAM. Are there any optimization strategies to accommodate longer inference runs? In my tests, memory allocation tends to fail when the input exceeds a certain length, for instance beyond 10 seconds. Is there a way to free up memory after certain steps?

lymGitHub0123 avatar Apr 14 '24 03:04 lymGitHub0123

Glad to see you've started training! For question 1: it's OK; the program only prints detailed debug information for the first 10,000 steps. For question 2: head training and torso training are separate; if you want to train the torso model, see docs/train.md (I forget the exact name). For question 3: remove the '--debug' option from your inference command. For question 4: maybe you can cut your videos before inference, but this method also needs everything to stay aligned; I'm sorry I can't help much more. Feel free to ask more questions :)

Net-Maker avatar Apr 15 '24 02:04 Net-Maker
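Putting the answer to question 3 into practice, the fix is simply to run the inference command from earlier in this thread with no '--debug' flag. A sketch (the command is echoed rather than executed here, since it needs the repo and checkpoints in place):

```shell
# Inference command without --debug, so only the final rendered view is written:
CMD="python inference/genefacepp_infer.py \
  --a2m_ckpt=checkpoints/audio2motion_vae \
  --torso_ckpt=checkpoints/motion2video_nerf/may_torso \
  --drv_aud=data/raw/val_wavs/May.wav \
  --out_name=may_demo.mp4"
echo "$CMD"
```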

@Net-Maker Thanks so much for the reply. After some verification and source-code analysis: for the low-memory scenario mentioned above, adding the "--low_memory_usage" parameter to the inference command effectively allows longer inference and avoids the "not enough memory" error. However, during training and inference I still have some problems and points of confusion:

  1. Does the video used for training necessarily have to include audio? For example, if it is a silent video, will that affect the trained model's results?
  2. During video inference, when the audio calls for a wide-open mouth, such as an "ah" sound, the generated frame keeps the original face's teeth pressed together and the mouth does not open wide. How should this be solved?

lymGitHub0123 avatar Apr 18 '24 10:04 lymGitHub0123
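For reference, the working low-memory invocation described in the last message would then look like the following sketch. The flag spelling is taken from this thread, so verify it against the argument list in your checkout of inference/genefacepp_infer.py; as before, the command is echoed rather than executed:

```shell
# Inference command with the memory-saving flag reported in this thread appended:
CMD="python inference/genefacepp_infer.py \
  --a2m_ckpt=checkpoints/audio2motion_vae \
  --torso_ckpt=checkpoints/motion2video_nerf/may_torso \
  --drv_aud=data/raw/val_wavs/May.wav \
  --out_name=may_demo.mp4 \
  --low_memory_usage"
echo "$CMD"
```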