ATVGnet
Basic information about Chinese
Thanks for sharing your code. I ran a Chinese audio file with your demo, and the lips were not synchronized with the audio. Is there any solution? Do you plan to train your model on a Chinese lip dataset? Thanks.
@lelechen63 In `lrw_data.py`, what is the difference between the `generating_landmark_lips` function and the `generating_demo_landmark_lips` function? One landmark path is `landmark1d`, the other is `landmark3d`. But when training the `atnet` model, it uses `self.lmark_root_path = '../dataset/landmark1d'`. I hope you can explain this. Thanks.
@lelechen63 Can this be understood as the two functions being two different methods for extracting landmarks, with `demo.py` selecting the `landmark1d` one?
@lelechen63 I am a bit confused about the landmarks. Does this parameter distinguish between training and testing? Is the PCA the same? `U_lrw1.npy` belongs to the training set; does the test set also have a `U_lrw1_test.npy`? When I looked at the source code, I found that both training and testing use `U_lrw1.npy`. Thanks.
The released model is trained on English, but it can be tested on any other language. The reason is that we take the audio input in 0.04-second chunks, which are not sensitive to the linguistic (semantic) information of the language.
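For intuition, here is a minimal sketch of what a 0.04-second audio window per video frame could look like. It assumes 25 fps video, a mono waveform, and the `python_speech_features` package; the exact MFCC settings (`winlen`, `winstep`) are illustrative assumptions, not necessarily what this repository uses.

```python
# Hypothetical sketch: slice audio into one 0.04 s chunk per 25-fps video frame.
# python_speech_features and the MFCC settings below are assumptions, not the
# repository's confirmed configuration.
from python_speech_features import mfcc

def audio_chunks_per_frame(signal, sample_rate, fps=25):
    """Yield MFCC features for one 0.04-second audio chunk per video frame."""
    samples_per_frame = int(sample_rate / fps)   # 1/25 s = 0.04 s of audio
    n_frames = len(signal) // samples_per_frame
    for i in range(n_frames):
        chunk = signal[i * samples_per_frame:(i + 1) * samples_per_frame]
        # Such a short window carries phonetic rather than semantic content,
        # which is why the model is largely insensitive to the language spoken.
        yield mfcc(chunk, samplerate=sample_rate, winlen=0.025, winstep=0.01)
```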
I will clean up the code again this month and will notify you once I have finished. The main landmark pipeline has two steps: 1. align the image using an affine transformation; 2. detect the landmarks. In the original code there is a third step, normalizing the landmarks, but we actually do not need this step.
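For reference, here is a minimal sketch of that two-step pipeline (affine alignment, then landmark detection), assuming `dlib` and OpenCV; the 68-point model path and the canonical eye/nose positions are illustrative assumptions, not the exact values used in this repository.

```python
# Sketch of step 1 (affine alignment) and step 2 (landmark detection).
# The predictor path and canonical target points below are hypothetical.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # hypothetical path

def detect_landmarks(img):
    """Return a (68, 2) array of landmarks for the first detected face."""
    faces = detector(img, 1)
    shape = predictor(img, faces[0])
    return np.array([[p.x, p.y] for p in shape.parts()], dtype=np.float32)

def align_face(img, size=256):
    """Step 1: affine-align the face so the eyes and nose land on fixed points."""
    lmk = detect_landmarks(img)
    left_eye = lmk[36:42].mean(axis=0)
    right_eye = lmk[42:48].mean(axis=0)
    src = np.float32([left_eye, right_eye, lmk[33]])          # eyes + nose tip
    dst = np.float32([[0.35 * size, 0.40 * size],             # illustrative canonical positions
                      [0.65 * size, 0.40 * size],
                      [0.50 * size, 0.60 * size]])
    M = cv2.getAffineTransform(src, dst)
    return cv2.warpAffine(img, M, (size, size))

img = cv2.imread("frame.jpg")
aligned = align_face(img)
landmarks = detect_landmarks(aligned)  # Step 2: landmarks in the aligned frame
```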
The PCA for train and test is the same. The PCA parameters are extracted from the training set and can be used for any videos, including the test set or videos in the wild.
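As a rough illustration of why one basis suffices, here is a minimal PCA sketch in NumPy: fit the mean and basis once on the training landmarks, then reuse them to encode and decode landmarks from any other video. The file names, component count, and landmark shapes are assumptions for illustration, not the actual contents of `U_lrw1.npy`.

```python
# Hypothetical sketch of a single PCA basis fit on training landmarks and
# reused for test or in-the-wild videos.
import numpy as np

def fit_pca(train_lmarks, k=6):
    """train_lmarks: (N, 136) flattened 68x2 landmarks. Returns mean and basis U."""
    mean = train_lmarks.mean(axis=0)
    _, _, vt = np.linalg.svd(train_lmarks - mean, full_matrices=False)
    return mean, vt[:k].T                      # U has shape (136, k)

def project(lmark, mean, U):
    """Encode one landmark vector into k PCA coefficients."""
    return (lmark - mean) @ U

def reconstruct(coeffs, mean, U):
    """Decode PCA coefficients back to a landmark vector."""
    return mean + coeffs @ U.T

# Fit once on the training set, then reuse the same mean/U for any other video.
train = np.load("train_landmarks.npy")         # hypothetical file
mean, U = fit_pca(train)
test_lmark = np.load("test_landmark.npy")      # a video never seen during fitting
coeffs = project(test_lmark, mean, U)
approx = reconstruct(coeffs, mean, U)
```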
Regarding your answer about the 0.04-second audio input, can I understand that the lips being out of sync actually has nothing to do with the training language, and is instead related to the model itself?
@lelechen63 Could you release the training parameters for AT-net and VG-net? Otherwise, it is difficult for us to achieve the results in the paper. Thanks.
@lelechen63 What are the meanings of `new_16_full_gt_train.pkl` and `region_16_wrap_gt_train2.pkl`? Can you explain? `lrw_data.py` is not very clear. Thanks.
I second this question, @lelechen63
What does the 0.04 seconds refer to? The `winlen` or the `winstep` of `mfcc`?
Why is face normalization not needed? From my point of view, individual face shapes are not the same, and they also contain rotations (roll, yaw, pitch). None of these parameters are related to the audio input, so I am wondering why normalization is not needed. Hoping for your reply^