Moore-AnimateAnyone
About future work
Within the scope of human-related video generation, there are two main emerging problems: Talking Face Generation (TFG) and Human Animation Generation (HAG). The difference between the two lies in the inputs we feed into the models (I assume the models here are diffusion-based):
- For TFG, the inputs are audio + image/video;
- For HAG, the inputs are pose + image/video.
Hence, I wonder: are there any studies that adopt an approach merging the two problems into one? If not, what are the current obstacles (data, modeling, ...)?
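To make the "merge at the input level" idea concrete, here is a minimal sketch of what a shared conditioning interface could look like: a single encoder that projects whichever modalities are present (audio, pose) into one token stream for a diffusion denoiser's cross-attention, with learned null tokens standing in for a missing modality so TFG-style and HAG-style datasets could be mixed in training. All names, dimensions, and the null-token trick are my own assumptions for illustration, not part of Moore-AnimateAnyone or any specific paper.

```python
from typing import Optional

import torch
import torch.nn as nn


class UnifiedConditionEncoder(nn.Module):
    """Hypothetical sketch: maps optional audio and pose features into one
    sequence of conditioning tokens. Dimensions are illustrative only."""

    def __init__(self, audio_dim: int = 768, pose_dim: int = 134,
                 cond_dim: int = 1024):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, cond_dim)  # e.g. wav2vec-style features
        self.pose_proj = nn.Linear(pose_dim, cond_dim)    # e.g. flattened 2D keypoints
        # Learned "null" tokens replace a missing modality, so the same
        # network can train on audio-only (TFG) and pose-only (HAG) batches.
        self.null_audio = nn.Parameter(torch.zeros(1, 1, cond_dim))
        self.null_pose = nn.Parameter(torch.zeros(1, 1, cond_dim))

    def forward(self, audio: Optional[torch.Tensor] = None,
                pose: Optional[torch.Tensor] = None) -> torch.Tensor:
        assert audio is not None or pose is not None, "need at least one modality"
        batch = audio.shape[0] if audio is not None else pose.shape[0]
        a = (self.audio_proj(audio) if audio is not None
             else self.null_audio.expand(batch, 1, -1))
        p = (self.pose_proj(pose) if pose is not None
             else self.null_pose.expand(batch, 1, -1))
        # Concatenate along the sequence axis; the denoiser's cross-attention
        # would then treat both modalities as a single token stream.
        return torch.cat([a, p], dim=1)


if __name__ == "__main__":
    enc = UnifiedConditionEncoder()
    audio = torch.randn(2, 50, 768)  # 2 clips, 50 audio frames
    pose = torch.randn(2, 24, 134)   # 2 clips, 24 pose frames
    print(enc(audio, pose).shape)    # torch.Size([2, 74, 1024])
    print(enc(audio=audio).shape)    # torch.Size([2, 51, 1024])  TFG-style batch
    print(enc(pose=pose).shape)      # torch.Size([2, 25, 1024])  HAG-style batch
```

Even with such an interface, the harder obstacles would presumably remain on the data side (few datasets annotate both synchronized audio and full-body pose) and in balancing the two conditioning signals during training.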