How to do pose transfer for animating a character based on a video that already has the animation?
- Inputting the first/last/random frames of a character, and filling in the other frames from the poses.
- Inputting a full sequence of poses, plus a picture of another character as the ref-image.
Hi, I followed the instructions in https://github.com/ali-vilab/VACE/issues/28 and still couldn't get the expected results: the generated character's face shape is not consistent with the reference image. The detailed steps are:
- Input the reference video and get the pose video through the pose task.
- Input the reference image and get the mask video through the frameref task; in the output mask video the first frame is black and the other frames are white.
- Replace the first frame of the pose video with the reference image, leaving the other frames unchanged.
- Use the pose video with the reference image as src_video and the mask video as src_mask.
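Roughly, that assembly step looks like the sketch below (file names, resolution handling, and fps are placeholders, and it assumes imageio with its ffmpeg backend; how src_video / src_mask are then passed to VACE depends on the CLI flags or ComfyUI nodes you use):

```python
import imageio
import numpy as np
from PIL import Image

# Pose video produced by the pose preprocessing task.
pose_frames = imageio.mimread("pose_video.mp4", memtest=False)   # list of (H, W, 3) frames
H, W = pose_frames[0].shape[:2]

# Reference character image, resized to the video resolution.
ref = np.array(Image.open("reference.png").convert("RGB").resize((W, H)))

# src_video: reference image as frame 0, driving pose for all remaining frames.
src_video = [ref] + pose_frames[1:]

# src_mask: black (keep this frame as-is) on frame 0, white (generate) everywhere else.
src_mask = [np.zeros((H, W, 3), np.uint8)] + \
           [np.full((H, W, 3), 255, np.uint8) for _ in pose_frames[1:]]

imageio.mimwrite("src_video.mp4", src_video, fps=16)
imageio.mimwrite("src_mask.mp4", src_mask, fps=16)
```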
First and foremost, there is a tool called UniAnimate. It is built with pose transfer in mind and does ONLY pose transfer. Although VACE supports pose transfer out of the box, that is not the sole task it is meant to do, so it lacks some of the specialized design choices that could improve pose-transfer quality. If your pipeline only needs pose transfer and nothing else, you should probably look deeper into UniAnimate and use that instead. If for whatever reason you need to use VACE for pose transfer, below are my observations and some best practices I have concluded from my experience with this specific task:
VACE's pose reference works pretty much like the good old ControlNet: it tries to align the output pose one to one, i.e. if you run the output video through OpenPose again you'll probably get a near pixel-to-pixel reconstruction of the control video (assuming OpenPose were perfect, which it is not, but OpenPose is not where the problem lies anyway; you can check this yourself with the keypoint sketch after the list below). This means that if the character you are referencing does not have body proportions close enough to the driving pose, the output will appear squished or stretched, depending on the situation. UniAnimate combats this by introducing an implicit motion guider and by intentionally training with the pose and reference image unaligned (which is where it shines compared to other solutions), so it handles amorphous characters (characters that do not have normal human body proportions, or that have missing or extra body parts) much better. Without further fine-tuning (which sounds unnecessary considering how good VACE already is), VACE is inherently not going to perform as well as UniAnimate in edge cases; basically, if you are not animating human characters with normal body proportions, you are bound to see stretched or squished characters.

At this point it should be clear why VACE cannot get some of these cases right. If you just cannot get the output to look good no matter what you try (I've even tried using inpainting to transfer the pose, but of course the output was garbage), it probably means you are using the wrong tool in the first place. But before you give up, here are some things you might want to consider before swapping tools (basically my experience using VACE for pose transfer):
- Always try to use a driving pose that matches the body proportions of your character: If you are generating the driving video directly from animation software, match your rig to your character to ensure the best quality. If not, choose videos where the performer has roughly the same body proportions as your character (same torso and limb lengths, roughly the same neck-to-eye and eye-to-ear distances); a rough way to compare proportions is included in the keypoint sketch after this list. Also, if you are directly animating an image (giving it a random frame) instead of doing reference generation (where your control video is just the pose video and the masks are all white), make sure the character in the image matches the pose of the driving video at the frame where you insert the input image, or it will probably struggle to follow the pose (around that frame).
- Use reference generation instead of extension with pose guidance: As I said in the first point, since the input image is kept intact, if the pose in the input image differs from the pose in the driving video, VACE will try to interpolate between the pose in the input image and the pose in the driving video (a few frames earlier/later), so you will not get a one-to-one copy of your driving pose. Unless you really want to keep the background the same and do not want to run multiple passes, what you should do is first generate your character performing the action, then mask out the background of that video and inpaint it with the background you want (which VACE can also do natively); a sketch of that second pass is included after this list.
- Prompt matters: Although VACE can probably do most of the guesswork itself, prompting will guide it better towards the desired outcome. If you find that your character is wearing a different outfit, loses or gains certain body parts, or part of it just straight up won't move despite the driving pose telling it to, it means VACE probably did not realize those parts belong to your character. Try including a description of the hair color, what the character is wearing, and any decorations or features the character should have, and most of the time VACE will do a better job (which contradicts some other opinions I've seen, but I found that as long as your prompt is accurate and concise, it usually improves the output). Also, if the character does something weird (like arms clipping through the head or the head twisted at some weird angle), describe the pose: a short description of what the character should do in the video should alleviate the problem. However, do not over-prompt; misaligned prompts and control signals are a recipe for gibberish output.
- About limbs or body parts not in the driving video: I found that when the driving video does not show feet and legs but the reference image contains them, VACE sometimes decides to map the thighs in the driving video to the character's entire lower limbs, which is not desirable. I would suggest cropping out the parts of the image that show body parts the driving video does not contain, and not prompting about those parts at all, to minimize the chance of this kind of output. If you are dealing with characters that simply do not have certain limbs or body parts, masking out the corresponding parts in the driving video might work, but it's up to you to test that.
- It's all about tradeoffs: I'm not entirely sure whether there is a knob to control the strength of the control signal (I mainly use ComfyUI for its memory optimizations; I'm more on the GPU-poor side), but if there is, lowering it might preserve the character's appearance better at the cost of a misaligned pose. If you're OK with the pose being slightly off (which should be fine in most cases), try to find a sweet spot where both the appearance and the pose are preserved as much as possible.
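To make the alignment and proportion points above concrete, here is a small keypoint-based sanity check. It is only a sketch: `extract_keypoints` is a stand-in for whatever pose backend you already use (OpenPose, DWPose, ...), assumed to return 2D pixel coordinates, and the joint names and the 15% tolerance are arbitrary placeholders.

```python
import numpy as np

def mean_joint_error(driving_kpts: np.ndarray, output_kpts: np.ndarray) -> float:
    """Mean pixel distance between the driving pose and the pose re-extracted from
    the generated video; both arrays are (T, J, 2) with NaN for undetected joints."""
    return float(np.nanmean(np.linalg.norm(driving_kpts - output_kpts, axis=-1)))

def _seg(kpts, a, b):
    """Length of the segment between two named joints in one frame or image."""
    return float(np.linalg.norm(np.asarray(kpts[a]) - np.asarray(kpts[b])))

def proportion_ratios(kpts):
    """A few limb-length ratios normalized by torso length (joint names are placeholders)."""
    torso = _seg(kpts, "neck", "hip")
    return {
        "arm/torso": (_seg(kpts, "shoulder", "elbow") + _seg(kpts, "elbow", "wrist")) / torso,
        "leg/torso": (_seg(kpts, "hip", "knee") + _seg(kpts, "knee", "ankle")) / torso,
        "head/torso": _seg(kpts, "neck", "nose") / torso,
    }

def proportions_match(driver_kpts, character_kpts, tol=0.15):
    """True if every ratio differs by less than `tol` (15% here, tune to taste)."""
    d, c = proportion_ratios(driver_kpts), proportion_ratios(character_kpts)
    return all(abs(d[k] - c[k]) / max(c[k], 1e-6) < tol for k in d)

# driving_kpts = extract_keypoints("pose_video.mp4")       # placeholder backend
# output_kpts  = extract_keypoints("generated_video.mp4")
# print(mean_joint_error(driving_kpts, output_kpts))
```

A small mean joint error confirms the pose really was copied almost exactly, so any distortion you still see comes from the proportion mismatch rather than pose drift; if proportions_match comes back False, expect squishing or stretching and consider a different driving clip.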
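And here is a minimal sketch of the second pass from the reference-generation bullet: keep the character from the first-pass output and let VACE regenerate only the background. It assumes you already have per-frame character masks from any segmentation/matting tool (white = character) and imageio with its ffmpeg backend; file names and fps are placeholders, and the background you want goes into the prompt of this pass.

```python
import imageio
import numpy as np

first_pass = imageio.mimread("first_pass_output.mp4", memtest=False)
char_masks = imageio.mimread("character_masks.mp4", memtest=False)

src_mask = []
for m in char_masks:
    m = m[..., :3] if m.ndim == 3 else np.stack([m] * 3, -1)
    # Invert: black over the character (keep), white over the background (inpaint).
    src_mask.append(np.where(m > 127, 0, 255).astype(np.uint8))

imageio.mimwrite("src_video.mp4", first_pass, fps=16)   # first-pass video as the source
imageio.mimwrite("src_mask.mp4", src_mask, fps=16)
```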
That's all I have to offer for now. If anyone can add their experience (especially with characters that are not human), it would be much appreciated.