How to make the controlled motion shorter?
Thanks for the great work!
By default, the control motion length is 196. I tried to change it to a shorter length, but I get an error at https://github.com/exitudio/MaskControl/blob/b530ecb58fb64222630023801a42f77e30177b18/models/mask_transformer/control_transformer.py#L478
```python
ctrlNet_cond = (global_joint - _pred_motions_denorm) * global_joint_mask.unsqueeze(-1)
```

```
RuntimeError: The size of tensor a (80) must match the size of tensor b (196) at non-singleton dimension 1
```
Thank you for your interest in our work. The tensor `global_joint` always has the shape `[batch, 196, 22, 3]`. If you want to generate a shorter sequence, set `m_length` to the desired length and pad the rest of `global_joint` with zeros for the remaining frames.
For example, in the generation script it can look something like:

```python
m_length = torch.tensor([80]).cuda()
k = 0
global_joint = torch.zeros((m_length.shape[0], 196, 22, 3), device=m_length.device)
global_joint[k, :, 0] = traj1    # pelvis trajectory
global_joint[k, :, 20] = traj2   # hand-joint trajectory
global_joint[k, 80:] = 0         # zero-pad the frames beyond m_length
```
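To see why the zero-padding resolves the shape error, here is a minimal, self-contained sketch (CPU tensors, with random stand-ins for `traj1`, `traj2`, and the predicted motions — those names and values are placeholders, not the repo's actual data). It reproduces the `ctrlNet_cond` computation from `control_transformer.py` and shows that keeping `global_joint` at the full 196-frame length makes the shapes line up at dimension 1:

```python
import torch

# Assumed shapes: batch=1, max_len=196, 22 joints, 3 coordinates.
m_length = torch.tensor([80])
batch, max_len, n_joints = m_length.shape[0], 196, 22

# Build the padded control signal, as in the snippet above.
global_joint = torch.zeros((batch, max_len, n_joints, 3))
traj1 = torch.randn(max_len, 3)   # stand-in pelvis trajectory (joint 0)
traj2 = torch.randn(max_len, 3)   # stand-in hand trajectory (joint 20)
global_joint[0, :, 0] = traj1
global_joint[0, :, 20] = traj2
global_joint[0, m_length[0]:] = 0  # zero out frames beyond the desired length

# Mask only the controlled joints/frames; everything padded stays False.
global_joint_mask = torch.zeros((batch, max_len, n_joints), dtype=torch.bool)
global_joint_mask[0, :m_length[0], 0] = True
global_joint_mask[0, :m_length[0], 20] = True

# Predicted motions (denormalized) have the same padded length, so the
# subtraction broadcasts cleanly instead of raising the 80-vs-196 error.
pred_motions_denorm = torch.randn(batch, max_len, n_joints, 3)
ctrlNet_cond = (global_joint - pred_motions_denorm) * global_joint_mask.unsqueeze(-1)
print(ctrlNet_cond.shape)  # torch.Size([1, 196, 22, 3])
```

The mask zeroes the condition everywhere past frame 80, so the padded frames contribute nothing to the ControlNet condition.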
Thank you for your prompt reply! It is very helpful.
I would like to know if it's possible to have separate `clip_text` prompts for different timelines. Also, can different text be applied to different body parts? For example, in the first 30 frames have the person walk, and in the next 30 frames have the person run.
I ask because I found that if I use a long sentence to describe a compositional spatial-control signal, the results are not good. Specifically, I give the network a start pose and an end pose, along with a hand trajectory, but it seems the network cannot handle smooth in-betweening and hand-trajectory following simultaneously. The trajectory following is good, but the motion itself is broken.
Also, if I put a hand trajectory in as control, the other body parts may start to jitter randomly. For example, below I set `clip_text = ['A person picks something from the table with right hand while keeping the lower body still.']` and place the hand trajectory in the middle of `global_joint`: the person walks to the hand trajectory and then starts to jitter:
generation.html generation1.html
How can I keep the body more static and move only the hand?
First, let me clarify that HumanML3D represents relative joint positions, so directly concatenating multiple motion clips can lead to inconsistent global positions. Here are the answers to your questions:
- **Separate text prompts for different timelines.** This can be done without MaskControl: simply generate individual motions from multiple text prompts using the Masked Motion Model and then concatenate them (with transition tokens), as described in Section 4, Motion Editing, of MMM.
- **Different body-part timeline control.** We can generate full-body motions for each text prompt separately, then generate a motion from an empty text while conditioning on the body-part timelines from those generated motions. Conceptually, it's like controlling motion generation from a blank canvas. Note that this control uses relative positions, since the global positions of each generated motion can differ. Therefore, the Logits Regularizer cannot be used in this case (or would require retraining to support relative positions). More details are provided in Supplementary Section A.2 of our paper.
- **Broken or unnatural motion.** This issue likely comes from out-of-distribution behavior. The HumanML3D dataset is still relatively small, so certain descriptions (e.g., "while keeping the lower body still") may not be well represented. It helps to refer to the HumanML3D training set and apply controls aligned with it. Mismatches between the control signal and the generated motion can also cause problems; for example, control joints that are too far apart can cause foot sliding.
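To make the body-part timeline idea concrete, here is a hedged sketch of how one could assemble the control signal from two separately generated motions. The joint-index split below is an assumption for the 22-joint skeleton used by HumanML3D, and `motion_run` / `motion_speech` are random stand-ins for motions generated from two different prompts:

```python
import torch

# Assumed lower-body joint indices for the 22-joint HumanML3D skeleton
# (pelvis, hips, knees, ankles, feet); the rest are treated as upper body.
LOWER_BODY = [0, 1, 2, 4, 5, 7, 8, 10, 11]
UPPER_BODY = [j for j in range(22) if j not in LOWER_BODY]

# Stand-ins for two full-body motions generated from separate prompts,
# e.g. "a person runs" and "a person gives a speech".
motion_run = torch.randn(1, 196, 22, 3)
motion_speech = torch.randn(1, 196, 22, 3)

# Compose the control signal: lower body from one motion, upper body
# from the other (assuming pelvis positions were aligned beforehand).
control = torch.zeros(1, 196, 22, 3)
control[:, :, LOWER_BODY] = motion_run[:, :, LOWER_BODY]
control[:, :, UPPER_BODY] = motion_speech[:, :, UPPER_BODY]

# Condition on all joints; generation would then run with an empty
# text prompt and this relative-position control signal (with the
# Logits Regularizer disabled, as noted above).
control_mask = torch.ones(1, 196, 22, dtype=torch.bool)
```

This is only a conceptual illustration of the composition step; the actual conditioning interface follows the MaskControl generation code.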
Thank you so much for your detailed instructions! This is really helpful and inspiring.
(1) Thank you, this totally makes sense.

(2) Here I was actually curious about the Body Part Timeline Control figures you posted on your project page. I am not sure if I understand Section A.2 correctly: for example, if we want the upper body to perform speech-giving postures while the lower body is running, we can generate two full-body motion clips using two different prompts. In the next stage, we use MaskControl with an empty text, take the lower body from the running motion and the upper body from the speech motion as the control-signal input, and the network outputs a naturally fused motion by in-betweening. (Let's assume their pelvis positions are already aligned by pre-processing.) And we don't need to re-train the Logits Regularizer if we use this strategy?

(3) Thank you. I will refactor my text prompt based on this.