Question about the input of ctrlnet
Hi, dear author, thanks for your excellent work! While reading the code, I noticed that the inputs to the CtrlNet (`self.seqTransEncoder_control`) and the backbone (`self.seqTransEncoder`) are the same:
https://github.com/exitudio/MaskControl/blob/d3ef3580057fb5fcf22fd7c7484ef8bad69478df/models/mask_transformer/control_transformer.py#L335C1-L350C85
This does not match Fig. 2(c) in your paper. Specifically, both the CtrlNet and the backbone receive the same variable, `control_input`, and this variable is not modified between the two calls, so the two networks process exactly the same input.
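To make the data flow concrete, here is a toy paraphrase of the pattern I am describing (layer sizes, names, and shapes are simplified assumptions, not the actual code from the file above):

```python
import torch
import torch.nn as nn

# Toy paraphrase of the data flow in question; only the shared input matters here.
d_model = 64
backbone = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, nhead=4), num_layers=2)  # frozen main network
ctrlnet  = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, nhead=4), num_layers=2)  # trainable ControlNet

x = torch.randn(16, 2, d_model)            # token embeddings (seq_len, batch, dim)
control_emb = torch.randn(16, 2, d_model)  # embedded control signal (e.g., joint trajectories)

control_input = x + control_emb            # control is injected once, here

ctrl_feats = ctrlnet(control_input)        # the ControlNet branch sees the control...
out = backbone(control_input)              # ...and the frozen backbone receives the SAME control_input
```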
Could you please explain the reason for this implementation choice? Looking forward to your reply!
Best regards.
Hi, Thanks for pointing that out! Yes, the control shouldn’t go to the main (frozen) network. I actually found this after the submission and tried removing it, but the results were pretty much the same. So I decided to leave it as is to keep consistency with all the numbers reported in the paper.
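Reusing the toy setup from your sketch above, the intended flow from Fig. 2(c) would look roughly like this (a rough sketch, not the actual repo code):

```python
ctrl_feats = ctrlnet(x + control_emb)   # the control signal should go only into the ControlNet branch
out = backbone(x)                       # the frozen backbone should receive x WITHOUT the control signal
```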
Thank you so much for your reply! I have learned a great deal from it.
I do have a few follow-up questions, and I would be very grateful if you could offer some further insights.
- I noticed that the CtrlNet loss weights in the paper were set to 0.1 * CE_loss + 0.9 * joints_loss. However, in the latest commit they appear to use a 0.5 : 0.5 ratio. Could I please ask about the reasoning behind this modification?
- Furthermore, I encountered a significant overfitting issue when attempting to integrate CtrlNet and the logits optimization with different masked generative backbones: accuracy on the training set is consistently high, but performance on the validation set is poor. Do you have any suggestions or advice regarding this?
I look forward to your reply!
- I use 0.1 * xent + 0.9 * joint_loss for "pelvis only", and 0.5 * xent + 0.9 * joint_loss for "all joints" (see the sketch below this list).
- I would suggest evaluating ControlNet alone to see how it performs, since Logits Optimization involves no training. Also, Logits Optimization can improve the control error but may hurt the FID score, because the optimization can push the result out of distribution.
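To spell out the weighting (a small sketch with my own variable names, not the actual training-script identifiers):

```python
# 0.1 * xent for pelvis-only control, 0.5 * xent for all-joints control;
# the joint-loss weight stays at 0.9 in both cases.
def ctrlnet_loss(xent, joint_loss, all_joints=False):
    w_xent = 0.5 if all_joints else 0.1
    return w_xent * xent + 0.9 * joint_loss
```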
Dear author,
First of all, thank you for your previous response—it was very helpful and I learned a lot from it.
After further studying the code and paper, I still have two points that I’m unclear about. I would be very grateful if you could kindly share some insights:
- I noticed that MaskControl adopts an all-token cross-entropy approach to train the backbone, while the original MoMask does not. Could you explain the main advantages of this design choice? Is it intended to improve certain aspects of performance or training stability? (A toy sketch of the distinction I mean follows this list.)
- According to the ablation study, when using ControlNet or logits optimization alone, the motion quality metrics (e.g., FID, RTOP3, foot skating) degrade compared to the "no control" setting. However, when both are used together, the model achieves the best performance. I find this phenomenon quite puzzling: why would two individually underperforming modules combine to produce such improved results? If you could share your interpretation or point me to relevant literature that explains such synergy, I would really appreciate it.
Looking forward to your reply. Thank you again for your inspiring work and time!
Best regards!