[Feature] More Training details for Wan2.2 14B MoE self-forcing distillation
Motivation
Hi there!
Great work on the Wan2.2 14B MoE self-forcing distillation – the results are impressive!
I’m curious whether you have plans to share more training details, especially regarding how you handled the memory challenges. In particular, I’d love to learn how you deal with the memory spike caused by backpropagating through the KV cache during training.
Thanks a lot for your contributions and looking forward to your insights!
Related resources
No response
Our current training recipe isn't good enough yet, so we're still iterating on different settings. We'll release it once we've converged on a final recipe.
As for the KV cache memory spike, we unfortunately haven't addressed it yet, since we're focusing on quality first. But we do plan to tackle it with the trick mentioned in Krea's blog: https://www.krea.ai/blog/krea-realtime-14b#
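One common way to mitigate that spike (not necessarily the exact trick in Krea's post) is to stop gradients from flowing back through the cached keys/values of previously generated chunks, e.g. by detaching the cache or recomputing it under torch.no_grad(), so the backward pass only has to hold activations for the chunk currently being denoised. A minimal toy sketch of that truncation (ToyAttention and all the names/shapes below are made up for illustration, not the actual Wan2.2 training code):

```python
# Toy sketch only: ToyAttention, chunk_len, num_chunks etc. are made-up names
# for illustration, not the Wan2.2 or Krea training code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyAttention(nn.Module):
    """Single attention layer with an external KV cache (causal masking omitted)."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, kv_cache=None):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (b, t, d) -> (b, heads, t, head_dim)
        q, k, v = (z.reshape(b, t, self.heads, -1).transpose(1, 2) for z in (q, k, v))
        if kv_cache is not None:
            k = torch.cat([kv_cache[0], k], dim=2)  # also attend over cached context
            v = torch.cat([kv_cache[1], v], dim=2)
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, t, d)
        return self.proj(out), (k, v)

model = ToyAttention()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

batch, chunk_len, dim, num_chunks = 2, 8, 64, 4
kv_cache = None
for step in range(num_chunks):
    x = torch.randn(batch, chunk_len, dim)   # stand-in for the current latent chunk
    out, kv_cache = model(x, kv_cache)
    loss = out.pow(2).mean()                 # stand-in for the distillation loss
    loss.backward()
    opt.step()
    opt.zero_grad()
    # Detach the cache so gradients for the next chunk stop at the cache boundary:
    # backward then only needs the current chunk's activations, not the whole rollout.
    kv_cache = tuple(c.detach() for c in kv_cache)
```

The trade-off is that the loss on a given chunk no longer sends gradients into the earlier chunks that produced the cached context, which is why it reads as a truncated-backprop approximation rather than a free win.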
What do you think about the direct forcing trick described here: https://arxiv.org/html/2510.01784v2 ? Have you tried similar approaches?