Open-Sora-Plan
Potential bug in sequence parallel training for i2v models
Hi there,
Thanks for the amazing work!
Recently I tried to fine-tune the 93x480p_i2v checkpoint with sp_size=8 on 8 NPUs, but the fine-tuned results are terrible and show significant artifacts.
However, when I set sp_size=1, the fine-tuned results look fine.
Do you have any idea what might cause this? I have been trying to track it down for a long time.
FYI, I have verified that the sequence parallel forward pass is okay: its output matches the non-sequence-parallel forward pass. However, in the backward pass the gradients appear to differ.
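In case it helps to localize the mismatch, here is a minimal sketch of how the two gradient sets could be compared per parameter. It assumes the gradients from the sp_size=1 run and the sp_size=8 run have each been dumped to a dict mapping parameter name to a flat list of floats (for example via `p.grad.flatten().tolist()` in PyTorch). The helper names `relative_grad_error` and `worst_offenders` and the parameter names in the example are hypothetical, not part of Open-Sora-Plan.

```python
import math


def relative_grad_error(ref, test, eps=1e-12):
    """Per-parameter relative L2 error between two gradient dumps.

    ref, test: dict mapping parameter name -> flat list of floats.
    A missing parameter or a length mismatch is reported as infinity.
    """
    report = {}
    for name, g_ref in ref.items():
        g_test = test.get(name)
        if g_test is None or len(g_test) != len(g_ref):
            report[name] = math.inf
            continue
        diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(g_ref, g_test)))
        norm = math.sqrt(sum(a * a for a in g_ref))
        report[name] = diff / (norm + eps)
    return report


def worst_offenders(report, tol=1e-3):
    """Parameters whose relative error exceeds tol, largest error first."""
    bad = [(name, err) for name, err in report.items() if err > tol]
    return sorted(bad, key=lambda item: item[1], reverse=True)


# Toy example with hypothetical parameter names: the second gradient is
# scaled by 1/8 in the "test" dump, the first one matches exactly.
ref = {"blocks.0.attn.weight": [1.0, 2.0, 3.0], "blocks.0.mlp.weight": [0.5, -0.5]}
test = {"blocks.0.attn.weight": [1.0, 2.0, 3.0], "blocks.0.mlp.weight": [0.0625, -0.0625]}

report = relative_grad_error(ref, test)
offenders = worst_offenders(report)
```

Sorting by relative error makes it easier to see whether the mismatch is concentrated in a few layers or spread across all parameters, for instance as a uniform scale factor.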
If you are interested, we can also discuss this by email.
Best regards, Jin