Open-Sora-Plan: Potential bug in sequence parallel training for i2v models

Open jinjinw opened this issue 5 months ago • 0 comments

Hi there,

Thanks for the amazing work!

Recently I tried to fine-tune the 93x480p_i2v checkpoint with sp_size=8 on 8 NPUs, but the fine-tuned results are terrible, with significant artifacts.

However, when I set sp_size=1, the fine-tuned results look fine.

Do you have any clues about this issue? I have been trying to figure it out for a long time.

FYI, I have verified that the sequence parallel forward pass is okay: its output is equivalent to the non-sequence-parallel forward result. However, when I ran the backward pass, the gradients seem to be different.
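
To make the kind of gradient comparison concrete, here is a minimal toy sketch in plain Python. It is not Open-Sora-Plan code: the scalar model `y = w * x`, the mean-squared-error loss, and every function name are assumptions chosen only for illustration. It shows that splitting a sequence into equal shards leaves the forward computation unchanged, while the gradient of a shared weight only matches the unsharded reference if the per-rank gradients are reduced in a way that is consistent with how the loss is normalized.

```python
# Toy illustration only. The model, loss, and all names here are hypothetical
# and are not taken from Open-Sora-Plan.

def full_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over a sequence."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def shard(seq, sp_size):
    """Split a sequence into sp_size contiguous chunks of equal length."""
    assert len(seq) % sp_size == 0, "sequence length must divide evenly"
    step = len(seq) // sp_size
    return [seq[i * step:(i + 1) * step] for i in range(sp_size)]

def sharded_grads(w, xs, ys, sp_size):
    """Per-rank gradients, each computed from that rank's local mean loss."""
    x_chunks = shard(xs, sp_size)
    y_chunks = shard(ys, sp_size)
    return [full_grad(w, cx, cy) for cx, cy in zip(x_chunks, y_chunks)]

w = 0.5
xs = [float(i) for i in range(1, 9)]   # sequence of 8 "tokens"
ys = [2.0 * x for x in xs]             # targets
sp_size = 4

reference = full_grad(w, xs, ys)                 # sp_size=1 baseline
local = sharded_grads(w, xs, ys, sp_size)        # one gradient per rank

grad_avg = sum(local) / sp_size   # averaging across ranks (equal shards)
grad_sum = sum(local)             # summing without rescaling
grad_rank0 = local[0]             # no reduction at all

print("reference :", reference)
print("averaged  :", grad_avg, "diff", abs(grad_avg - reference))
print("summed    :", grad_sum, "diff", abs(grad_sum - reference))
print("rank 0    :", grad_rank0, "diff", abs(grad_rank0 - reference))
```

In this toy setting only the averaged gradient agrees with the reference. The summed and unreduced variants differ even though every rank's forward computation is correct, which is the same symptom as described above. Whether this is the actual cause in the i2v training path is not established here.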

If you are interested, we can also communicate by email.

Best regards, Jin

Sep 19 '24 14:09 jinjinw
