
Rewriting DiT/Latte into StableDiffusion3 MMDiT

Open kabachuha opened this issue 4 months ago • 6 comments

See https://github.com/PKU-YuanGroup/Open-Sora-Plan/issues/43 for the diagram

  • [x] Rewriting Transformer Blocks to process both text and latents
      • [x] Joint Attention tweak
      • [x] Layer norms, RMS norms
      • [x] Extracting params from the modified timestep embed
      • [x] Fix param tensor sizes
  • [x] Add text encoders (CLIP and T5)
  • [x] Add pooling of text embeds influencing the timestep embed through an MLP
  • [ ] Test that it works, somehow...
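To make the checklist concrete, here is a minimal sketch of the joint-attention idea from the SD3 MMDiT diagram (all names here are illustrative, not taken from the actual PR): text and latent tokens keep separate QKV and output projections, but attention runs over the concatenated token sequence so the two modalities mix.

```python
import torch
import torch.nn as nn

class JointAttention(nn.Module):
    """MMDiT-style joint attention sketch: two streams, one attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Separate projections per modality, as in the SD3 block diagram.
        self.qkv_x = nn.Linear(dim, dim * 3)  # latent (video/image) tokens
        self.qkv_c = nn.Linear(dim, dim * 3)  # text tokens
        self.proj_x = nn.Linear(dim, dim)
        self.proj_c = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, c: torch.Tensor):
        B, Nx, D = x.shape
        Nc = c.shape[1]
        # Project each stream, then concatenate along the token axis
        # so attention mixes text and latent information.
        qkv = torch.cat([self.qkv_x(x), self.qkv_c(c)], dim=1)
        qkv = qkv.view(B, Nx + Nc, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)
        out = nn.functional.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(B, Nx + Nc, D)
        # Split back into the two streams, each with its own output proj.
        return self.proj_x(out[:, :Nx]), self.proj_c(out[:, Nx:])
```

The real block also carries the per-stream layer/RMS norms and the adaLN modulation extracted from the timestep embed, which this sketch omits.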

Everyone is welcome to contribute!

Current Latte/MMDiT diff for readability: diff.txt

kabachuha avatar Mar 06 '24 14:03 kabachuha

Great!

qqingzheng avatar Mar 06 '24 14:03 qqingzheng

Also making a mirror PR to OpenDiT https://github.com/NUS-HPC-AI-Lab/OpenDiT/pull/92

kabachuha avatar Mar 07 '24 11:03 kabachuha

@LinB203 @sennnnn, anyone willing to test?

kabachuha avatar Mar 12 '24 20:03 kabachuha

> @LinB203 @sennnnn, anyone willing to test?

I will check it in a few days.

LinB203 avatar Mar 13 '24 05:03 LinB203

Could you provide an unconditional or class-conditional version, so that we can test it quickly? Otherwise the cost of testing text2video is too high. Thank you.

LinB203 avatar Mar 13 '24 06:03 LinB203

Well, the entire point of MMDiT is to process both text and image embeddings at the same time. The cost may not be as high if the better structure allows for greater quality while reducing the parameter count. (And with the dataset precaching in #136, the experiments will be less risky.)

I think training on WebVid's small low-res subset can show whether it works well.
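For reference, a cheap class-conditional test could reuse the same conditioning path, since the pooled text embedding only enters through the timestep embed. A rough sketch under that assumption (module and argument names are hypothetical): the pooled embedding is mixed into the sinusoidal timestep embedding via an MLP, and for class-conditioning it can be swapped for a learned embedding table.

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard sinusoidal timestep embedding, as in DiT/Latte."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half) / half)
    args = t[:, None].float() * freqs[None]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

class ConditionEmbedder(nn.Module):
    """Builds the vector that modulates the DiT blocks (adaLN input)."""

    def __init__(self, dim: int, num_classes=None):
        super().__init__()
        self.dim = dim
        self.t_mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.y_mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        # Optional class table for a quick class-conditional variant.
        self.class_emb = nn.Embedding(num_classes, dim) if num_classes else None

    def forward(self, t, pooled=None, labels=None):
        cond = self.t_mlp(timestep_embedding(t, self.dim))
        if pooled is not None:
            # Text-conditional: pooled CLIP embedding through an MLP.
            cond = cond + self.y_mlp(pooled)
        elif self.class_emb is not None and labels is not None:
            # Class-conditional fallback for cheap testing.
            cond = cond + self.class_emb(labels)
        return cond  # unconditional if neither pooled nor labels is given
```

With neither `pooled` nor `labels` supplied, the same module already gives the unconditional version asked for above.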

kabachuha avatar Mar 13 '24 08:03 kabachuha