Open-Sora-Plan
Rewriting DiT/Latte into StableDiffusion3 MMDiT
See https://github.com/PKU-YuanGroup/Open-Sora-Plan/issues/43 for the diagram
- [x] Rewriting Transformer Blocks to process both text and latents
  - [x] Joint attention tweak
  - [x] Layer norms, RMS norms
  - [x] Extracting params from the modified timestep embed
  - [x] Fix param tensor sizes
- [x] Add text encoders (CLIP and T5)
- [x] Add pooling of text embeds influencing the timestep embed through an MLP
- [ ] Test that it works, somehow...
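To make the checklist concrete, here is a minimal numpy sketch of the two central pieces: MMDiT-style joint attention (separate QKV projections per stream, attention over the concatenated sequence) and a timestep embedding modulated by the pooled text embedding through a small MLP. All weights and shapes below are random stand-ins for illustration, not the actual code in this PR.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(text, latent, rng):
    """MMDiT-style joint attention: each stream keeps its own QKV
    projection, but attention runs over the concatenated sequence,
    so text and latent tokens attend to each other."""
    d = text.shape[-1]
    Wt = rng.standard_normal((3, d, d)) / np.sqrt(d)  # text QKV (random stand-ins)
    Wl = rng.standard_normal((3, d, d)) / np.sqrt(d)  # latent QKV
    qt, kt, vt = (text @ Wt[i] for i in range(3))
    ql, kl, vl = (latent @ Wl[i] for i in range(3))
    q = np.concatenate([qt, ql], axis=0)
    k = np.concatenate([kt, kl], axis=0)
    v = np.concatenate([vt, vl], axis=0)
    out = softmax(q @ k.T / np.sqrt(d)) @ v
    return out[:len(text)], out[len(text):]  # split back per stream

def conditioning_vector(t, pooled_text, rng):
    """Sinusoidal timestep embedding plus the pooled text embedding
    pushed through a small MLP; the sum would drive the per-block
    modulation (scale/shift) params."""
    d = pooled_text.shape[-1]
    half = d // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    t_emb = np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])
    W1 = rng.standard_normal((d, d)) / np.sqrt(d)  # hypothetical MLP weights
    W2 = rng.standard_normal((d, d)) / np.sqrt(d)
    mlp = np.maximum(pooled_text @ W1, 0.0) @ W2   # ReLU here; SiLU in practice
    return t_emb + mlp

rng = np.random.default_rng(0)
text = rng.standard_normal((7, 64))     # 7 text tokens
latent = rng.standard_normal((16, 64))  # 16 latent patches
t_out, l_out = joint_attention(text, latent, rng)
cond = conditioning_vector(500.0, text.mean(axis=0), rng)
print(t_out.shape, l_out.shape, cond.shape)  # (7, 64) (16, 64) (64,)
```

Note that each stream comes back with its original length, so the per-modality output projections and norms can stay separate, as in the checklist above.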
Everyone is welcome to contribute!
Current Latte/MMDiT diff for readability: diff.txt
Great!
Also making a mirror PR to OpenDiT https://github.com/NUS-HPC-AI-Lab/OpenDiT/pull/92
@LinB203 @sennnnn, anyone willing to test?
I will check it in a few days.
Could you provide an unconditional or class-conditional version, so that we can test it quickly? Otherwise the cost of testing text2video is very high. Thank you.
Well, the entire point of MMDiT is to process text and image embeddings at the same time. The cost may not be that high if the better structure allows for greater quality while reducing the parameter count. (And with dataset precaching, #136, the experiments will be less risky.)
I think training on WebVid's small low-res subset can show whether it works well.
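For a cheap class-conditional or unconditional smoke test as suggested above, one hypothetical option is to swap the text token stream for a single learned class embedding (or a null token for the unconditional case), leaving the MMDiT blocks untouched. The names and shapes here are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, d = 10, 64
# hypothetical learned class-embedding table (trained in practice)
class_table = rng.standard_normal((num_classes, d))
null_token = rng.standard_normal((1, d))  # unconditional / CFG-dropout token

def class_token_stream(label=None):
    """Return a one-token 'text' stream: the class embedding, or the
    null token for unconditional generation. The joint-attention blocks
    then run unchanged, just with sequence length 1 on the text side."""
    if label is None:
        return null_token
    return class_table[label][None, :]

print(class_token_stream(3).shape)     # (1, 64)
print(class_token_stream(None).shape)  # (1, 64)
```

This would keep the architecture identical to the text2video path, so a passing class-conditional run on a small subset would still exercise the joint attention and conditioning code.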