
Rewriting DiT/Latte into StableDiffusion3 MMDiT

Open kabachuha opened this issue 4 months ago • 6 comments

See https://github.com/PKU-YuanGroup/Open-Sora-Plan/issues/43 for the diagram

  • [x] Rewriting Transformer Blocks to process both text and latents
      • [x] Joint Attention tweak
      • [x] Layer norms, RMS norms
      • [x] Extracting params from the modified timestep embed
      • [x] Fix param tensor sizes
  • [x] Add text encoders (CLIP and T5)
  • [x] Add pooling of text embeds influencing the timestep embed through an MLP
  • [ ] Test that it works, somehow...
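To make the checklist concrete, here is a minimal sketch of the joint-attention idea from the SD3 MMDiT diagram (all names here are illustrative, not taken from the actual PR): text and latent tokens keep separate QKV and output projections, but attention runs over the concatenated token sequence so the two modalities mix.

```python
import torch
import torch.nn as nn

class JointAttention(nn.Module):
    """MMDiT-style joint attention sketch: two streams, one attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Separate projections per modality, as in the SD3 block diagram.
        self.qkv_x = nn.Linear(dim, dim * 3)  # latent (video/image) tokens
        self.qkv_c = nn.Linear(dim, dim * 3)  # text tokens
        self.proj_x = nn.Linear(dim, dim)
        self.proj_c = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, c: torch.Tensor):
        B, Nx, D = x.shape
        Nc = c.shape[1]
        # Project each stream, then concatenate along the token axis
        # so attention mixes text and latent information.
        qkv = torch.cat([self.qkv_x(x), self.qkv_c(c)], dim=1)
        qkv = qkv.view(B, Nx + Nc, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)
        out = nn.functional.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(B, Nx + Nc, D)
        # Split back into the two streams, each with its own output proj.
        return self.proj_x(out[:, :Nx]), self.proj_c(out[:, Nx:])
```

The real block also carries the per-stream layer/RMS norms and the adaLN modulation extracted from the timestep embed, which this sketch omits.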

Everyone is welcome to contribute!

Current Latte/MMDiT diff for readability: diff.txt

kabachuha avatar Mar 06 '24 14:03 kabachuha

Great!

qqingzheng avatar Mar 06 '24 14:03 qqingzheng

Also making a mirror PR to OpenDiT https://github.com/NUS-HPC-AI-Lab/OpenDiT/pull/92

kabachuha avatar Mar 07 '24 11:03 kabachuha

@LinB203 @sennnnn, anyone willing to test?

kabachuha avatar Mar 12 '24 20:03 kabachuha

> @LinB203 @sennnnn, anyone willing to test?

I will check it in a few days.

LinB203 avatar Mar 13 '24 05:03 LinB203

Could you provide an unconditional or class-conditional version, so that we can test it quickly? Otherwise the cost of testing text2video is too high. Thank you.

LinB203 avatar Mar 13 '24 06:03 LinB203

Well, the entire point of MMDiT is to process both text and image embeddings at the same time. The cost may not be as high if the better structure allows for greater quality while reducing the parameter count. (And with the dataset precaching in #136, the experiments will be less risky.)

I think training on WebVid's small low-res subset can show whether it works well.
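For reference, a cheap class-conditional test could reuse the same conditioning path, since the pooled text embedding only enters through the timestep embed. A rough sketch under that assumption (module and argument names are hypothetical): the pooled embedding is mixed into the sinusoidal timestep embedding via an MLP, and for class-conditioning it can be swapped for a learned embedding table.

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard sinusoidal timestep embedding, as in DiT/Latte."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half) / half)
    args = t[:, None].float() * freqs[None]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

class ConditionEmbedder(nn.Module):
    """Builds the vector that modulates the DiT blocks (adaLN input)."""

    def __init__(self, dim: int, num_classes=None):
        super().__init__()
        self.dim = dim
        self.t_mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.y_mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        # Optional class table for a quick class-conditional variant.
        self.class_emb = nn.Embedding(num_classes, dim) if num_classes else None

    def forward(self, t, pooled=None, labels=None):
        cond = self.t_mlp(timestep_embedding(t, self.dim))
        if pooled is not None:
            # Text-conditional: pooled CLIP embedding through an MLP.
            cond = cond + self.y_mlp(pooled)
        elif self.class_emb is not None and labels is not None:
            # Class-conditional fallback for cheap testing.
            cond = cond + self.class_emb(labels)
        return cond  # unconditional if neither pooled nor labels is given
```

With neither `pooled` nor `labels` supplied, the same module already gives the unconditional version asked for above.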

kabachuha avatar Mar 13 '24 08:03 kabachuha