Open-Sora
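Why does the 3D VAE output, on close inspection, consist of a 64*64 grid?
As shown in the figure, when I reconstruct the Sora example directly with the 3D VAE, the result is made up of 64*64 patches: a reconstructed 512*512 video contains 8*8 patches, and a 1024*1024 video contains 16*16 patches. I searched the whole codebase and could not find anywhere that builds such patches. A 64*64 patch in pixel space should correspond to processing the latents in groups of 8*8, but the code contains no such operation.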
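To make the arithmetic above explicit, here is a minimal sketch. The helper name `patch_grid` is hypothetical, and the spatial downsampling factor of 8 is an assumption inferred from the numbers in this post (64 pixels corresponding to 8 latents), not something verified against the Open-Sora code.

```python
def patch_grid(height, width, patch_px=64, vae_downsample=8):
    """Count visible patches per side and latents per patch side.

    patch_px: side of the observed grid cell in pixels (64 in this issue).
    vae_downsample: ASSUMED spatial downsampling factor of the VAE.
    """
    if height % patch_px or width % patch_px:
        raise ValueError("frame size must be a multiple of patch_px")
    if patch_px % vae_downsample:
        raise ValueError("patch_px must be a multiple of vae_downsample")
    return {
        # number of grid cells along (height, width)
        "patches": (height // patch_px, width // patch_px),
        # side length of one grid cell measured in latent positions
        "latents_per_patch_side": patch_px // vae_downsample,
    }


for size in (512, 1024):
    print(size, patch_grid(size, size))
```

Under that assumption, a 512*512 frame gives an 8*8 grid, a 1024*1024 frame gives a 16*16 grid, and each cell spans 8*8 latent positions, which matches the counts reported above.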
This issue is stale because it has been open for 7 days with no activity.
Is this what you are looking for? https://github.com/hpcaitech/Open-Sora/blob/476b6dc79720e5d9ddfb3cd589680b2308871926/opensora/models/layers/blocks.py#L79
That PatchEmbed3D processes the output of the 3D VAE encoder and then feeds the patched latents into the DiT model. What I observed, however, is caused entirely by the 3D VAE: I had the 3D VAE reconstruct a real video and saw this artifact.
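It is especially noticeable in videos with rich detail, such as flower fields or grass, where the grid is very obvious. The VAE still has considerable room for improvement.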
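For anyone who wants to quantify how visible the grid is on a reconstructed frame, here is a rough diagnostic sketch. It is not part of Open-Sora; `seam_ratio` is a hypothetical helper, and it assumes the grid lines fall on multiples of 64 pixels as described in this issue. It compares the average pixel jump across the assumed grid lines with the average jump everywhere else, so a value well above 1 suggests visible seams.

```python
import numpy as np


def seam_ratio(frame, period=64):
    """Mean pixel jump across grid lines divided by mean jump elsewhere.

    frame: one reconstructed frame, shape (H, W) or (H, W, C).
    period: ASSUMED grid spacing in pixels (64 in this issue).
    """
    img = np.asarray(frame, dtype=np.float64)
    if img.ndim == 3:
        img = img.mean(axis=2)  # collapse channels to one intensity map
    if min(img.shape) <= period:
        raise ValueError("frame must be larger than one grid period")

    # dx[j] is the mean jump between columns j and j+1; dy likewise for rows.
    dx = np.abs(np.diff(img, axis=1)).mean(axis=0)
    dy = np.abs(np.diff(img, axis=0)).mean(axis=1)

    def ratio(d):
        idx = np.arange(d.size)
        on_seam = (idx + 1) % period == 0  # jump crosses a grid line
        return d[on_seam].mean() / (d[~on_seam].mean() + 1e-12)

    return 0.5 * (ratio(dx) + ratio(dy))


# Synthetic check: noise alone versus noise plus a per-block brightness offset.
rng = np.random.default_rng(0)
clean = rng.normal(size=(256, 256))
offsets = (np.indices((4, 4)).sum(axis=0) % 2) * 3.0  # checkerboard of 0 and 3
tiled = clean + np.kron(offsets, np.ones((64, 64)))
print("clean:", seam_ratio(clean), "tiled:", seam_ratio(tiled))
```

On real reconstructions, running this per frame on the original and on the VAE output should make it possible to compare detailed scenes (flowers, grass) with smoother ones, although natural image edges will also contribute to the score.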
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale.