Open-Sora: Why does the 3D VAE output, on close inspection, consist of a 64*64 grid?

Open Dorniwang opened this issue 1 year ago • 4 comments

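(Image: 20240723-231619) As shown in the image, reconstructing the Sora example directly with the 3D VAE gives a result made up of 64*64 patches: a reconstructed 512*512 video has 8*8 patches, and a 1024*1024 video has 16*16 patches. I searched the whole codebase and could not find anywhere that these patches are constructed. A 64*64 pixel patch should correspond to a group of 8*8 latents being processed together, but there is no such operation in the code.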
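The patch counts above follow from simple division, which can be checked with a minimal sketch. It assumes the 3D VAE downsamples each spatial dimension by 8x, which is what "64*64 pixels corresponds to 8*8 latents" implies; the names `grid_layout`, `PATCH_PIXELS`, and `SPATIAL_DOWNSAMPLE` are hypothetical and not part of the Open-Sora code.

```python
# Assumption: 8x spatial downsampling, inferred from the post
# (a 64*64 pixel patch corresponds to 8*8 latents).
SPATIAL_DOWNSAMPLE = 8
PATCH_PIXELS = 64  # observed grid cell size in the reconstruction


def grid_layout(height: int, width: int) -> dict:
    """Return the number of grid cells, the latent size, and latents per cell side."""
    return {
        "cells": (height // PATCH_PIXELS, width // PATCH_PIXELS),
        "latent_size": (height // SPATIAL_DOWNSAMPLE, width // SPATIAL_DOWNSAMPLE),
        "latents_per_cell_side": PATCH_PIXELS // SPATIAL_DOWNSAMPLE,
    }


for size in (512, 1024):
    print(size, grid_layout(size, size))
```

Under that assumption, each visible cell covers an 8*8 block of latent positions, regardless of the video resolution.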

Jul 23 '24 15:07 Dorniwang

This issue is stale because it has been open for 7 days with no activity.

Jul 31 '24 01:07 github-actions[bot]

Is this what you are looking for? https://github.com/hpcaitech/Open-Sora/blob/476b6dc79720e5d9ddfb3cd589680b2308871926/opensora/models/layers/blocks.py#L79

Aug 07 '24 04:08 JThh

Is this what you are looking for?

https://github.com/hpcaitech/Open-Sora/blob/476b6dc79720e5d9ddfb3cd589680b2308871926/opensora/models/layers/blocks.py#L79

This PatchEmbed3D processes the 3D VAE encoder's output and feeds the patched latents into the DiT model. But what I observed is caused entirely by the 3D VAE: I had the 3D VAE reconstruct a real video and saw the same phenomenon.
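One way to quantify the grid on a reconstructed frame, independent of the DiT, is to compare pixel jumps at multiples of 64 with jumps everywhere else. This is a rough sketch, not anything from the Open-Sora repository: `seam_ratio` and its `period` parameter are hypothetical, and the frame is assumed to be already loaded as a 2D grayscale NumPy array.

```python
import numpy as np


def _boundary_ratio(diffs: np.ndarray, period: int) -> float:
    """Mean jump on grid boundaries divided by mean jump elsewhere."""
    idx = np.arange(diffs.size)
    # diffs[i] is the jump between pixel i and pixel i + 1, so the seam
    # between pixels period - 1 and period sits at index period - 1.
    on_seam = (idx + 1) % period == 0
    if not on_seam.any() or on_seam.all():
        return 1.0
    return float(diffs[on_seam].mean() / (diffs[~on_seam].mean() + 1e-12))


def seam_ratio(frame: np.ndarray, period: int = 64) -> float:
    """Average of the horizontal and vertical boundary ratios for a 2D frame.

    A value well above 1 suggests visible seams every `period` pixels.
    """
    frame = frame.astype(np.float64)
    dx = np.abs(np.diff(frame, axis=1)).mean(axis=0)  # jumps between columns
    dy = np.abs(np.diff(frame, axis=0)).mean(axis=1)  # jumps between rows
    return 0.5 * (_boundary_ratio(dx, period) + _boundary_ratio(dy, period))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noise = rng.normal(0.0, 1.0, (256, 256))
    # Synthetic stand-in for a gridded reconstruction: one offset per 64*64 cell.
    offsets = np.kron(np.arange(16.0).reshape(4, 4) * 10.0, np.ones((64, 64)))
    print("plain noise:", seam_ratio(noise))
    print("with 64*64 cells:", seam_ratio(noise + offsets))
```

Running this on the original frame and on its reconstruction would show whether the seams are introduced by the VAE round trip; a ratio near 1 for the input and well above 1 for the output would point at the VAE.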

Aug 08 '24 01:08 Dorniwang

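This is especially true for videos with rich detail, such as flower fields or grass, where the grid is very obvious. The VAE still has a lot of room for improvement.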

Aug 13 '24 08:08 tyz1994

This issue is stale because it has been open for 7 days with no activity.

Aug 23 '24 01:08 github-actions[bot]

This issue was closed because it has been inactive for 7 days since being marked as stale.

Aug 30 '24 01:08 github-actions[bot]
