
Why does the 3D VAE output, on close inspection, consist of a 64*64 grid?

Open Dorniwang opened this issue 1 year ago • 4 comments

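[Image: 20240723-231619] As shown in the figure, when the 3D VAE is used directly to reconstruct Sora's example video, the result turns out to be composed of 64*64 patches: a reconstructed 512*512 video has 8*8 patches, and a 1024*1024 video has 16*16 patches. I searched the whole codebase and could not find anywhere that patches are constructed. A 64*64 patch should correspond to processing the latents in groups of 8*8, but the code contains no such operation.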
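The patch counts above follow from simple arithmetic, sketched below. The helper `patch_grid` is hypothetical and not part of Open-Sora; the spatial downsampling factor of 8 is inferred from the statement that a 64*64 pixel patch corresponds to 8*8 latents, not read from the VAE config.

```python
def patch_grid(height, width, patch=64, spatial_downsample=8):
    """Return ((patches_down, patches_across), latents_per_patch_side).

    patch: observed pixel size of one grid cell (64 in the report above).
    spatial_downsample: assumed pixel-to-latent ratio (inferred, 64 / 8 = 8).
    """
    if height % patch or width % patch:
        raise ValueError("frame size must be a multiple of the patch size")
    patches = (height // patch, width // patch)
    latents_per_patch_side = patch // spatial_downsample
    return patches, latents_per_patch_side


for size in (512, 1024):
    patches, latents = patch_grid(size, size)
    print(f"{size}x{size}: {patches[0]}*{patches[1]} patches, "
          f"{latents}*{latents} latents per patch")
```

Under these assumptions, any operation in the VAE that treats latents in 8*8 spatial blocks would produce seams at exactly the positions described.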

Dorniwang avatar Jul 23 '24 15:07 Dorniwang

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] avatar Jul 31 '24 01:07 github-actions[bot]

Is this what you are finding? https://github.com/hpcaitech/Open-Sora/blob/476b6dc79720e5d9ddfb3cd589680b2308871926/opensora/models/layers/blocks.py#L79

JThh avatar Aug 07 '24 04:08 JThh

Is this what you are finding?

https://github.com/hpcaitech/Open-Sora/blob/476b6dc79720e5d9ddfb3cd589680b2308871926/opensora/models/layers/blocks.py#L79

This PatchEmbed3D processes the output of the 3D VAE encoder and then feeds the patched latents into the DiT model. But what I observed is caused entirely by the 3D VAE: I had the 3D VAE reconstruct a real video and found this phenomenon.
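One rough way to confirm that the artifact sits on 64-pixel boundaries in a reconstructed frame is to compare pixel differences across those boundaries with differences elsewhere. The sketch below is only an illustration: `seam_ratio` and `synthetic_frame` are hypothetical helpers, the frame is a plain 2D list of grayscale values, and the demo uses synthetic data rather than actual VAE output.

```python
def seam_ratio(frame, period=64):
    """Mean |difference| across grid seams divided by the mean elsewhere.

    frame: 2D list of grayscale values with rows of equal length.
    A ratio well above 1.0 suggests discontinuities aligned to the grid.
    """
    seam, other = [], []
    h, w = len(frame), len(frame[0])
    for y in range(h):                      # horizontal neighbours
        for x in range(1, w):
            d = abs(frame[y][x] - frame[y][x - 1])
            (seam if x % period == 0 else other).append(d)
    for y in range(1, h):                   # vertical neighbours
        for x in range(w):
            d = abs(frame[y][x] - frame[y - 1][x])
            (seam if y % period == 0 else other).append(d)
    if not seam or not other:
        raise ValueError("frame must be larger than one grid period")
    mean_seam = sum(seam) / len(seam)
    mean_other = sum(other) / len(other)
    if mean_other == 0:
        return float("inf") if mean_seam > 0 else 1.0
    return mean_seam / mean_other


def synthetic_frame(size=16, period=4, step=0.0):
    """Smooth gradient plus a checkerboard offset of `step` per tile."""
    return [[0.1 * (x + y) + step * ((x // period + y // period) % 2)
             for x in range(size)] for y in range(size)]


print(seam_ratio(synthetic_frame(step=0.0), period=4))  # smooth frame
print(seam_ratio(synthetic_frame(step=5.0), period=4))  # tiled frame
```

For a real check, the same function could be applied to one grayscale frame of the reconstruction with `period=64`, and to the original frame as a baseline.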

Dorniwang avatar Aug 08 '24 01:08 Dorniwang

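This is especially true for videos with rich detail, such as flower fields or grass, where the grid is very obvious. The VAE still has considerable room for improvement.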

tyz1994 avatar Aug 13 '24 08:08 tyz1994

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] avatar Aug 23 '24 01:08 github-actions[bot]

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] avatar Aug 30 '24 01:08 github-actions[bot]