OleehyO
Only the CogVideoX1.5 series models support variable resolutions.
xDiT mainly uses multiple GPUs to accelerate inference; it will not save much memory. For more details, we recommend consulting the xDiT developers...
The functionality in the CogVideo repository is no longer maintained. We recommend using [cogkit](https://github.com/THUDM/CogKit) to deploy the CogVideo API server, which is more user-friendly.
Thank you for pointing it out. Indeed, `accelerate` may handle the multi-GPU parallelism settings on its own, so the scheduler should be configured with single-card settings before being prepared...
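To illustrate the point above, here is a minimal sketch of the step accounting (the function name and parameters are hypothetical, not from the training script): since `accelerate` already shards the dataloader across processes, the scheduler's total step count is derived from single-card numbers and is not multiplied by the GPU count again.

```python
# Hypothetical sketch: compute the scheduler's total training steps from
# single-card quantities. Under `accelerate`, the dataloader is sharded
# across processes automatically, so this value should NOT be scaled by
# accelerator.num_processes a second time.
def scheduler_total_steps(dataset_len: int, batch_size: int,
                          grad_accum: int, epochs: int) -> int:
    # optimizer steps per epoch on a single card
    steps_per_epoch = dataset_len // batch_size // grad_accum
    return steps_per_epoch * epochs

# e.g. 10,000 samples, per-device batch 2, gradient accumulation 4, 3 epochs
print(scheduler_total_steps(10_000, 2, 4, 3))
```

The same value would then be passed to the scheduler before `accelerator.prepare(...)`, regardless of how many GPUs are used.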
This seems a bit odd; I don't know why it would try to read a ZIP archive (and your parameters appear to be fine). I suggest you check the [deepspeed...
This is the old code; we recommend looking at the new code in #654: [i2v_dataset](https://github.com/THUDM/CogVideo/pull/654/files#diff-97324c19de2f6786b67869270d248eeef65fa5cd5d101767aa45b3efbefb1b0b) and [trainer](https://github.com/THUDM/CogVideo/pull/654/files#diff-43b8e64ed8482410b9c41b69ae1393735c29bc3575d5dc0dfb933ec8b9941a36R173). For I2V, the user must specify 8N + 1 frames, and the video is then sampled directly at 8N + 1 frames (in the old code, the user entered 8N + 1, the video was sampled at 8N frames, and the image was copied into the first frame to make 8N + 1 frames in total).

Take 81 frames as an example: VAE encoding downsamples temporally by 4x, giving 21 latents. Since CogVideoX1.5 has patch_t = 2, the number of latents must be a multiple of 2, so the first latent is duplicated once, giving 22 latents for training.

Inference works the same way: the user must input 8N + 1 frames, just as in training. Again taking 81 input frames as an example: the count is first converted to 85 frames (4 frames are added) [here](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py#L767-#L793), and then the number of latents is computed [here](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py#L367-#L374): (85 - 1) // 4 + 1 = 22. Before VAE decoding, the duplicated first latent is removed (see [here](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py#L882)), leaving 21 latents. These are then decoded in 10 passes, yielding 9 frames, then 8 frames, 8 frames, ..., 8 frames, for a final 81-frame video (see [this part](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py#L1186-#L1210) of the code).
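The frame/latent accounting above can be sketched as a small helper (this is an illustrative reconstruction of the arithmetic, not the diffusers implementation; it assumes 4x temporal VAE downsampling, patch_t = 2, and chunked decoding where the first pass yields 9 frames and each later pass yields 8):

```python
# Sketch of the 8N + 1 frame accounting for CogVideoX1.5 I2V inference.
def frame_schedule(num_frames: int):
    assert num_frames % 8 == 1, "user must pass 8N + 1 frames"
    padded_frames = num_frames + 4                # 81 -> 85 (4 frames added)
    num_latents = (padded_frames - 1) // 4 + 1    # (85 - 1) // 4 + 1 = 22
    latents_to_decode = num_latents - 1           # drop duplicated latent: 21
    num_passes = latents_to_decode // 2           # 10 decode passes
    chunks = [9] + [8] * (num_passes - 1)         # 9, 8, 8, ..., 8 frames
    assert sum(chunks) == num_frames              # original frame count back
    return num_latents, latents_to_decode, chunks

print(frame_schedule(81))
```

Running it with other 8N + 1 inputs (e.g. 49) follows the same pattern: pad by 4 frames, compute the latent count, drop the duplicate, then decode in chunks of 9, 8, 8, ...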
Hmm... you could understand it that way, but in practice the video frames are not generated one at a time; strictly speaking, this refers to latents.
I'm not entirely sure why you're encountering this issue on your end. We recommend first checking whether your diffusers version is up to date. If the problem persists,...
It's quite strange because we didn't seem to encounter this issue during multi-GPU training. We will attempt to reproduce it later. We recommend you switch to [cogkit](https://github.com/THUDM/CogKit) for training first,...
Yes, but this will result in lower generation quality. We recommend using longer descriptions.