OleehyO
Only the CogVideoX1.5 series models support variable resolutions.
xDiT mainly uses multiple GPUs to accelerate inference; it will not save much memory. For more details, we recommend consulting the xDiT developers...
The functionality in the CogVideo repository is no longer maintained. We recommend using [cogkit](https://github.com/THUDM/CogKit) to deploy the CogVideo API server, which is more user-friendly.
Thank you for pointing it out. Indeed, `accelerate` may handle the multi-GPU parallelism settings on its own, so the scheduler should be configured with single-card settings before being prepared...
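To illustrate the point above, here is a minimal sketch of the step accounting (the function name and parameters are hypothetical, not from the training script): since `accelerate` already shards the dataloader across processes, the scheduler's total step count is derived from single-card numbers and is not multiplied by the GPU count again.

```python
# Hypothetical sketch: compute the scheduler's total training steps from
# single-card quantities. Under `accelerate`, the dataloader is sharded
# across processes automatically, so this value should NOT be scaled by
# accelerator.num_processes a second time.
def scheduler_total_steps(dataset_len: int, batch_size: int,
                          grad_accum: int, epochs: int) -> int:
    # optimizer steps per epoch on a single card
    steps_per_epoch = dataset_len // batch_size // grad_accum
    return steps_per_epoch * epochs

# e.g. 10,000 samples, per-device batch 2, gradient accumulation 4, 3 epochs
print(scheduler_total_steps(10_000, 2, 4, 3))
```

The same value would then be passed to the scheduler before `accelerator.prepare(...)`, regardless of how many GPUs are used.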
This seems a bit odd; I don't know why it would try to read a ZIP archive (and your parameters appear to be fine). I suggest you check the [deepspeed...
This is the old code; we recommend looking at the new code in #654: [i2v_dataset](https://github.com/THUDM/CogVideo/pull/654/files#diff-97324c19de2f6786b67869270d248eeef65fa5cd5d101767aa45b3efbefb1b0b) and [trainer](https://github.com/THUDM/CogVideo/pull/654/files#diff-43b8e64ed8482410b9c41b69ae1393735c29bc3575d5dc0dfb933ec8b9941a36R173). For I2V, the user must specify 8N + 1 frames, and the video is then sampled directly at 8N + 1 frames (in the old code, the user entered 8N + 1, the video was sampled at 8N frames, and the image was copied into the first frame to make 8N + 1 frames in total).

Take 81 frames as an example: VAE encoding downsamples temporally by 4x, giving 21 latents. Since CogVideoX1.5 has patch_t = 2, the number of latents must be a multiple of 2, so the first latent is duplicated once, giving 22 latents for training.

Inference works the same way: the user must input 8N + 1 frames, just as in training. Again taking 81 input frames as an example: the count is first converted to 85 frames (4 frames are added) [here](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py#L767-#L793), and then the number of latents is computed [here](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py#L367-#L374): (85 - 1) // 4 + 1 = 22. Before VAE decoding, the duplicated first latent is removed (see [here](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py#L882)), leaving 21 latents. These are then decoded in 10 passes, yielding 9 frames, then 8 frames, 8 frames, ..., 8 frames, for a final 81-frame video (see [this part](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py#L1186-#L1210) of the code).
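The frame/latent accounting above can be sketched as a small helper (this is an illustrative reconstruction of the arithmetic, not the diffusers implementation; it assumes 4x temporal VAE downsampling, patch_t = 2, and chunked decoding where the first pass yields 9 frames and each later pass yields 8):

```python
# Sketch of the 8N + 1 frame accounting for CogVideoX1.5 I2V inference.
def frame_schedule(num_frames: int):
    assert num_frames % 8 == 1, "user must pass 8N + 1 frames"
    padded_frames = num_frames + 4                # 81 -> 85 (4 frames added)
    num_latents = (padded_frames - 1) // 4 + 1    # (85 - 1) // 4 + 1 = 22
    latents_to_decode = num_latents - 1           # drop duplicated latent: 21
    num_passes = latents_to_decode // 2           # 10 decode passes
    chunks = [9] + [8] * (num_passes - 1)         # 9, 8, 8, ..., 8 frames
    assert sum(chunks) == num_frames              # original frame count back
    return num_latents, latents_to_decode, chunks

print(frame_schedule(81))
```

Running it with other 8N + 1 inputs (e.g. 49) follows the same pattern: pad by 4 frames, compute the latent count, drop the duplicate, then decode in chunks of 9, 8, 8, ...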
Hmm... you could understand it that way, but in practice the video frames are not generated one at a time; strictly speaking, this refers to latents.
I'm not entirely sure why you're encountering this issue on your end. We recommend first checking whether your diffusers version is up to date. If the problem persists,...
It's quite strange because we didn't seem to encounter this issue during multi-GPU training. We will attempt to reproduce it later. We recommend you switch to [cogkit](https://github.com/THUDM/CogKit) for training first,...
Yes, but this will result in lower generation quality. We recommend using longer descriptions.