CogVideo work plan and enhancement requests

Tasks that have been identified and scheduled:

Fine-tuning support for Diffusers version models
Adaptation for CPU / NPU inference frameworks (e.g., Huawei, Intel devices)
ComfyUI adaptation work and plugin support

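Tasks that have been identified and scheduled:

Adaptation of CogVideoX1.5 to the Diffusers version
Updates to ecosystem tools such as CogVideoX-Factory

If you have further requests, feel free to raise them here.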

Aug 28 '24 13:08 zRzRzRzRzRzRzR

#182, #191, #47, and #84 are similar: all of them ask for an open-source CogVideoX I2V model. We are conducting research and evaluation.

#111 and #186 are similar: both ask for VAE fine-tuning support. We will try to include it in the fine-tuning release, and it may also be adapted for Diffusers fine-tuning, but it will consume relatively high resources.

Aug 28 '24 13:08 zRzRzRzRzRzRzR

5b image to video please! I2V would be lovely!

Aug 28 '24 13:08 rookiemann

The 3D VAE model consumes significantly more memory compared to diffusion models, which is severely limiting the batch size for fine-tuning. Any suggestions or optimizations to reduce memory usage would be greatly appreciated.

Aug 30 '24 03:08 PR-Ryan

You make a very good point. We will work with the Diffusers team to modify the fake context parallel (fakecp) process in the VAE to reduce its memory usage. Please give us some time: we are collaborating with the Diffusers team on a version of the model adapted specifically for Diffusers fine-tuning, which is expected to save a significant amount of memory.

Aug 30 '24 06:08 zRzRzRzRzRzRzR

First of all, thank you for your excellent work!

Will the dataset format used for SAT fine-tuning and full training be the same as the format used for fine-tuning the Diffusers version models?

Also, the Discord link is wrong.

Sep 05 '24 08:09 KihongK

We are currently working on several tasks:

Adaptation work for the I2V model, expected to be open-sourced in September
Detailed tutorial on model fine-tuning, expected to be completed in September

Work that has been completed:

A model fine-tuned with SAT can be converted to a Diffusers model and loaded directly. For specific usage, see here
The new Discord invitation link will be merged into the main branch today

Sep 12 '24 02:09 zRzRzRzRzRzRzR

When will vertical video generation be supported?

Sep 16 '24 17:09 sincerity711

The current model cannot generate vertical videos, such as 480x720 resolution. We are working on fine-tuning to reach this capability, but it’s still in progress. Once we have any updates, we will share them as soon as possible.

Sep 17 '24 03:09 zRzRzRzRzRzRzR

Two related items are currently in progress:

Diffusers supports CogVideoX-I2V. This PR has been merged, but the patch has not yet been released.
Fine-tuning the Diffusers version of CogVideoX-2B T2V directly under the Diffusers framework, without using the SAT model. It can run on a single A100 GPU. This PR is still being debugged. I am working with members of the Diffusers team to attempt fine-tuning the CogVideoX-5B and I2V models as well. We will provide a small dataset (a few dozen samples) with this PR, which is sufficient for LoRA fine-tuning of CogVideoX.
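
As a rough illustration of why LoRA is so much lighter than full fine-tuning, the sketch below compares trainable parameter counts for a single linear layer. The layer width and rank are hypothetical values chosen for the arithmetic, not numbers taken from the PR.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Weights updated when the whole d_out x d_in matrix is trained."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights in the two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical layer width and LoRA rank, for illustration only.
D_MODEL = 3072
RANK = 64

full = full_params(D_MODEL, D_MODEL)
lora = lora_params(D_MODEL, D_MODEL, RANK)
print(f"full: {full}, lora: {lora}, reduction: {full // lora}x")
```

The reduction factor grows as the rank shrinks, so a lower rank trades capacity for fewer trainable weights.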

Many thanks to @a-r-r-o-w for the help with these two tasks!

Sep 17 '24 03:09 zRzRzRzRzRzRzR

When will CogVideoX-2B-I2V be released?

Sep 22 '24 15:09 JH-Xie

Support for various resolutions. Maybe RoPE plus resizing the training data to random resolutions would achieve this?
More ways to control the model than just a text prompt.

Sep 25 '24 03:09 SanGilbert

@zRzRzRzRzRzRzR Many thanks to you and the team! I know fine-tuning the VAE is not very useful, but I'm curious: is there any way to fine-tune just the decoder part?

Sep 28 '24 00:09 Florenyci

Our publicly available fine-tuning code covers the transformer part, not the VAE. We have indeed not released the VAE training and fine-tuning code (I have not received the corresponding permissions either). Additionally, fine-tuning the VAE alone seems to have little effect on overall model quality. If you want to try fine-tuning the Diffusers model, fine-tuning of the transformer module is already in development: it currently supports LoRA, and SFT is expected by early October.

Sep 28 '24 08:09 zRzRzRzRzRzRzR

@zRzRzRzRzRzRzR Thank you, but I'm actually asking how to fine-tune the VAE decoder. Any advice?

Sep 30 '24 23:09 Florenyci

Hi @zRzRzRzRzRzRzR, what is your plan for the Diffusers I2V LoRA fine-tuning code? Thanks!

Oct 01 '24 10:10 chenshuo20

@zRzRzRzRzRzRzR

Thank you for your great work!

I would like to convert a fully fine-tuned 2B model checkpoint from SAT format into Diffusers format. My fully fine-tuned SAT checkpoint is approximately 22 GB, so I cannot convert it with your conversion script: python ../tools/convert_weight_sat2hf.py. The checkpoint probably includes the transformer weights, the optimizer state, and so on. I therefore tried to extract only the transformer from the checkpoint, but something went wrong.

How can I do it?

Oct 08 '24 05:10 alfredplpl

I managed to convert the fully fine-tuned weights into Diffusers weights. The required keys need to be extracted with this script:

import torch

# Paths to file A, file B, and the output file C
file_a_path = 'file_a.pt'
file_b_path = 'file_b.pt'
file_c_path = 'file_c.pt'

# Load file A to get its keys
file_a = torch.load(file_a_path)

# Load file B
file_b = torch.load(file_b_path)

# Dictionary holding the data for the new file C
file_c_data = {}

# Recursively extract values from b_dict based on the keys of a_dict
def extract_values_from_b(a_dict, b_dict):
    result = {}
    for key, value in a_dict.items():
        if isinstance(value, dict):  # recurse if the value is itself a dict
            if key in b_dict and isinstance(b_dict[key], dict):
                result[key] = extract_values_from_b(value, b_dict[key])
            else:
                print(f"Key '{key}' not found or not a dict in file B")
        else:
            if key in b_dict:
                result[key] = b_dict[key]
            else:
                print(f"Key '{key}' not found in file B")
    return result

# Take values from file B following the structure of file A
file_c_data = extract_values_from_b(file_a, file_b)

# Save to the new file C
torch.save(file_c_data, file_c_path)

print(f"New .pt file saved at: {file_c_path}")
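
If no reference checkpoint with the desired key structure is at hand, a flat checkpoint could instead be filtered by key prefix. This is only a minimal sketch: the key names and the prefix below are hypothetical, plain integers stand in for tensors, and a real SAT checkpoint should be inspected first to see how its keys are actually named.

```python
def extract_by_prefix(state_dict, prefix, strip_prefix=True):
    """Keep only the entries whose key starts with `prefix`."""
    result = {}
    for key, value in state_dict.items():
        if key.startswith(prefix):
            new_key = key[len(prefix):] if strip_prefix else key
            result[new_key] = value
    return result


# Hypothetical flat checkpoint; integers stand in for tensors.
checkpoint = {
    "model.transformer.layer0.weight": 1,
    "model.transformer.layer0.bias": 2,
    "optimizer.state.step": 3,
}

weights = extract_by_prefix(checkpoint, "model.transformer.")
print(sorted(weights))
```

Because the function only looks at keys, the same code applies unchanged to a dictionary loaded with torch.load.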

Please feel free to ask me for further detail.

Oct 09 '24 04:10 alfredplpl

Fine-tuning scripts for video outpainting.

Oct 15 '24 06:10 dushwe

I want to use multiple GPUs, but it fails:

# 3. Enable CPU offload for the model.
# turn off if you have multiple GPUs or enough GPU memory(such as H100) and it will cost less time in inference
# and enable to("cuda")

pipe.to("cuda")

# pipe.enable_sequential_cpu_offload()

pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

$ python cli_demo.py --prompt "A girl riding a bike." --model_path THUDM/CogVideoX-5b --generate_type "t2v"
Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00, 1.13s/it]
Loading pipeline components...: 100%|██████████| 5/5 [00:06<00:00, 1.37s/it]
Traceback (most recent call last):
  File "/home/ubuntu/gamehub/CogVideo/inference/cli_demo.py", line 177, in <module>
    generate_video(
  File "/home/ubuntu/gamehub/CogVideo/inference/cli_demo.py", line 99, in generate_video
    pipe.to("cuda")
  File "/opt/conda/envs/cogvideo/lib/python3.10/site-packages/diffusers/pipelines/pipeline_utils.py", line 396, in to
    raise ValueError(
ValueError: It seems like you have activated sequential model offloading by calling enable_sequential_cpu_offload, but are now attempting to move the pipeline to GPU. This is not compatible with offloading. Please, move your pipeline .to('cpu') or consider removing the move altogether if you use sequential offloading.
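
The error message says that sequential offloading was already active when pipe.to("cuda") ran, so another enable_sequential_cpu_offload() call may still be reachable elsewhere in the script. One way to keep the two strategies mutually exclusive is to route placement through a single helper, sketched below. The helper name place_pipeline and the stub class are hypothetical and exist only so the example runs without Diffusers; the two methods it calls are the ones named in the traceback.

```python
class StubPipeline:
    """Hypothetical stand-in that records placement calls instead of moving weights."""

    def __init__(self):
        self.calls = []

    def enable_sequential_cpu_offload(self):
        self.calls.append("offload")

    def to(self, device):
        self.calls.append(f"to:{device}")
        return self


def place_pipeline(pipe, use_cpu_offload, device="cuda"):
    """Apply exactly one placement strategy, never both."""
    if use_cpu_offload:
        # Low-memory path: weights are moved to the GPU only while needed.
        pipe.enable_sequential_cpu_offload()
    else:
        # Fast path: the whole pipeline lives on the target device.
        pipe.to(device)
    return pipe


fast = place_pipeline(StubPipeline(), use_cpu_offload=False)
low_mem = place_pipeline(StubPipeline(), use_cpu_offload=True)
print(fast.calls, low_mem.calls)
```

With a real pipeline, the same helper would replace the separate pipe.to("cuda") and pipe.enable_sequential_cpu_offload() lines.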

Oct 19 '24 12:10 jumbo-q

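We are adapting the Diffusers version code for CogVideoX and expect to complete the open-source release around November 17. At that point, GPU memory consumption will drop substantially. Stay tuned.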

Nov 10 '24 09:11 zRzRzRzRzRzRzR

Big thanks for your work. The code and all weights work fine on 8GB VRAM. v1.5 is slow, but that is understandable given the high resolution.

Please consider releasing 480x720 resolution models for vertical videos, if possible. The processing time is fast with this resolution.

More users are requesting this and want to create vertical videos: https://github.com/THUDM/CogVideo/issues/521

Nov 19 '24 09:11 nitinmukesh

Requesting two separate CLI scripts for the v1 and v1.5 models. The differences in resolution and other settings are getting confusing, and separate inference files would help, for example cli_demo_v1.py and cli_demo_v1.5.py.

reference issue: https://github.com/THUDM/CogVideo/issues/517

Nov 19 '24 09:11 nitinmukesh

The brand-new fine-tuning work for the Diffusers version models will be completed in these two PRs: https://github.com/THUDM/CogVideo/pull/654 and https://github.com/THUDM/CogVideo/pull/642. Once they land, the Diffusers versions of the CogVideoX 1.0 and 1.5 models will have a more user-friendly and lower-cost LoRA and SFT fine-tuning solution than before.

Jan 12 '25 07:01 zRzRzRzRzRzRzR

The inversion function is important for some editing tasks. Will CogVideoX implement related interfaces?

Jan 17 '25 05:01 lisuyi

Same question!

Jan 20 '25 15:01 nini0919

@zRzRzRzRzRzRzR Are there still any plans, or any work done, for native ComfyUI node support? It would be greatly appreciated in the hobbyist space. 🙏

Jan 30 '25 01:01 Griphen116

The inversion function is important for some editing tasks. Will CogVideoX implement related interfaces?

This PR may address your concern. It does not currently support CogVideoX-2B, but it works normally with the 5B model.

Feb 25 '25 05:02 zRzRzRzRzRzRzR

@zRzRzRzRzRzRzR Are there still any plans, or any work done, for native ComfyUI node support? It would be greatly appreciated in the hobbyist space. 🙏

We currently do not have enough manpower for full ComfyUI development, but I have noticed similar work in the community that can provide this support.

Feb 25 '25 05:02 zRzRzRzRzRzRzR

Why is the Feishu technical documentation no longer available?

Mar 14 '25 11:03 Haodong-Lei-Ray

https://zhipu-ai.feishu.cn/wiki/DHCjw1TrJiTyeukfc9RceoSRnCh

Mar 24 '25 05:03 zRzRzRzRzRzRzR