Fine-tuning results

Open xvjiarui opened this issue 1 year ago • 11 comments

Hi Team,

Thanks for providing the diffusers fine-tuning script. I just tried it out, but the results look strange.

https://github.com/user-attachments/assets/5e1764fd-27d6-466a-a7ce-fb130c80b9c6

https://github.com/user-attachments/assets/9faf714c-67bb-4e85-9300-4f26a7cfc91c

Prompts:

  1. A black and white animated scene unfolds with an anthropomorphic goat surrounded by musical notes and symbols, suggesting a playful environment. Mickey Mouse appears, leaning forward in curiosity as the goat remains still. The goat then engages with Mickey, who bends down to converse or react. The dynamics shift as Mickey grabs the goat, potentially in surprise or playfulness, amidst a minimalistic background. The scene captures the evolving relationship between the two characters in a whimsical, animated setting, emphasizing their interactions and emotions.
  2. A domestic scene unfolds indoors, with a parrot on a stand and a mouse-like character standing next to it, amidst a domestic setting. A lamp is knocked over, causing a sudden change in lighting and affecting the mood. The scene shifts to a maritime setting, where a sailor-like character is shown in dynamic poses near ship's wheel controls and a bell, with a view of waves and distant land through a window.

I didn't modify the training script; I directly ran https://github.com/THUDM/CogVideo/blob/main/finetune/finetune_multi_rank.sh. Did I miss something?

Btw, may I ask whether it is possible to share your training log and validation videos? I want to make sure I am getting correct results.

xvjiarui avatar Sep 22 '24 02:09 xvjiarui

Here is the LoRA fine-tuning ablation experiment information, but it is in Chinese. I'm sorry that I haven't had time to translate it into English.

Ablation experiment: https://zhipu-ai.feishu.cn/wiki/OjIDwMEKniIby1kHQa4cMKibnhP

glide-the avatar Sep 22 '24 03:09 glide-the

Hi @glide-the

Thank you so much for your prompt reply. The documentation is very helpful!

So in your experiments, it seems LoRA-style fine-tuning didn't yield plausible results either. I am very interested in this problem as well. Would it be possible to share the dataset you are using? I truly appreciate your help and efforts on open-source models!

xvjiarui avatar Sep 22 '24 04:09 xvjiarui

@xvjiarui HF diffusers dataset: https://huggingface.co/datasets/Wild-Heart/Tom-and-Jerry-VideoGeneration-Dataset

wandb: https://wandb.ai/dmeck/cogv_2b_lora_tom_and_jerry_2_002

I used the SAT training framework to complete the LoRA fine-tuning. When using the diffusers framework, you need to pay attention to the following:

Here are a few key differences between the diffusers framework and our publicly released SAT fine-tuning code: LoRA weights have a rank parameter, with the 2B transformer model defaulting to a rank of 128 and the 5B transformer model defaulting to a rank of 256. The lora_scale is calculated as alpha / lora_r, where alpha is typically set to 1 during SAT training to ensure stability and prevent underflow. A higher rank offers better expressiveness, but it also demands more memory and results in longer training times.
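
A minimal sketch of how those defaults map onto the diffusers side (this is not the official script; the target module names are an assumption, so check the actual attention layer names in the CogVideoX transformer):

```python
# Sketch only: attaching a LoRA adapter to the CogVideoX transformer with peft,
# mirroring the SAT defaults described above (rank 128 for 2B, 256 for 5B, alpha = 1).
from diffusers import CogVideoXTransformer3DModel
from peft import LoraConfig

lora_r = 128                      # 256 for the 5B model
lora_alpha = 1                    # alpha = 1 in SAT training for stability / to avoid underflow
lora_scale = lora_alpha / lora_r  # effective scaling applied to the LoRA update

transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-2b", subfolder="transformer"
)

# target_modules is an assumption; inspect the transformer to confirm the layer names.
transformer.add_adapter(
    LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        target_modules=["to_q", "to_k", "to_v", "to_out.0"],
    )
)
print(f"effective lora_scale = {lora_scale}")
```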

glide-the avatar Sep 22 '24 09:09 glide-the

@zRzRzRzRzRzRzR can you create a fine-tuning notebook (with single and multi-GPU support) so that everything is in one place? The example Disney dataset would do. It would be great to have it as a notebook so we can test against the recommended setup.

Sometimes issues come up with the accelerate config: it doesn't use MULTI_GPU and switches to MULTI_CPU at the start. To avoid such issues, it would be great if you could create fine-tuning notebooks for CogVideoX 2B and 5B (full-parameter and LoRA) on the Disney dataset for starters.

GeeveGeorge avatar Sep 22 '24 12:09 GeeveGeorge

@zRzRzRzRzRzRzR why does the config say: main_process_ip: 10.250.128.19 and main_process_port: 12355?

GeeveGeorge avatar Sep 22 '24 13:09 GeeveGeorge

@zRzRzRzRzRzRzR I changed main_process_ip to localhost, but when I run with this config file on my multi-GPU Titan V system (3x Titan V with 12 GB each), it still switches to multi-CPU mode for some reason, which is why a notebook would be great.
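
In the meantime, one workaround worth trying (a sketch, not something from the CogVideo repo): launch the training function directly from Python with accelerate's notebook_launcher, which bypasses the saved accelerate config and its distributed_type.

```python
# Sketch: force a local multi-GPU launch without relying on `accelerate config`.
# `train_main` is a hypothetical stand-in for the training entry point; in practice
# you would import or wrap the main() of the fine-tuning script.
import torch
from accelerate import notebook_launcher

def train_main():
    ...  # fine-tuning code goes here

notebook_launcher(train_main, args=(), num_processes=torch.cuda.device_count())
```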

GeeveGeorge avatar Sep 22 '24 13:09 GeeveGeorge

Okay, we are currently preparing to promote the fine-tuning of the diffusers version; this code can run inference on a single GPU (though, for now, only a single A100).

zRzRzRzRzRzRzR avatar Sep 23 '24 10:09 zRzRzRzRzRzRzR

Hi @glide-the

Thank you so much again for the great resources. I checked out the dataset you uploaded. I noticed that some video clips have exactly the same text prompt. May I ask whether you used THUDM/cogvlm2-llama3-caption to caption these videos? Is it possible that the current implausible Tom & Jerry results are because of low-quality text captions?
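
For reference, a quick way to quantify that duplication (a sketch; the local path and the prompt.txt layout used by the fine-tuning scripts are assumptions, so adjust them to however you downloaded the dataset):

```python
# Count how many captions in the downloaded dataset are shared by more than one clip.
from collections import Counter
from pathlib import Path

prompt_file = Path("Tom-and-Jerry-VideoGeneration-Dataset/prompt.txt")  # assumed local path
prompts = [line.strip() for line in prompt_file.read_text(encoding="utf-8").splitlines() if line.strip()]

counts = Counter(prompts)
dupes = {p: n for p, n in counts.items() if n > 1}
print(f"{len(prompts)} clips, {len(dupes)} captions used by more than one clip")
```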

xvjiarui avatar Sep 23 '24 21:09 xvjiarui

@zRzRzRzRzRzRzR looking forward to it. Hope you also include multi-GPU support, and maybe a fine-tuning notebook (with some tips for preparing a custom dataset, like using the captioner, etc.). It would be great to have a step-by-step guide.

GeeveGeorge avatar Sep 24 '24 06:09 GeeveGeorge

Yes, we did.

We used the THUDM/cogvlm2-llama3-caption model for annotation, and of course, MiniCPM-V-2.6 is also very good. Although it’s a general-purpose model, the annotation results are still quite decent. If your hardware is limited, you might want to try using this model. Regarding the T&J dataset, I think it could serve as a good reference, although I didn’t create it. You can find it here: https://huggingface.co/datasets/Wild-Heart/Tom-and-Jerry-VideoGeneration-Dataset.
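
For anyone who wants to try the captioner, the loading pattern is roughly the following (a sketch; the video preprocessing and generation calls are model-specific, so follow the model card for those):

```python
# Sketch: load the captioning model in bfloat16 with trust_remote_code.
# The actual captioning step (frame sampling, prompt building, generate) is omitted
# here because it follows the model-specific code on the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "THUDM/cogvlm2-llama3-caption"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().to("cuda")
```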

zRzRzRzRzRzRzR avatar Sep 24 '24 12:09 zRzRzRzRzRzRzR

We expect to complete it by early October, including support for single GPU, multi-GPU, and multi-machine multi-GPU setups.

zRzRzRzRzRzRzR avatar Sep 24 '24 12:09 zRzRzRzRzRzRzR

Nice to have the topic of the "THUDM/cogvlm2-llama3-caption" model for annotation here. Can this model run on an A5000 (24 GB)?

wr0124 avatar Dec 13 '24 08:12 wr0124