
After full-parameter finetuning, the subject of the generated videos is spatially distorted

Open CacacaLalala opened this issue 1 year ago • 6 comments

Hello! This is a great open-source repo. I recently tried both LoRA and full-parameter finetuning, using the same 50 videos in both cases and finetuning for 500 iterations, with all other settings unchanged. I found that after full-parameter finetuning, the subject of the generated videos becomes severely distorted, while LoRA finetuning shows no such obvious distortion. Below are results for the same prompt, "spider making a web". After full-parameter finetuning:

https://github.com/user-attachments/assets/19f4f8bb-973c-4b42-9a0c-422c1af29af0

After LoRA finetuning:

https://github.com/user-attachments/assets/4ab01f8b-8e05-4fbc-b038-a2796c69adfa

I'm not sure what causes this. Could it be that the finetuning lr is too high? Looking forward to your reply, thanks!

CacacaLalala avatar Aug 12 '24 08:08 CacacaLalala

How much data did you use for finetuning? We recommend using around 100 similar videos. Also, did you use the default config, and could you share how the loss decreases?

zRzRzRzRzRzRzR avatar Aug 13 '24 05:08 zRzRzRzRzRzRzR

How much data did you use for finetuning? We recommend using around 100 similar videos. Also, did you use the default config, and could you share how the loss decreases?

Thanks for your reply! My goal is to continue training on other data starting from your released weights, so I randomly sampled 50 videos from my dataset first. Yes, it is the default config; the training_config is as follows:

```yaml
args:
  checkpoint_activations: true
  model_parallel_size: 1
  experiment_name: finetune-openvid-framesmin180-max500-origin-dataset
  mode: finetune
  load: CogVideoX-2b-sat/transformer
  no_load_rng: true
  train_iters: 10000
  eval_iters: 1
  eval_interval: 10000
  eval_batch_size: 1
  save: output
  save_interval: 100
  log_interval: 20
  train_data:
    - dataset/mini_dataset/cogvideo/videos
  valid_data:
    - dataset/mini_dataset/cogvideo/videos
  split: 1,0,0
  num_workers: 8
  force_train: true
  only_log_video_latents: true

data:
  target: data_video.SFTDataset
  params:
    video_size: [480, 720]
    fps: 8
    max_num_frames: 49
    skip_frms_num: 3.0

deepspeed:
  train_micro_batch_size_per_gpu: 1
  gradient_accumulation_steps: 1
  steps_per_print: 50
  gradient_clipping: 0.1
  zero_optimization:
    stage: 2
    cpu_offload: false
    contiguous_gradients: false
    overlap_comm: true
    reduce_scatter: true
    reduce_bucket_size: 1000000000
    allgather_bucket_size: 1000000000
    load_from_fp32_weights: false
  zero_allow_untested_optimizer: true
  bf16:
    enabled: false
  fp16:
    enabled: true
    loss_scale: 0
    loss_scale_window: 400
    hysteresis: 2
    min_loss_scale: 1
  optimizer:
    type: sat.ops.FusedEmaAdam
    params:
      lr: 0.0002
      betas: [0.9, 0.95]
      eps: 1.0e-08
      weight_decay: 0.0001
  activation_checkpointing:
    partition_activations: false
    contiguous_memory_optimization: false
  wall_clock_breakdown: false

model:
  scale_factor: 1.15258426
  disable_first_stage_autocast: true
  log_keys:
    - txt
  denoiser_config:
    target: sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser
    params:
      num_idx: 1000
      quantize_c_noise: false
      weighting_config:
        target: sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting
      scaling_config:
        target: sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling
      discretization_config:
        target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
        params:
          shift_scale: 3.0
  network_config:
    target: dit_video_concat.DiffusionTransformer
    params:
      time_embed_dim: 512
      elementwise_affine: true
      num_frames: 49
      time_compressed_rate: 4
      latent_width: 90
      latent_height: 60
      num_layers: 30
      patch_size: 2
      in_channels: 16
      out_channels: 16
      hidden_size: 1920
      adm_in_channels: 256
      num_attention_heads: 30
      transformer_args:
        checkpoint_activations: true
        vocab_size: 1
        max_sequence_length: 64
        layernorm_order: pre
        skip_init: false
        model_parallel_size: 1
        is_decoder: false
      modules:
        pos_embed_config:
          target: dit_video_concat.Basic3DPositionEmbeddingMixin
          params:
            text_length: 226
            height_interpolation: 1.875
            width_interpolation: 1.875
        patch_embed_config:
          target: dit_video_concat.ImagePatchEmbeddingMixin
          params:
            text_hidden_size: 4096
        adaln_layer_config:
          target: dit_video_concat.AdaLNMixin
          params:
            qk_ln: true
        final_layer_config:
          target: dit_video_concat.FinalLayerMixin
  conditioner_config:
    target: sgm.modules.GeneralConditioner
    params:
      emb_models:
        - is_trainable: false
          input_key: txt
          ucg_rate: 0.1
          target: sgm.modules.encoders.modules.FrozenT5Embedder
          params:
            model_dir: ckpts/cogvideo/t5-v1_1-xxl
            max_length: 226
  first_stage_config:
    target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
    params:
      cp_size: 1
      ckpt_path: CogVideoX-2b-sat/vae/3d-vae.pt
      ignore_keys:
        - loss
      loss_config:
        target: torch.nn.Identity
      regularizer_config:
        target: vae_modules.regularizers.DiagonalGaussianRegularizer
      encoder_config:
        target: vae_modules.cp_enc_dec.ContextParallelEncoder3D
        params:
          double_z: true
          z_channels: 16
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult: [1, 2, 2, 4]
          attn_resolutions: []
          num_res_blocks: 3
          dropout: 0.0
          gather_norm: true
      decoder_config:
        target: vae_modules.cp_enc_dec.ContextParallelDecoder3D
        params:
          double_z: true
          z_channels: 16
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult: [1, 2, 2, 4]
          attn_resolutions: []
          num_res_blocks: 3
          dropout: 0.0
          gather_norm: false
  loss_fn_config:
    target: sgm.modules.diffusionmodules.loss.VideoDiffusionLoss
    params:
      offset_noise_level: 0
      sigma_sampler_config:
        target: sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling
        params:
          uniform_sampling: true
          num_idx: 1000
          discretization_config:
            target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
            params:
              shift_scale: 3.0
  sampler_config:
    target: sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler
    params:
      num_steps: 50
      verbose: true
      discretization_config:
        target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
        params:
          shift_scale: 3.0
      guider_config:
        target: sgm.modules.diffusionmodules.guiders.DynamicCFG
        params:
          scale: 6
          exp: 5
          num_steps: 50
```

Sorry, I haven't made many modifications to the repo yet, so I haven't recorded the loss. Did you observe this kind of spatial distortion when doing full-parameter finetuning? After I lowered the learning rate the problem improved, but the distortion still gets worse and worse as training progresses. After 500 iterations:

https://github.com/user-attachments/assets/90ec5432-c226-4933-8c04-89a58df31e43

After 4000 iterations:

https://github.com/user-attachments/assets/73957b79-f1aa-4077-8e9d-9cab11e2da53

Looking forward to your reply!

CacacaLalala avatar Aug 13 '24 06:08 CacacaLalala

Yes, for LoRA, an lr of 1e-4 to 1e-3 is OK. But for full-parameter fine-tuning, an lr of around 1e-5 is OK. We will update the config files and fine-tuning instructions soon.
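For reference, a minimal sketch of what that change looks like against the SAT/DeepSpeed config posted above (all keys are copied from that config; only the lr value differs):

```yaml
deepspeed:
  optimizer:
    type: sat.ops.FusedEmaAdam
    params:
      lr: 1.0e-05        # was 0.0002 (2e-4); ~1e-5 is the suggestion for full-parameter fine-tuning
      betas: [0.9, 0.95]
      eps: 1.0e-08
      weight_decay: 0.0001
```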

tengjiayan20 avatar Aug 13 '24 09:08 tengjiayan20

Yes, for LoRA, an lr of 1e-4 to 1e-3 is OK. But for full-parameter fine-tuning, an lr of around 1e-5 is OK. We will update the config files and fine-tuning instructions soon.

Are there other factors besides the learning rate? The learning rate I am currently using is 1e-5, but as training progresses I still observe a gradual decline in spatial quality. Looking forward to your reply!

CacacaLalala avatar Aug 13 '24 09:08 CacacaLalala

Are there other factors besides the learning rate? The learning rate I am currently using is 1e-5, but as training progresses I still observe a gradual decline in spatial quality. Looking forward to your reply!

Is the prompt you use, "spider making a web", too different from your SFT training data? And what is the total batch size? Also, in theory, for a small dataset of 50 videos, too much training will make the model overfit the data, producing essentially identical videos.

tengjiayan20 avatar Aug 13 '24 09:08 tengjiayan20

Is the prompt you use, "spider making a web", too different from your SFT training data? And what is the total batch size? Also, in theory, for a small dataset of 50 videos, too much training will make the model overfit the data, producing essentially identical videos.

The total batch size is 24×2, and I have switched to a dataset of about 1,000,000 videos by changing the dataset part. Next, I'll wait for more iterations and test the training again. Thanks a lot!

CacacaLalala avatar Aug 13 '24 09:08 CacacaLalala

The 4000-step result above actually looks fairly normal. What exactly do you mean by the distortion you mentioned here?

GFENGG avatar Aug 16 '24 05:08 GFENGG

The 4000-step result above actually looks fairly normal. What exactly do you mean by the distortion you mentioned here?

The distortion I mentioned at first was that the spatial structure was somewhat implausible. I have now trained for a few more days, and after testing just now the results look normal. Thanks!

CacacaLalala avatar Aug 16 '24 07:08 CacacaLalala

The distortion I mentioned at first was that the spatial structure was somewhat implausible. I have now trained for a few more days, and after testing just now the results look normal. Thanks!

I'm also trying to finetune, so was the implausible-spatial-structure problem solved by lowering the learning rate plus training for longer?

GFENGG avatar Aug 16 '24 08:08 GFENGG

was the implausible-spatial-structure problem solved by lowering the learning rate plus training for longer?

So far, it seems so.

CacacaLalala avatar Aug 19 '24 07:08 CacacaLalala

Hey everyone! I have a few questions on finetuning that I would love if you could answer:

  • Is a dataset size of 50-100 videos okay for teaching the model a single concept? Can we go lower?
  • How many total training steps are required for convergence, assuming I have 50 videos and a training batch size of 1? Do we really need 4000+ steps?
  • What initialization works best for the LoRA layers? Is the default (A = kaiming_uniform, B = 0) the best? Can we use Gaussian or other initializations supported in libraries like peft?
  • Do we need the FusedEmaAdam implementation? Do we need EMA at all? Is plain torch.optim.Adam okay for training?
  • Even after a somewhat successful training run, results for prompts that the model was finetuned on are okay, but for any other prompt I get weird-looking, artifacted outputs
  • How much memory is required to finetune the 5B model? Is it possible on a single A100 GPU? If not, what can be optimized? I've tried VAE slicing and tiling, but it still OOMs even with a training batch size of 1.
  • Has anyone successfully trained a LoRA with a rank lower than 128 that produces good results?
  • What training batch size are you able to use comfortably on a single 80 GB GPU when finetuning the 2B model?
  • Any tips/techniques for speeding up training?

Thanks to everyone in advance! I might bother you with some more questions

a-r-r-o-w avatar Sep 03 '24 12:09 a-r-r-o-w

@a-r-r-o-w Have you got any answers? I'm also very curious about these.

rainbow979 avatar Nov 30 '24 13:11 rainbow979

Hey, yes I do! We worked together with Yuxuan from the CogVideoX team here: https://github.com/a-r-r-o-w/cogvideox-factory

  • 50+ videos is great for finetuning. I generally use ~200 for my experiments to have more diversity
  • 2500+ steps is usually enough for teaching a specific style. After speaking with others using cog-factory, it looks like 6000-20000 steps is good for teaching new characters/concepts. The longest finetune I know of is 40000 steps (but not public) on movie-like high-quality data for CogVideoX-Fun using a customized cog-factory script, which turned out very promising
  • Initialization does not seem to have much effect. The peft defaults are great
  • Any decent optimizer works well. AdamW is my go-to, but I have also tried the recent ADOPT, which works well too
  • The artifact issue was because of a bug in the Diffusers training scripts, which should have been addressed in cog-factory by now
  • We can finetune in less than 24 GB and with batch_size > 1 using TorchAO low-bit optimizers/model quantization + gradient offloading, or DeepSpeed!
  • Yes, LoRA with rank 32 and above works. You need to make sure that the LoRA alpha is at least half of the rank (for the diffusers scripts; I'm not sure about the recommendations in SAT, so you can open a separate issue if interested); see the LoRA config sketch after this list
  • For 80 GB, if using memory optimizations like precomputing latents/embeddings, optimizer state offloading, gradient checkpointing, and gradient offloading, you can go up to a batch size of 6-8 on a single GPU
  • Torch compile with dynamic shapes helps speed up training a bit. The cog-factory scripts have not been particularly profiled for improvements yet, so they could be slow. Precomputing latents/embeddings really helps with speed, since you load cached tensors directly without any further preprocessing and don't pay the overhead of computing the same embeddings every epoch. It also means you can drop the text encoder and VAE during training to save additional memory; a rough precomputation sketch follows below
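To make the rank/alpha point above concrete, here is a minimal sketch (not the actual cogvideox-factory training code) of attaching a rank-32 LoRA to the CogVideoX transformer through the diffusers/peft integration. The repo id and the target module names are assumptions for illustration:

```python
import torch
from diffusers import CogVideoXTransformer3DModel
from peft import LoraConfig

# Load only the transformer; the VAE and text encoder are not needed for LoRA injection.
transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-2b", subfolder="transformer", torch_dtype=torch.bfloat16
)
transformer.requires_grad_(False)  # train only the LoRA weights

lora_config = LoraConfig(
    r=32,                    # lower rank than the commonly used 128
    lora_alpha=32,           # keep alpha >= rank / 2 (here alpha == rank)
    init_lora_weights=True,  # peft default: A = kaiming_uniform, B = zeros
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections (assumed names)
)
transformer.add_adapter(lora_config)  # diffusers models expose add_adapter via the peft integration

trainable = sum(p.numel() for p in transformer.parameters() if p.requires_grad)
print(f"Trainable LoRA parameters: {trainable / 1e6:.1f}M")
```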

Let me know if I can help you with anything else!
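And a rough sketch, under assumptions, of the latent/embedding precomputation mentioned in the list above: encode each prompt with T5 and each video with the CogVideoX VAE once, cache the tensors, and drop both encoders from the training loop. The repo id, the max text length of 226 (taken from the SAT config earlier in this thread), and the expected video layout are assumptions:

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel
from diffusers import AutoencoderKLCogVideoX

device = "cuda"
tokenizer = AutoTokenizer.from_pretrained("THUDM/CogVideoX-2b", subfolder="tokenizer")
text_encoder = T5EncoderModel.from_pretrained(
    "THUDM/CogVideoX-2b", subfolder="text_encoder", torch_dtype=torch.bfloat16
).to(device)
vae = AutoencoderKLCogVideoX.from_pretrained(
    "THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.bfloat16
).to(device)

@torch.no_grad()
def precompute_sample(prompt: str, video: torch.Tensor, out_path: str) -> None:
    """video: [C, F, H, W] float tensor scaled to [-1, 1] (assumed preprocessing)."""
    tokens = tokenizer(
        prompt, padding="max_length", max_length=226, truncation=True, return_tensors="pt"
    ).to(device)
    prompt_embeds = text_encoder(tokens.input_ids)[0]     # [1, 226, 4096]
    latents = vae.encode(
        video.unsqueeze(0).to(device, torch.bfloat16)     # [1, C, F, H, W]
    ).latent_dist.sample()
    # Cache both tensors; the training loop then loads these instead of re-encoding every epoch.
    torch.save({"prompt_embeds": prompt_embeds.cpu(), "latents": latents.cpu()}, out_path)
```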

a-r-r-o-w avatar Nov 30 '24 13:11 a-r-r-o-w

Thanks a lot for replying! That's very helpful. I have one more question: so we don't need an EMA model for training?

rainbow979 avatar Dec 01 '24 17:12 rainbow979

I think there was a good recent paper showing that EMA is not particularly helpful for LoRA training; the results without it are not qualitatively very different. At least on small-scale runs (<10k steps in my tests) it's really hard to see any benefit, given the added memory requirement.

a-r-r-o-w avatar Dec 01 '24 18:12 a-r-r-o-w

  • The artifact issue was because of a bug in the Diffusers training scripts, which should have been addressed in cog-factory by now

@a-r-r-o-w Hi, can you tell us more about which bug was causing the problem?

crj1998 avatar Dec 05 '24 09:12 crj1998

@a-r-r-o-w Hi, can you tell us more about which bug was causing the problem? (You mentioned the artifact issue was due to a bug in the Diffusers training scripts that should have been addressed in cog-factory by now.)

crj1998 avatar Dec 11 '24 16:12 crj1998