CogVideo
Full fine-tuning config for CogVideoX-5B model
Feature request
Thanks for open-sourcing such an amazing model and codebase. Would it be possible to open-source the pre-training or full fine-tuning configs for the CogVideoX-5B model?
Motivation
For my task, I would like to do full fine-tuning instead of LoRA fine-tuning. However, when I commented out the LoRA module, I got an OOM error. It would be very helpful to the research community if the authors could share their configurations for pre-training and full fine-tuning.
Your contribution
N/A
I have the same problem with full fine-tuning of the 5B model. Have you solved it? @xvjiarui
I would like to know what specific issue you encountered. Can you provide the error log?
Hi @zRzRzRzRzRzRzR Thanks for your reply.
This is the model config I am using. I modified the LoRA config you provided. To enable full model training, I deleted not_trainable_prefixes and lora_config.
args:
  checkpoint_activations: True ## using gradient checkpointing
  model_parallel_size: 1
  # may need to set to pretrain if resuming from a pretraining checkpoint
  mode: finetune
  load: "checkpoints/CogVideoX-5b-sat/transformer"
  no_load_rng: True
  train_iters: 5000
  eval_iters: 1
  eval_interval: 10000
  eval_batch_size: 1
  save: output/train
  save_interval: 100
  log_interval: 1
  force_train: True
  only_log_video_latents: True
  lr_decay_style: "constant"
  resume: auto
deepspeed:
  train_micro_batch_size_per_gpu: 1
  gradient_accumulation_steps: 1
  steps_per_print: 50
  gradient_clipping: 0.1
  zero_optimization:
    stage: 2
    cpu_offload: false
    contiguous_gradients: false
    overlap_comm: true
    reduce_scatter: true
    reduce_bucket_size: 1000000000
    allgather_bucket_size: 1000000000
    load_from_fp32_weights: false
  zero_allow_untested_optimizer: true
  bf16:
    enabled: True
  fp16:
    enabled: False
  loss_scale: 0
  loss_scale_window: 400
  hysteresis: 2
  min_loss_scale: 1
  optimizer:
    type: sat.ops.FusedEmaAdam
    params:
      lr: 0.00001
      betas: [0.9, 0.95]
      eps: 1e-8
      weight_decay: 1e-4
  activation_checkpointing:
    partition_activations: false
    contiguous_memory_optimization: false
  wall_clock_breakdown: false
model:
  scale_factor: 0.7 # different from cogvideox_2b_infer.yaml
  disable_first_stage_autocast: true
  log_keys:
    - txt
  denoiser_config:
    target: sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser
    params:
      num_idx: 1000
      quantize_c_noise: False
      weighting_config:
        target: sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting
      scaling_config:
        target: sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling
      discretization_config:
        target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
        params:
          shift_scale: 1.0 # different from cogvideox_2b_infer.yaml
  network_config:
    target: dit_video_concat.DiffusionTransformer
    params:
      time_embed_dim: 512
      elementwise_affine: True
      num_frames: 49
      time_compressed_rate: 4
      latent_width: 90
      latent_height: 60
      num_layers: 42 # different from cogvideox_2b_infer.yaml
      patch_size: 2
      in_channels: 16
      out_channels: 16
      hidden_size: 3072 # different from cogvideox_2b_infer.yaml
      adm_in_channels: 256
      num_attention_heads: 48 # different from cogvideox_2b_infer.yaml
      transformer_args:
        checkpoint_activations: True
        vocab_size: 1
        max_sequence_length: 64
        layernorm_order: pre
        skip_init: false
        model_parallel_size: 1
        is_decoder: false
      modules:
        pos_embed_config:
          target: dit_video_concat.Rotary3DPositionEmbeddingMixin # different from cogvideox_2b_infer.yaml
          params:
            hidden_size_head: 64
            text_length: 226
        patch_embed_config:
          target: dit_video_concat.ImagePatchEmbeddingMixin
          params:
            text_hidden_size: 4096
        adaln_layer_config:
          target: dit_video_concat.AdaLNMixin
          params:
            qk_ln: True
        final_layer_config:
          target: dit_video_concat.FinalLayerMixin
  conditioner_config:
    target: sgm.modules.GeneralConditioner
    params:
      emb_models:
        - is_trainable: false
          input_key: txt
          ucg_rate: 0.1
          target: sgm.modules.encoders.modules.FrozenT5Embedder
          params:
            model_dir: "checkpoints/t5-v1_1-xxl"
            max_length: 226
  first_stage_config:
    target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
    params:
      cp_size: 1
      ckpt_path: "checkpoints/CogVideoX-5b-sat/vae/3d-vae.pt"
      ignore_keys: [ 'loss' ]
      loss_config:
        target: torch.nn.Identity
      regularizer_config:
        target: vae_modules.regularizers.DiagonalGaussianRegularizer
      encoder_config:
        target: vae_modules.cp_enc_dec.ContextParallelEncoder3D
        params:
          double_z: true
          z_channels: 16
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult: [ 1, 2, 2, 4 ]
          attn_resolutions: [ ]
          num_res_blocks: 3
          dropout: 0.0
          gather_norm: True
      decoder_config:
        target: vae_modules.cp_enc_dec.ContextParallelDecoder3D
        params:
          double_z: True
          z_channels: 16
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult: [ 1, 2, 2, 4 ]
          attn_resolutions: [ ]
          num_res_blocks: 3
          dropout: 0.0
          gather_norm: False
  loss_fn_config:
    target: sgm.modules.diffusionmodules.loss.VideoDiffusionLoss
    params:
      offset_noise_level: 0
      sigma_sampler_config:
        target: sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling
        params:
          uniform_sampling: True
          num_idx: 1000
          discretization_config:
            target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
            params:
              shift_scale: 1.0 # different from cogvideox_2b_infer.yaml
  sampler_config:
    # target: sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler
    target: sgm.modules.diffusionmodules.sampling.VPODEDPMPP2MSampler
    params:
      num_steps: 50
      verbose: True
      discretization_config:
        target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
        params:
          shift_scale: 1.0 # different from cogvideox_2b_infer.yaml
      guider_config:
        target: sgm.modules.diffusionmodules.guiders.DynamicCFG
        params:
          scale: 6
          exp: 5
          num_steps: 50
This gives a CUDA out-of-memory error on an A100.
[2024-09-04 10:39:53,486] [INFO] [checkpointing.py:787:non_reentrant_checkpoint] Activation Checkpointing Information
[2024-09-04 10:39:53,486] [INFO] [checkpointing.py:788:non_reentrant_checkpoint] ----Partition Activations False, CPU CHECKPOINTING False
[2024-09-04 10:39:53,486] [INFO] [checkpointing.py:789:non_reentrant_checkpoint] ----contiguous Memory Checkpointing False with None total layers
[2024-09-04 10:39:53,486] [INFO] [checkpointing.py:791:non_reentrant_checkpoint] ----Synchronization False
[2024-09-04 10:39:53,486] [INFO] [checkpointing.py:792:non_reentrant_checkpoint] ----Profiling time in checkpointing False
[rank0]: Traceback (most recent call last):
[rank0]: File "/lustre/fs2/portfolios/nvr/users/jiaruix/code/CogVideo/sat/train_video_oci.py", line 291, in <module>
[rank0]: main(parse_args())
[rank0]: File "/lustre/fs2/portfolios/nvr/users/jiaruix/code/CogVideo/sat/train_video_oci.py", line 280, in main
[rank0]: training_main(
[rank0]: File "/lustre/fs2/portfolios/nvr/users/jiaruix/code/CogVideo/sat/training/deepspeed_training.py", line 174, in training_main
[rank0]: iteration, skipped = train(model, optimizer,
[rank0]: File "/lustre/fs2/portfolios/nvr/users/jiaruix/code/CogVideo/sat/training/deepspeed_training.py", line 374, in train
[rank0]: lm_loss, skipped_iter, metrics = train_step(train_data_iterator,
[rank0]: File "/lustre/fs2/portfolios/nvr/users/jiaruix/code/CogVideo/sat/training/deepspeed_training.py", line 524, in train_step
[rank0]: model.step()
[rank0]: File "/lustre/fsw/portfolios/nvr/users/jiaruix/miniconda/envs/cogvideo/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2160, in step
[rank0]: self._take_model_step(lr_kwargs)
[rank0]: File "/lustre/fsw/portfolios/nvr/users/jiaruix/miniconda/envs/cogvideo/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2066, in _take_model_step
[rank0]: self.optimizer.step()
[rank0]: File "/lustre/fsw/portfolios/nvr/users/jiaruix/miniconda/envs/cogvideo/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1884, in step
[rank0]: int(self.partition_size[i])).to(self.single_partition_of_fp32_groups[i].dtype)
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.74 GiB. GPU 0 has a total capacity of 79.33 GiB of which 6.41 GiB is free. Including non-PyTorch memory, this process has 72.89 GiB memory in use. Of the allocated memory 61.18 GiB is allocated by PyTorch, and 10.38 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
E0904 10:40:06.176000 23456244159680 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 277177) of binary: /lustre/fsw/portfolios/nvr/users/jiaruix/miniconda/envs/cogvideo/bin/python
Traceback (most recent call last):
File "/lustre/fsw/portfolios/nvr/users/jiaruix/miniconda/envs/cogvideo/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
File "/lustre/fsw/portfolios/nvr/users/jiaruix/miniconda/envs/cogvideo/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/lustre/fsw/portfolios/nvr/users/jiaruix/miniconda/envs/cogvideo/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/lustre/fsw/portfolios/nvr/users/jiaruix/miniconda/envs/cogvideo/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/lustre/fsw/portfolios/nvr/users/jiaruix/miniconda/envs/cogvideo/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/lustre/fsw/portfolios/nvr/users/jiaruix/miniconda/envs/cogvideo/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
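A rough back-of-the-envelope check may help when reading the trace above. The failed allocation happens in the ZeRO stage 1/2 optimizer step while converting to the dtype of the fp32 partition, and 20.74 GiB is what a single fp32 tensor of roughly 5.57 billion elements would occupy. The sketch below assumes the common mixed-precision Adam accounting (2 bytes per parameter for bf16 weights, 2 for bf16 gradients, 12 for fp32 master weights plus two Adam moments); any extra state kept by sat.ops.FusedEmaAdam and all activation memory are not counted, and the function names are illustrative only.

```python
GIB = 1024 ** 3


def params_from_fp32_alloc(alloc_gib: float) -> float:
    """Number of fp32 elements (4 bytes each) in an allocation of `alloc_gib` GiB."""
    return alloc_gib * GIB / 4


def zero2_model_state_gib(num_params: float, num_gpus: int) -> float:
    """Estimated per-GPU model-state memory under ZeRO stage 2 (assumed accounting).

    bf16 weights are replicated on every GPU; bf16 gradients and the fp32
    optimizer states are partitioned across the data-parallel group.
    """
    replicated = 2 * num_params                       # bf16 weights
    partitioned = (2 + 12) * num_params / num_gpus    # grads + fp32 master/moments
    return (replicated + partitioned) / GIB


failed_alloc_gib = 20.74  # value reported in the OOM message
num_params = params_from_fp32_alloc(failed_alloc_gib)
print(f"implied parameter count: {num_params / 1e9:.2f}B")

for n in (1, 8, 16):
    print(f"{n:>2} GPU(s): ~{zero2_model_state_gib(num_params, n):.1f} GiB of model states per GPU")
```

Under these assumptions, a data-parallel size of 1 already puts the model states alone near 83 GiB, above the 79.33 GiB capacity in the log, before any activations are stored. More GPUs shrink only the partitioned part; the replicated weights and the activations stay on every card.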
It seems the issue lies in the frame rate settings. Have you tried fine-tuning with the unmodified config? It already requires 72 GB, so any increase in video length will definitely exceed the memory limit. The code does not use tensor parallelism (TP) or pipeline parallelism (PP), so it is already at its maximum capacity.
Hi @zRzRzRzRzRzRzR, I tried the LoRA config and it works. I didn't modify the frame rate setting or the number of frames.
Do you have any idea how to make full fine-tuning work in the SAT codebase? When you trained CogVideoX-5B, did you use any TP or PP? Does the released codebase support training with all transformer blocks unfrozen?
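For readers who want to reproduce the config edit described earlier in the thread (deleting not_trainable_prefixes and lora_config from the LoRA config), here is a minimal sketch that works on an already-parsed config dict. The helper name strip_lora, the example fragment, and its values are hypothetical placeholders; only the two key names come from the thread. The edit only unfreezes parameters and does nothing about the memory problem discussed here.

```python
import copy

# Keys that the thread says were deleted to switch from LoRA to full training.
LORA_ONLY_KEYS = {"not_trainable_prefixes", "lora_config"}


def strip_lora(node):
    """Return a copy of a parsed config with LoRA-only keys removed at any depth."""
    if isinstance(node, dict):
        return {k: strip_lora(v) for k, v in node.items() if k not in LORA_ONLY_KEYS}
    if isinstance(node, list):
        return [strip_lora(v) for v in node]
    return copy.deepcopy(node)


# Hypothetical fragment standing in for a parsed LoRA config (values are placeholders).
lora_cfg = {
    "model": {
        "not_trainable_prefixes": ["all"],
        "network_config": {
            "params": {
                "modules": {
                    "pos_embed_config": {"params": {"hidden_size_head": 64, "text_length": 226}},
                    "lora_config": {"params": {"r": 128}},
                }
            }
        },
    }
}

full_cfg = strip_lora(lora_cfg)
modules = full_cfg["model"]["network_config"]["params"]["modules"]
print(sorted(full_cfg["model"]))  # remaining top-level model keys
print(sorted(modules))            # remaining module configs
```

The function returns a new dict and leaves its input untouched, so the LoRA config can still be used for comparison runs.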
Same OOM error. I use the default cogvideox_5b.yaml that you provided.
Exception occurred: OutOfMemoryError (note: full exception trace is shown but execution is paused at: _run_module_as_main)
CUDA out of memory. Tried to allocate 20.74 GiB. GPU 0 has a total capacity of 79.44 GiB of which 3.03 GiB is free. Process 1692383 has 26.06 GiB memory in use. Including non-PyTorch memory, this process has 50.34 GiB memory in use. Of the allocated memory 49.70 GiB is allocated by PyTorch, and 47.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
File "/data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 394, in __init__
self.device).clone().float().detach()
File "/data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1507, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer(
File "/data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1249, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 306, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/deepspeed/__init__.py", line 181, in initialize
engine = DeepSpeedEngine(args=args,
File "/data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 239, in setup_model_untrainable_params_and_optimizer
model, optimizer, _, _ = deepspeed.initialize(
File "/data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/sat/training/deepspeed_training.py", line 130, in training_main
model, optimizer = setup_model_untrainable_params_and_optimizer(args, model)
File "/data3/cx_workspace/Proj_Emo+CogV/CogVideo/sat/train_video.py", line 226, in <module>
training_main(
File "/data1/anaconda3/envs/cogvideo/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/data1/anaconda3/envs/cogvideo/lib/python3.10/runpy.py", line 196, in _run_module_as_main (Current frame)
return _run_code(code, main_globals, None,
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.74 GiB. GPU 0 has a total capacity of 79.44 GiB of which 3.03 GiB is free. Process 1692383 has 26.06 GiB memory in use. Including non-PyTorch memory, this process has 50.34 GiB memory in use. Of the allocated memory 49.70 GiB is allocated by PyTorch, and 47.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
No, we are currently working on fine-tuning with the diffusers version, but this may take some time. Full fine-tuning with SAT requires multiple A100 machines and ZeRO-3 to bring memory usage down to 80 GB per card, which is still a significant overhead at the moment. We are trying to optimize this, for example by adding context parallelism (cp) to the encoder part, which the diffusers team is currently attempting. We are very grateful for their support.
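To make the ZeRO-3 remark more concrete, the sketch below estimates per-GPU model-state memory when weights, gradients, and optimizer states are all partitioned, and shows what a stage 3 override of the zero_optimization block could look like. It assumes about 5.57B trainable parameters and 16 bytes per parameter (bf16 weights and gradients plus fp32 master weights and two Adam moments). The option names are standard DeepSpeed settings, but whether SAT's training and checkpoint code runs unchanged under stage 3 is not confirmed anywhere in this thread.

```python
GIB = 1024 ** 3
NUM_PARAMS = 5.57e9  # assumption: approximate size of the 5B transformer


def zero3_model_state_gib(num_params: float, num_gpus: int) -> float:
    """Estimated per-GPU model-state memory when all 16 bytes/param are partitioned."""
    return 16 * num_params / num_gpus / GIB


# Hypothetical override for the `zero_optimization` block of the DeepSpeed config.
zero3_override = {
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    }
}

for n in (8, 16):
    print(f"{n:>2} GPUs: ~{zero3_model_state_gib(NUM_PARAMS, n):.1f} GiB of model states per GPU")
```

These figures cover model states only. Activation memory for 49-frame video latents is not partitioned by ZeRO and likely accounts for most of the remaining usage, which would explain why more cards are needed than the model-state numbers alone suggest.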
Just to clarify, using SAT:
1. Can we do full fine-tuning (using cogvideox_5b.yaml) on 8 80G A100s?
2. If not, can we do LoRA fine-tuning (using cogvideox_5b_lora.yaml) on 8 80G A100s?
Thanks!
1. 16 cards are enough for full-parameter training
2. 8 cards can be fine-tuned
Regarding "2. 8 cards can be fine-tuned": do you mean that 8 cards are enough for fine-tuning with LoRA?
yes
Try this framework: https://github.com/a-r-r-o-w/cogvideox-factory