CogVideo

OOM when training LoRA on cogvideox1.5-i2v-5b

Open. Vickeyhw opened this issue 5 months ago • 1 comment

System Info

Training runs out of memory (OOM) on 8 A800 40G GPUs. I am using train_ddp_i2v.sh with the following configuration:

`DATA_ARGS=(
    --data_root "videos_perpromptunion_0718"
    --caption_column "prompt.txt"
    --video_column "videos.txt"
    # --image_column "images.txt"  # commenting out this line uses the first frame of the video as image conditioning
    --train_resolution "81x768x1360"  # (frames x height x width), frames should be 8N+1
)
# Training Configuration
TRAIN_ARGS=(
    --train_epochs 2  # number of training epochs
    --seed 42  # random seed
    --batch_size 1
    --gradient_accumulation_steps 1
    --mixed_precision "bf16"  # ["no", "fp16"] # Only CogVideoX-2B supports fp16 training
)
# System Configuration
SYSTEM_ARGS=(
    --num_workers 8
    --pin_memory True
    --nccl_timeout 1800
)
# Checkpointing Configuration
CHECKPOINT_ARGS=(
    --checkpointing_steps 50  # save checkpoint every x steps
    --checkpointing_limit 2  # maximum number of checkpoints to keep, after which the oldest one is deleted
    # --resume_from_checkpoint "/absolute/path/to/checkpoint_dir"  # if you want to resume from a checkpoint, otherwise, comment this line
)
# Validation Configuration
VALIDATION_ARGS=(
    --do_validation false  # ["true", "false"]
    --validation_dir "/absolute/path/to/your/validation_set"
    --validation_steps 20  # should be multiple of checkpointing_steps
    --validation_prompts "prompts.txt"
    --validation_images "images.txt"
    --gen_fps 16
)`
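
The script comment on `--train_resolution` says the frame count should be of the form 8N+1. A minimal sketch of a pre-flight check for that rule is shown below; `check_resolution` is a hypothetical helper written for illustration and is not part of train_ddp_i2v.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of train_ddp_i2v.sh): checks a
# "frames x height x width" string such as the --train_resolution value.
check_resolution() {
    local res="$1" frames height width v
    IFS='x' read -r frames height width <<< "$res"
    # All three fields must be plain decimal integers.
    for v in "$frames" "$height" "$width"; do
        [[ "$v" =~ ^[0-9]+$ ]] || { echo "invalid resolution: $res"; return 1; }
    done
    # The script comment states that frames should be 8N+1.
    if (( (frames - 1) % 8 != 0 )); then
        echo "frames=$frames is not of the form 8N+1"
        return 1
    fi
    echo "ok: frames=$frames height=$height width=$width"
}

check_resolution "81x768x1360"
```

The value used in this issue passes the check, since 81 = 8 * 10 + 1, so the frame count itself is not the cause of the failure.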

Information

  • [x] The official example scripts
  • [ ] My own modified scripts

Reproduction

Image

Expected behavior

Would training with ZeRO solve this problem?
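
For context, ZeRO stage 2 partitions optimizer states and gradients across the data-parallel ranks instead of replicating them on every GPU. The sketch below writes a minimal DeepSpeed ZeRO-2 config that mirrors the batch size, gradient accumulation, and bf16 settings from this issue. The file name `ds_zero2.json` and the `DS_CONFIG` variable are assumptions made for illustration, and how the CogVideo training scripts consume such a file is not shown in this issue.

```shell
#!/usr/bin/env bash
# Hypothetical output path; override with DS_CONFIG if desired.
ds_config="${DS_CONFIG:-ds_zero2.json}"

# Minimal DeepSpeed config: ZeRO stage 2 with bf16, matching
# --batch_size 1 and --gradient_accumulation_steps 1 from the issue.
cat > "$ds_config" <<'EOF'
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 1
}
EOF
echo "wrote $ds_config"
```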

Vickeyhw, Jul 23 '25 02:07

@zRzRzRzRzRzRzR After switching to ZeRO-2, training runs on the 40G cards, but it is very slow, about 40 s/iter. Is that normal?
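
To put 40 s/iter in perspective, the arithmetic below estimates how long a run would take with the settings from this issue (8 GPUs, batch size 1, gradient accumulation 1, 2 epochs). The dataset size `num_videos` is a hypothetical value chosen only for illustration, and the estimate assumes each iteration consumes one sample per GPU.

```shell
#!/usr/bin/env bash
# Values taken from the issue.
num_gpus=8
batch_size=1
grad_accum=1
epochs=2
sec_per_iter=40

# Hypothetical dataset size, purely for illustration.
num_videos=1000

# Samples consumed per iteration across all data-parallel ranks.
samples_per_iter=$(( num_gpus * batch_size * grad_accum ))

# Total iterations, rounded up, and the resulting wall-clock time.
total_iters=$(( (num_videos * epochs + samples_per_iter - 1) / samples_per_iter ))
total_min=$(( total_iters * sec_per_iter / 60 ))

echo "samples/iter=$samples_per_iter iters=$total_iters minutes=$total_min"
```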

Vickeyhw, Jul 23 '25 02:07