
Training with gradient_accumulation_steps=16: training does not stop at the defined max_train_steps

a-l-e-x-d-s-9 opened this issue on Feb 7, 2023

Describe the bug

I'm training locally on a card with 8 GB of VRAM. To make training faster, I changed gradient_accumulation_steps from 1 to 16 in the script settings, and I also ran "accelerate config" and changed the value there. But now when I run the script with the following settings:

 accelerate launch --mixed_precision="fp16" train_dreambooth.py \
  --pretrained_model_name_or_path="$MODEL_NAME"  \
  --instance_data_dir="$INSTANCE_DIR" \
  --output_dir="$OUTPUT_DIR" \
  --instance_prompt="audra miller" \
  --resolution=512 \
  --train_batch_size=1 \
  --sample_batch_size=1 \
  --gradient_accumulation_steps=16 --gradient_checkpointing \
  --learning_rate=4e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=2800 \
  --save_interval 300 \
  --save_min_steps 1000

The script doesn't stop at 2800 steps; it kept running past 3600 steps and I had to terminate it manually. The checkpoints were generated fine.
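For context, here is a minimal, illustrative sketch of how an Accelerate-based training loop typically relates micro-batches, gradient accumulation, and max_train_steps. This is my own simplified example, not the actual train_dreambooth.py code; the model, optimizer, and dataloader are placeholders. The point is that the step counter checked against max_train_steps usually advances only when gradients are synchronized, i.e. once every gradient_accumulation_steps micro-batches, so the raw iteration count in the progress output can run far ahead of it.

```python
# Illustrative sketch only (assumed pattern, not the real train_dreambooth.py).
import torch
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=16)

# Placeholder model/data just to make the sketch runnable.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-6)
dataset = torch.utils.data.TensorDataset(torch.randn(64, 4), torch.randn(64, 1))
train_dataloader = torch.utils.data.DataLoader(dataset, batch_size=1)

model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader
)

max_train_steps = 10  # counted in optimizer (update) steps, not micro-batches
global_step = 0

while global_step < max_train_steps:
    for x, y in train_dataloader:
        with accelerator.accumulate(model):
            loss = torch.nn.functional.mse_loss(model(x), y)
            accelerator.backward(loss)
            optimizer.step()      # only applies an update when gradients sync
            optimizer.zero_grad()

        # global_step advances once per gradient_accumulation_steps micro-batches,
        # so the dataloader iteration count runs ~16x ahead of it here.
        if accelerator.sync_gradients:
            global_step += 1
            if global_step >= max_train_steps:
                break
```

In this sketch the loop does stop at max_train_steps update steps; in my run the real script does not, even though the checkpoints themselves look correct.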

Reproduction

Set gradient_accumulation_steps=16 and set max_train_steps to the number of training images multiplied by 100 (2800 here). Training does not stop at the defined number of steps.

Logs

Steps: : 3663it [44:35,  1.85it/s, loss=0.145, lr=4e-6][2023-02-07 12:41:03,231] [INFO] [timer.py:197:stop] 0/3664, RunningAvgSamplesPerSec=1.3890020784507544, CurrSamplesPerSec=0.2603498963207169, MemAllocated=1.67GB, MaxMemAllocated=4.91GB

System Info

  • diffusers version: 0.12.1
  • Platform: Linux-5.15.0-58-generic-x86_64-with-glibc2.35
  • Python version: 3.10.6
  • PyTorch version (GPU?): 1.13.1+cu117 (True)
  • Huggingface_hub version: 0.11.1
  • Transformers version: 0.16.0
  • Accelerate version: not installed
  • xFormers version: 0.0.16
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:
