diffusers
Training with gradient_accumulation_steps=16: training does not stop at the defined max_train_steps
Describe the bug
I'm training locally on an 8 GB VRAM card. To make training faster, I changed gradient_accumulation_steps from 1 to 16 in the settings, and I also ran "accelerate config" and changed the value there. But now when I run the script with these settings:
accelerate launch --mixed_precision="fp16" train_dreambooth.py \
--pretrained_model_name_or_path="$MODEL_NAME" \
--instance_data_dir="$INSTANCE_DIR" \
--output_dir="$OUTPUT_DIR" \
--instance_prompt="audra miller" \
--resolution=512 \
--train_batch_size=1 \
--sample_batch_size=1 \
--gradient_accumulation_steps=16 --gradient_checkpointing \
--learning_rate=4e-6 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=2800 \
--save_interval 300 \
--save_min_steps 1000
the script doesn't stop at 2800 steps. It continued past 3600 steps, and I had to terminate it manually. The checkpoints were generated fine.
Reproduction
Set gradient_accumulation_steps=16 and max_train_steps = number of training images multiplied by 100. Training does not stop at the defined number of steps.
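For context, here is a minimal sketch (the names and structure are my own illustration, not the actual train_dreambooth.py code) of the stopping logic I would expect with gradient accumulation: the global step should advance once per optimizer update, i.e. once every gradient_accumulation_steps micro-batches, and the loop should break on the optimizer-step count rather than on raw dataloader iterations. If the progress counter instead ticks per micro-batch, the displayed step count overshoots max_train_steps exactly as in the log below.

```python
# Illustrative sketch only -- not the real diffusers training loop.
# Shows a loop that breaks on optimizer steps, not micro-batches.

def run_training(num_batches_per_epoch, max_train_steps,
                 gradient_accumulation_steps):
    """Return (optimizer_steps, micro_batches_seen) for a loop that
    stops once the number of optimizer updates reaches max_train_steps."""
    global_step = 0       # counts optimizer updates, not micro-batches
    micro_batches = 0
    while True:
        for _batch in range(num_batches_per_epoch):
            micro_batches += 1
            # ... forward pass and loss.backward() would go here ...
            if micro_batches % gradient_accumulation_steps == 0:
                # optimizer.step(); optimizer.zero_grad()
                global_step += 1
            if global_step >= max_train_steps:
                return global_step, micro_batches

steps, seen = run_training(num_batches_per_epoch=28,
                           max_train_steps=2800,
                           gradient_accumulation_steps=16)
print(steps, seen)  # 2800 44800
```

With accumulation enabled, a progress bar driven by the dataloader iterator counts 16 times more ticks than a bar driven by optimizer steps, which is one plausible source of the mismatch reported here.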
Logs
Steps: : 3663it [44:35, 1.85it/s, loss=0.145, lr=4e-6][2023-02-07 12:41:03,231] [INFO] [timer.py:197:stop] 0/3664, RunningAvgSamplesPerSec=1.3890020784507544, CurrSamplesPerSec=0.2603498963207169, MemAllocated=1.67GB, MaxMemAllocated=4.91GB
System Info
- diffusers version: 0.12.1
- Platform: Linux-5.15.0-58-generic-x86_64-with-glibc2.35
- Python version: 3.10.6
- PyTorch version (GPU?): 1.13.1+cu117 (True)
- Huggingface_hub version: 0.11.1
- Transformers version: 0.16.0
- Accelerate version: not installed
- xFormers version: 0.0.16
- Using GPU in script?:
- Using distributed or parallel set-up in script?: