Wrong learning rate scheduler training step count for examples with multi-gpu when setting `--num_train_epochs`
Describe the bug
I think there are still some problems with the learning rate scheduler. Setting --max_train_steps resolves the issue discussed in #3954, but not completely.
For example, consider the code snippet at https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py#L816-L833, which I paste here:
# Scheduler and math around the number of training steps.
overrode_max_train_steps = False
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
if args.max_train_steps is None:
    args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
    overrode_max_train_steps = True

lr_scheduler = get_scheduler(
    args.lr_scheduler,
    optimizer=optimizer,
    num_warmup_steps=args.lr_warmup_steps * accelerator.num_processes,
    num_training_steps=args.max_train_steps * accelerator.num_processes,
)

# Prepare everything with our `accelerator`.
unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    unet, optimizer, train_dataloader, lr_scheduler
)
When setting --num_train_epochs instead of --max_train_steps, the calculation of num_update_steps_per_epoch is incorrect: at this point train_dataloader has not yet been wrapped by accelerator.prepare, so its length is the total number of batches rather than the per-process number. Consequently, args.max_train_steps ends up roughly num_processes times the actual value, and an unintended step count is passed into get_scheduler.
In fact, the logic here is quite confusing. It seems like a refactoring might be necessary.
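For reference, here is a minimal sketch of one possible reordering (my own illustration, not the actual fix in the repository): prepare the dataloader first, so that len(train_dataloader) already reflects the per-process shard, then derive args.max_train_steps and build the scheduler from that value. It assumes the same variables as the snippet above (accelerator, args, unet, optimizer, train_dataloader).

import math

# Sketch only: prepare the model, optimizer and dataloader first, so that
# len(train_dataloader) is the per-process (sharded) number of batches.
unet, optimizer, train_dataloader = accelerator.prepare(unet, optimizer, train_dataloader)

num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
if args.max_train_steps is None:
    args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch

# Now the scheduler is built from a step count that matches what the loop will run.
lr_scheduler = get_scheduler(
    args.lr_scheduler,
    optimizer=optimizer,
    num_warmup_steps=args.lr_warmup_steps * accelerator.num_processes,
    num_training_steps=args.max_train_steps * accelerator.num_processes,
)
lr_scheduler = accelerator.prepare(lr_scheduler)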
Reproduction
accelerate launch --mixed_precision="fp16" train_text_to_image.py \
...
- --max_train_steps=15000 \
+ --num_train_epochs=100 \
...
Logs
No response
System Info
- diffusers version: 0.27.2
- Platform: macOS-10.16-x86_64-i386-64bit
- Python version: 3.9.17
- PyTorch version (GPU?): 2.0.1 (False)
- Huggingface_hub version: 0.20.3
- Transformers version: 4.30.0
- Accelerate version: 0.21.0
- xFormers version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help?
@sayakpaul @yiyixuxu @eliphatfs
> Consequently, args.max_train_steps is roughly num_processes times the actual value.

Could you explain why you think this is the case?
You seem to have an idea of what you'd like the code block to look like. So, if you want to take a stab at a PR, I'm happy to review that too.
Let's work through a quick example with these assumptions:
- length of dataset: 8
- batch size: 1
- gradient accumulation steps: 1
- number of GPUs (num_processes): 2
- number of epochs (num_train_epochs): 1
- max_train_steps (not set): None
Before we call accelerator.prepare on the train_dataloader, it is built the same way as for standalone (single-process) training, not sharded for distributed training. So len(train_dataloader) = 8 ==> num_update_steps_per_epoch = 8 ==> args.max_train_steps = 8. However, we expect args.max_train_steps = 4, right?
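To make the arithmetic explicit, here is a small self-contained sketch of the two lengths. The post-prepare length uses the idealized even split across processes, which is what accelerate's default sharding produces for this toy case.

import math

# Toy numbers from the assumptions above.
dataset_len = 8
batch_size = 1
gradient_accumulation_steps = 1
num_processes = 2
num_train_epochs = 1

# Before accelerator.prepare: the dataloader iterates over the whole dataset.
len_before_prepare = math.ceil(dataset_len / batch_size)  # 8 batches
max_train_steps_before = num_train_epochs * math.ceil(
    len_before_prepare / gradient_accumulation_steps
)  # 8

# After accelerator.prepare: each process only sees its shard of the data.
len_after_prepare = math.ceil(dataset_len / (batch_size * num_processes))  # 4 batches
max_train_steps_after = num_train_epochs * math.ceil(
    len_after_prepare / gradient_accumulation_steps
)  # 4

print(max_train_steps_before, max_train_steps_after)  # 8 4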
Thank you. Would you be interested in a PR to fix this?
I would like to do some more detailed verification first. Also, I hope @eliphatfs can help confirm that the issues mentioned above are correct.
I tested that the scripts were working for step-based training. For epoch-based training I do think num_update_steps_per_epoch should be divided by the number of processes -- this value appears incorrect.
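If the current ordering is kept, the adjustment discussed here could look roughly like the following. This is a sketch against the script's existing variables, not necessarily the patch that will be merged.

import math

# Sketch: fold the data-parallel sharding into the epoch-based estimate, so that
# args.max_train_steps is no longer num_processes times too large.
num_update_steps_per_epoch = math.ceil(
    len(train_dataloader) / (args.gradient_accumulation_steps * accelerator.num_processes)
)
if args.max_train_steps is None:
    args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
    overrode_max_train_steps = True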
Cool. I think we're clear on the bug. I will open a PR to fix the issue as soon as possible.
The length of the dataloader should reveal the number of batches and not the number of samples. Are we doing that here?
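For reference, len() of a PyTorch DataLoader returns the number of batches, not the number of samples, which can be checked with a quick standalone snippet (hypothetical toy data):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(8))
dataloader = DataLoader(dataset, batch_size=2)

print(len(dataset))     # 8 samples
print(len(dataloader))  # 4 batches = ceil(8 / 2)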