lr_scheduler.step() is called once in my code, but in every process the lr_scheduler steps 4 times (when using 4 GPUs). Why?
accelerate config:
- `Accelerate` version: 0.16.0
- Platform: Linux-4.15.0-208-generic-x86_64-with-glibc2.27
- Python version: 3.9.16
- Numpy version: 1.23.5
- PyTorch version (GPU?): 1.10.0 (True)
- `Accelerate` config passed:
  - compute_environment: LOCAL_MACHINE
  - distributed_type: MULTI_GPU
  - mixed_precision: no
  - use_cpu: False
  - dynamo_backend: NO
  - num_processes: 4
  - machine_rank: 0
  - num_machines: 1
  - gpu_ids: 4,5,6,7
  - main_process_port: 29504
  - rdzv_backend: static
  - same_network: True
  - main_training_function: main
  - deepspeed_config: {}
  - fsdp_config: {}
  - megatron_lm_config: {}
  - downcast_bf16: no
code:

```python
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
from accelerate import Accelerator

accelerator = Accelerator()

class lambda1:
    """Lr multiplier that logs every scheduler tick with the process index."""
    def __call__(self, _step):
        # Encode the process index into the printed value so each tick can be
        # attributed to the process that triggered it.
        print('lr_scheduler:' + str(_step * 10 + accelerator.process_index) + '\n', flush=True)
        return _step * 10 + accelerator.process_index

model = nn.Linear(1, 1)
optim = AdamW(model.parameters(), lr=1)
lr_scheduler = LambdaLR(optimizer=optim, lr_lambda=lambda1())
model, optim, lr_scheduler = accelerator.prepare(model, optim, lr_scheduler)

# A single call in user code...
lr_scheduler.step()
# ...yet the output below shows four underlying ticks per process.
print('optim:' + str(optim.param_groups[0]['lr']) + '\n', flush=True)
```
output:

```
lr_scheduler:1
lr_scheduler:2
lr_scheduler:0
lr_scheduler:3
lr_scheduler:12
lr_scheduler:22
lr_scheduler:32
lr_scheduler:42
optim:42
lr_scheduler:13
lr_scheduler:23
lr_scheduler:33
lr_scheduler:43
optim:43
lr_scheduler:10
lr_scheduler:11
lr_scheduler:20
lr_scheduler:21
lr_scheduler:30
lr_scheduler:31
lr_scheduler:40
optim:40
lr_scheduler:41
optim:41
```
That's because with 4 GPUs your effective batch size is 4 times bigger, so the total number of training steps is 4 times smaller. To keep the schedule ending at the same point as in single-GPU training, the scheduler returned by `prepare` advances one tick per process, i.e. 4 ticks per `.step()` call.
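Conceptually, the wrapper returned by `prepare` behaves roughly like the sketch below. `ScaledScheduler` is an illustrative name, not Accelerate's actual class, and this is simplified: the real wrapper also skips ticks when the optimizer step was skipped under mixed precision.

```python
class ScaledScheduler:
    """Illustrative stand-in for the scheduler wrapper that prepare() returns."""

    def __init__(self, scheduler, num_processes):
        self.scheduler = scheduler
        self.num_processes = num_processes

    def step(self):
        # A single user-facing .step() advances the wrapped schedule once per
        # process, so a schedule built for N ticks is consumed in
        # N / num_processes training steps.
        for _ in range(self.num_processes):
            self.scheduler.step()
```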
I don't think that is great default behaviour. We already account for the learning rate and the batch size when setting up a distributed job; now we also need to multiply the number of steps by the number of processes before constructing the scheduler to make sure it behaves as we expect, while still keeping the benefit of the wrapper checking whether the optimizer step actually happened under mixed precision. A minimal sketch of that workaround is below.
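For example (a sketch, not the one true fix; `CosineAnnealingLR` and the step count here stand in for whatever scheduler and loop length you actually use):

```python
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from accelerate import Accelerator

accelerator = Accelerator()
model = nn.Linear(1, 1)
optim = AdamW(model.parameters(), lr=1e-3)

num_training_steps = 40_000  # iterations your own training loop will run

# Stretch the schedule by num_processes so the prepared scheduler, which
# advances num_processes ticks per .step() call, ends exactly at the last
# iteration instead of num_processes times too early.
lr_scheduler = CosineAnnealingLR(
    optim, T_max=num_training_steps * accelerator.num_processes
)
model, optim, lr_scheduler = accelerator.prepare(model, optim, lr_scheduler)
```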
If you account for everything yourself, then you don't need to use Accelerate :-)
Considering that the lr scheduler steps 4 times with 4 GPUs, it seems logical that the global step should also be updated 4 times, so that the total number of training steps is reduced by a factor of 4. However, in my practical experience with 4 GPUs, my cosine-scheduled learning rate reaches 0 at step 10000 while the total training steps are set to 40000 (exactly 40000 / 4). I wonder if I'm missing something in the accelerator settings, or whether it is necessary to adjust either the lr schedule's total steps or the number of global-step updates per pass to align the lr scheduler with the global step.
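For what it's worth, another option (an assumption worth verifying against your Accelerate version) is to create the `Accelerator` with `split_batches=True`. The prepared scheduler then advances only one tick per `.step()` call, at the cost of each prepared dataloader batch being split across processes instead of the effective batch size being multiplied:

```python
from accelerate import Accelerator

# With split_batches=True each dataloader batch is divided across the
# processes, so the number of optimizer steps per epoch is unchanged and
# prepare() leaves the scheduler stepping one tick per call.
accelerator = Accelerator(split_batches=True)
```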