
lr_scheduler.step() is called once in the code, but in every process the lr_scheduler steps 4 times (when using 4 GPUs) — why?

efsotr opened this issue 2 years ago · 4 comments

accelerate config:

- `Accelerate` version: 0.16.0
- Platform: Linux-4.15.0-208-generic-x86_64-with-glibc2.27
- Python version: 3.9.16
- Numpy version: 1.23.5
- PyTorch version (GPU?): 1.10.0 (True)
- `Accelerate` config passed:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: no
        - use_cpu: False
        - dynamo_backend: NO
        - num_processes: 4
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: 4,5,6,7
        - main_process_port: 29504
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {}
        - fsdp_config: {}
        - megatron_lm_config: {}
        - downcast_bf16: no

code:

from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
from accelerate import Accelerator

accelerator = Accelerator()

class lambda1:
    """lr multiplier that logs every scheduler step (per process)."""
    def __call__(self, _step):
        lr = _step * 10 + accelerator.process_index
        print('lr_scheduler:' + str(lr) + '\n', flush=True)
        return lr

model = nn.Linear(1, 1)
optim = AdamW(model.parameters(), lr=1)
lr_scheduler = LambdaLR(optimizer=optim, lr_lambda=lambda1())

model, optim, lr_scheduler = accelerator.prepare(model, optim, lr_scheduler)

lr_scheduler.step()  # stepped exactly once in user code

print('optim:' + str(optim.param_groups[0]['lr']) + '\n', flush=True)

output:

lr_scheduler:1
lr_scheduler:2
lr_scheduler:0
lr_scheduler:3
lr_scheduler:12
lr_scheduler:22
lr_scheduler:32
lr_scheduler:42
optim:42
lr_scheduler:13
lr_scheduler:23
lr_scheduler:33
lr_scheduler:43
optim:43
lr_scheduler:10
lr_scheduler:11
lr_scheduler:20
lr_scheduler:21
lr_scheduler:30
lr_scheduler:31
lr_scheduler:40
optim:40
lr_scheduler:41
optim:41

efsotr avatar May 08 '23 02:05 efsotr

That's because with 4 GPUs your effective batch size is 4 times bigger, so the total number of training steps is 4 times smaller; the prepared scheduler steps once per process to compensate.

sgugger avatar May 08 '23 11:05 sgugger
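The behaviour described above can be sketched without GPUs or accelerate installed: the scheduler returned by accelerator.prepare() advances the wrapped scheduler once per process for every step() call in user code. PreparedScheduler and ToyScheduler below are hypothetical stand-ins for illustration, not the real AcceleratedScheduler:

```python
num_processes = 4  # assumed world size, matching the 4-GPU config above

class ToyScheduler:
    """Stands in for LambdaLR with lr_lambda = lambda s: s * 10 (rank 0)."""
    def __init__(self):
        self.last_epoch = 0
        self.lr = 0

    def step(self):
        self.last_epoch += 1
        self.lr = self.last_epoch * 10

class PreparedScheduler:
    """One user-facing step() advances the inner scheduler num_processes times."""
    def __init__(self, scheduler, num_processes):
        self.scheduler = scheduler
        self.num_processes = num_processes

    def step(self):
        for _ in range(self.num_processes):
            self.scheduler.step()

sched = PreparedScheduler(ToyScheduler(), num_processes)
sched.step()  # one call in user code...
print(sched.scheduler.last_epoch)  # ...but the inner scheduler has moved 4 steps
print(sched.scheduler.lr)          # 40, matching process 0's "optim:40" output
```

This reproduces the reported output shape: one step() in user code, four scheduler advances per process.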

I don't think that is great default behaviour. We already account for the learning rate and the batch size when setting up a distributed job; now we also need to multiply the number of steps by the number of processes before constructing the scheduler to make it behave as we expect, while still getting the benefit of the check that the optimizer step actually happened in mixed precision.

Craigacp avatar May 19 '23 18:05 Craigacp

If you account for everything yourself, then you don't need to use Accelerate :-)

sgugger avatar May 19 '23 18:05 sgugger

Considering that the lr scheduler steps 4 times with 4 GPUs, it seems logical that the global step should also be updated 4 times, so that the total number of training steps is reduced by a factor of 4. However, in practice with 4 GPUs, I've noticed that my cosine-scheduled learning rate reaches 0 at step 10000, while the total number of training steps is set to 40000. I wonder if I'm missing something in the accelerator setting, or whether it's necessary to adjust either the lr schedule's total steps or the number of global-step updates per pass to align the lr scheduler with the global step.

lixirui142 avatar May 21 '23 09:05 lixirui142
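The numbers reported here are consistent with the scheduler advancing num_processes times per optimizer update: a cosine schedule built for 40000 steps is exhausted after 40000 / 4 = 10000 updates. A minimal arithmetic sketch (plain Python; the cosine formula is the standard decay to 0, and the names are illustrative, not Accelerate API):

```python
import math

num_processes = 4
total_steps = 40_000   # horizon the cosine schedule was built for

def cosine_lr(scheduler_step, total, base_lr=1.0):
    # Standard cosine decay from base_lr to 0, clamped at the horizon.
    s = min(scheduler_step, total)
    return base_lr * 0.5 * (1 + math.cos(math.pi * s / total))

# Each optimizer update advances the prepared scheduler num_processes times,
# so the schedule reaches 0 after total_steps / num_processes updates:
update_when_lr_hits_zero = total_steps // num_processes
print(update_when_lr_hits_zero)  # 10000, as observed

# Possible workaround: build the scheduler with num_processes times as many
# steps, so it decays over the intended number of optimizer updates:
scaled_total = total_steps * num_processes
print(cosine_lr(total_steps * num_processes, scaled_total))  # ~0.0 at the true end
```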

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jun 14 '23 15:06 github-actions[bot]