
Need to do distributed training using 2 separate machines

Open Sreyashi-Bhattacharjee opened this issue 2 years ago • 2 comments

Hello, I am trying to do distributed training using 2 separate machines. Can anyone please point me to a tutorial or demo for this? The configs created using `accelerate config` are:

Machine 1:

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_CPU
fsdp_config: {}
machine_rank: 0
main_process_ip: 20.160.27.77
main_process_port: 8080
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
use_cpu: true
```

Machine 2:

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_CPU
fsdp_config: {}
machine_rank: 1
main_process_ip: 20.160.27.77
main_process_port: 8080
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
use_cpu: true
```
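For reference, the usual multi-machine workflow with these configs is to run the same launch command on every machine; only the `machine_rank` in each saved config differs. A sketch, assuming the configs above were saved by `accelerate config` to the default location and `train.py` is a placeholder name for your training script:

```shell
# Run this on machine 1 (machine_rank: 0) AND on machine 2 (machine_rank: 1).
# Both processes must use the same main_process_ip / main_process_port so
# they can rendezvous; the rank 0 machine acts as the rendezvous host.
accelerate launch train.py
```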

Then I ran the launch command on both machines, but each started training separately. When I ran only the main machine, it again started training on its own. I am not able to find any concrete direction on this.
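One common cause of this symptom (each machine silently training on its own) is that the processes never rendezvous because `main_process_port` on the rank 0 machine is not reachable from the other machine (firewall, NAT, or nothing listening). A quick stdlib check can rule that out; `port_reachable` is a hypothetical helper, not part of Accelerate:

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Usage (from machine 2, with the values from the config above, while the
# rank 0 launcher is already running and listening):
#   port_reachable("20.160.27.77", 8080)
```

If this returns False while rank 0 is up, the machines cannot form a process group and each falls back to training alone.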

Sreyashi-Bhattacharjee avatar Dec 16 '22 07:12 Sreyashi-Bhattacharjee

@Sreyashi-Bhattacharjee this is currently unsupported for multi-CPU; changing this to a feature request and adding it to our timetable.

muellerzr avatar Jan 17 '23 15:01 muellerzr

Any update on this?

jav-ed avatar Jan 11 '24 11:01 jav-ed