accelerate
Need to do distributed training using 2 separate machines
Hello, I am trying to do distributed training using 2 separate machines. Can anyone point me to a tutorial or demo on this? The configs created with `accelerate config` are:
Machine 1:
```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_CPU
fsdp_config: {}
machine_rank: 0
main_process_ip: 20.160.27.77
main_process_port: 8080
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
use_cpu: true
```
Machine 2:
```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_CPU
fsdp_config: {}
machine_rank: 1
main_process_ip: 20.160.27.77
main_process_port: 8080
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
use_cpu: true
```
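For context, here is a minimal sketch of a training script compatible with these configs. The model, data, and hyperparameters are placeholders I've assumed for illustration, not the actual training code:

```python
# train.py -- minimal sketch; model/data/optimizer are placeholder assumptions.
import torch
from accelerate import Accelerator

def main():
    # Accelerator picks up the settings produced by `accelerate config`
    accelerator = Accelerator()

    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    dataloader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(torch.randn(64, 10), torch.randn(64, 1)),
        batch_size=8,
    )

    # prepare() wraps model, optimizer, and dataloader for the configured
    # distributed setup (device placement, data sharding across processes)
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    model.train()
    for inputs, targets in dataloader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        accelerator.backward(loss)  # replaces loss.backward()
        optimizer.step()

if __name__ == "__main__":
    main()
```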
Then I ran both launch commands, but the two machines started training separately. Then I ran only the main machine, which again started training on its own. I haven't been able to find any concrete direction on this.
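Concretely, the launch step on each machine was along these lines (config file names here are placeholder assumptions; each machine uses its own config):

```bash
# On machine 1 (machine_rank: 0, the main process):
accelerate launch --config_file machine1_config.yaml train.py

# On machine 2 (machine_rank: 1):
accelerate launch --config_file machine2_config.yaml train.py
```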
@Sreyashi-Bhattacharjee this is currently unsupported for multi-CPU; changing this to a feature request and adding it to our timetable.
Any update on this?