Can't use accelerate to launch two programs on one machine
The first program is running with config 1. When I launch the second program with config 2, it exits immediately without printing any error information.
config 1:
- `Accelerate` version: 0.16.0
- Platform: Linux-4.15.0-208-generic-x86_64-with-glibc2.27
- Python version: 3.9.16
- Numpy version: 1.23.5
- PyTorch version (GPU?): 1.10.0 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: no
- use_cpu: False
- dynamo_backend: NO
- num_processes: 4
- machine_rank: 0
- num_machines: 1
- gpu_ids: 0,1,2,3
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {}
- fsdp_config: {}
- megatron_lm_config: {}
- downcast_bf16: no
config 2:
- `Accelerate` version: 0.16.0
- Platform: Linux-4.15.0-208-generic-x86_64-with-glibc2.27
- Python version: 3.9.16
- Numpy version: 1.23.5
- PyTorch version (GPU?): 1.10.0 (True)
- `Accelerate` config passed:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: no
- use_cpu: False
- dynamo_backend: NO
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- gpu_ids: 6,7
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {}
- fsdp_config: {}
- megatron_lm_config: {}
- downcast_bf16: no
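For reference, the two programs are launched roughly like this (the script and config file names below are placeholders, not the actual ones):

accelerate launch --config_file config1.yaml train_job1.py
accelerate launch --config_file config2.yaml train_job2.py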
cc @muellerzr
@efsotr during my tests I'm able to have it all work properly; however, you'll need to specify a new port in your config to launch on, which is likely the source of your issue. For example, here are the two configs I used during my testing on a 4 GPU machine:
{
  "compute_environment": "LOCAL_MACHINE",
  "distributed_type": "MULTI_GPU",
  "downcast_bf16": false,
  "machine_rank": 0,
  "main_training_function": "main",
  "mixed_precision": "no",
  "num_machines": 1,
  "num_processes": 2,
  "rdzv_backend": "static",
  "same_network": false,
  "use_cpu": false,
  "gpu_ids": "0,1",
  "main_process_port": 29501
}
{
  "compute_environment": "LOCAL_MACHINE",
  "distributed_type": "MULTI_GPU",
  "downcast_bf16": false,
  "machine_rank": 0,
  "main_training_function": "main",
  "mixed_precision": "no",
  "num_machines": 1,
  "num_processes": 2,
  "rdzv_backend": "static",
  "same_network": false,
  "use_cpu": false,
  "gpu_ids": "2,3",
  "main_process_port": 29502
}
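With these saved as, say, job1.json and job2.json (the filenames are just examples), the two runs can be started side by side:

accelerate launch --config_file job1.json train.py
accelerate launch --config_file job2.json train.py

The same override can also be passed directly on the command line via the --main_process_port flag of accelerate launch, without touching the config files.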
Notice the addition of main_process_port here: torch distributed needs a separate rendezvous port for each job that is running, so two jobs launched with the default port will collide. Changing your config YAMLs to reflect this should solve the issue (a minimal sketch follows). Tested on both 0.16.0 and the latest 0.18.0. Please try this out and let me know if you're still facing issues! If you are, I'd also need to see the script you are running.
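In YAML form, the only change needed to your two existing configs is one extra key per file; the port numbers below are arbitrary, any two free, distinct ports will do:

# added to the first config
main_process_port: 29501

# added to the second config
main_process_port: 29502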
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.