
Can't use accelerate to launch two programs on one machine


The first program is already running with config 1 below. When I launch the second program with config 2, it exits immediately without printing any error message. (A sketch of the two launch commands follows the configs below.)

config 1:

- `Accelerate` version: 0.16.0
- Platform: Linux-4.15.0-208-generic-x86_64-with-glibc2.27
- Python version: 3.9.16
- Numpy version: 1.23.5
- PyTorch version (GPU?): 1.10.0 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: no
        - use_cpu: False
        - dynamo_backend: NO
        - num_processes: 4
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: 0,1,2,3
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {}
        - fsdp_config: {}
        - megatron_lm_config: {}
        - downcast_bf16: no

config 2:

- `Accelerate` version: 0.16.0
- Platform: Linux-4.15.0-208-generic-x86_64-with-glibc2.27
- Python version: 3.9.16
- Numpy version: 1.23.5
- PyTorch version (GPU?): 1.10.0 (True)
- `Accelerate` config passed:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: no
        - use_cpu: False
        - dynamo_backend: NO
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: 6,7
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {}
        - fsdp_config: {}
        - megatron_lm_config: {}
        - downcast_bf16: no
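For illustration, the two launches looked roughly like the following (the script names and config file paths here are placeholders, not the actual files used):

# terminal 1: keeps running with config 1
accelerate launch --config_file config1.yaml train_a.py

# terminal 2: exits immediately with config 2
accelerate launch --config_file config2.yaml train_b.py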

efsotr avatar Apr 29 '23 16:04 efsotr

cc @muellerzr

sgugger avatar May 01 '23 13:05 sgugger

@efsotr during my tests I was able to get both programs running properly, but you'll need to specify a different port for each launch in your config, which is likely the cause of your issue. For example, here are the two configs I used while testing on a 4-GPU machine:

{
  "compute_environment": "LOCAL_MACHINE",
  "distributed_type": "MULTI_GPU",
  "downcast_bf16": false,
  "machine_rank": 0,
  "main_training_function": "main",
  "mixed_precision": "no",
  "num_machines": 1,
  "num_processes": 2,
  "rdzv_backend": "static",
  "same_network": false,
  "use_cpu": false,
  "gpu_ids": "0,1",
  "main_process_port": 29501
}
{
  "compute_environment": "LOCAL_MACHINE",
  "distributed_type": "MULTI_GPU",
  "downcast_bf16": false,
  "machine_rank": 0,
  "main_training_function": "main",
  "mixed_precision": "no",
  "num_machines": 1,
  "num_processes": 2,
  "rdzv_backend": "static",
  "same_network": false,
  "use_cpu": false,
  "gpu_ids": "2,3",
  "main_process_port": 29502
}

Notice the addition of main_process_port here: torch.distributed needs a separate rendezvous port for each job you launch on the same machine. Updating your config files to reflect this should resolve the issue. I tested this on both 0.16.0 and the latest 0.18.0. Please try it out and let me know if you're still facing issues! If you are, I'd also need to see the script you are running.
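As a side note, the port can also be overridden per launch on the command line via the --main_process_port flag of accelerate launch, so the second job could be started along these lines (the script name and config path are placeholders):

accelerate launch --config_file config2.json --main_process_port 29502 train.py

In a YAML config the equivalent entry is main_process_port: 29502; either way, the key point is that each concurrent launch must use a distinct port.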

muellerzr avatar May 01 '23 14:05 muellerzr

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 30 '23 15:05 github-actions[bot]