
Error occurs when running `accelerate test` for multi-GPU training

DragonDRLI opened this issue 1 year ago

System Info

$ accelerate env

Copy-and-paste the text below in your GitHub issue

- `Accelerate` version: 0.18.0
- Platform: Linux-4.19.91-x86_64-with-debian-buster-sid
- Python version: 3.7.3
- Numpy version: 1.21.6
- PyTorch version (GPU?): 1.11.0+cu113 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: no
        - use_cpu: False
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: 0,1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

And the `accelerate config` answers are:
In which compute environment are you running? This machine
Which type of machine are you using? multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1                                                                                                                         
Do you wish to optimize your script with torch dynamo?[yes/NO]:NO                                                                                                                                                  
Do you want to use DeepSpeed? [yes/NO]: NO                                                                                                                                                                         
Do you want to use FullyShardedDataParallel? [yes/NO]: NO                                                                                                                                                          
Do you want to use Megatron-LM ? [yes/NO]: NO
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1
Do you wish to use FP16 or BF16 (mixed precision)? no

The logs are:
Running:  accelerate-launch /opt/anaconda3/lib/python3.7/site-packages/accelerate/test_utils/scripts/test_script.py
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: MULTI_GPU  Backend: nccl
stdout: Num processes: 2
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout: 
stdout: Mixed precision type: no
stdout: 
stdout: 
stdout: **Test process execution**
stdout: Distributed environment: MULTI_GPU  Backend: nccl
stdout: Num processes: 2
stdout: Process index: 1
stdout: Local process index: 1
stdout: Device: cuda:1
stdout: 
stdout: Mixed precision type: no
stdout: 
stderr: Traceback (most recent call last):
stderr:   File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/test_utils/scripts/test_script.py", line 408, in <module>
stderr:     main()
stderr:   File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/test_utils/scripts/test_script.py", line 381, in main
stderr:     process_execution_check()
stderr:   File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/test_utils/scripts/test_script.py", line 58, in process_execution_check
stderr:     idxs = accelerator.gather(idx)
stderr:   File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/accelerator.py", line 1815, in gather
stderr: Traceback (most recent call last):
stderr:   File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/test_utils/scripts/test_script.py", line 408, in <module>
stderr:     return gather(tensor)
stderr:   File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/utils/operations.py", line 228, in gather
stderr:     main()
stderr:   File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/test_utils/scripts/test_script.py", line 381, in main
stderr:     return _gpu_gather(tensor)
stderr:   File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/utils/operations.py", line 208, in _gpu_gather
stderr:     return recursively_apply(_gpu_gather_one, tensor, error_on_other_type=True)
stderr:   File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/utils/operations.py", line 101, in recursively_apply
stderr:     process_execution_check()
stderr:   File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/test_utils/scripts/test_script.py", line 58, in process_execution_check
stderr:     return func(data, *args, **kwargs)
stderr:   File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/utils/operations.py", line 205, in _gpu_gather_one
stderr:     idxs = accelerator.gather(idx)
stderr:   File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/accelerator.py", line 1815, in gather
stderr:     torch.distributed.all_gather(output_tensors, tensor)
stderr:   File "/opt/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2060, in all_gather
stderr:     return gather(tensor)
stderr: work = default_pg.allgather([tensor_list], [tensor])
stderr:   File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/utils/operations.py", line 228, in gather
stderr: RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:47, unhandled cuda error, NCCL version 21.0.3
stderr: ncclUnhandledCudaError: Call to CUDA function failed.
stderr:     return _gpu_gather(tensor)
stderr:   File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/utils/operations.py", line 208, in _gpu_gather
stderr:     return recursively_apply(_gpu_gather_one, tensor, error_on_other_type=True)
stderr:   File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/utils/operations.py", line 101, in recursively_apply
stderr:     return func(data, *args, **kwargs)
stderr:   File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/utils/operations.py", line 205, in _gpu_gather_one
stderr:     torch.distributed.all_gather(output_tensors, tensor)
stderr:   File "/opt/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2060, in all_gather
stderr:     work = default_pg.allgather([tensor_list], [tensor])
stderr: RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:47, unhandled cuda error, NCCL version 21.0.3
stderr: ncclUnhandledCudaError: Call to CUDA function failed.
stderr: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 227) of binary: /opt/anaconda3/bin/python
stderr: Traceback (most recent call last):
stderr:   File "/opt/anaconda3/bin/accelerate-launch", line 8, in <module>
stderr:     sys.exit(main())
stderr:   File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/commands/launch.py", line 929, in main
stderr:     launch_command(args)
stderr:   File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/commands/launch.py", line 914, in launch_command
stderr:     multi_gpu_launcher(args)
stderr:   File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/commands/launch.py", line 603, in multi_gpu_launcher
stderr:     distrib_run.run(args)
stderr:   File "/opt/anaconda3/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
stderr:     )(*cmd_args)
stderr:   File "/opt/anaconda3/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
stderr:     return launch_agent(self._config, self._entrypoint, list(args))
stderr:   File "/opt/anaconda3/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
stderr:     failures=result.failures,
stderr: torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
stderr: ============================================================
stderr: /opt/anaconda3/lib/python3.7/site-packages/accelerate/test_utils/scripts/test_script.py FAILED
stderr: ------------------------------------------------------------
stderr: Failures:
stderr: [1]:
stderr:   time      : 2023-05-24_10:51:51
stderr:   host      : v-dev-multi-11153949-86c58f645-87vgg
stderr:   rank      : 1 (local_rank: 1)
stderr:   exitcode  : 1 (pid: 228)
stderr:   error_file: <N/A>
stderr:   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: ------------------------------------------------------------
stderr: Root Cause (first observed failure):
stderr: [0]:
stderr:   time      : 2023-05-24_10:51:51
stderr:   host      : v-dev-multi-11153949-86c58f645-87vgg
stderr:   rank      : 0 (local_rank: 0)
stderr:   exitcode  : 1 (pid: 227)
stderr:   error_file: <N/A>
stderr:   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: ============================================================
Traceback (most recent call last):
  File "/opt/anaconda3/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/commands/test.py", line 54, in test_command
    result = execute_subprocess_async(cmd, env=os.environ.copy())
  File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/test_utils/testing.py", line 360, in execute_subprocess_async
    f"'{cmd_str}' failed with returncode {result.returncode}\n\n"
RuntimeError: 'accelerate-launch /opt/anaconda3/lib/python3.7/site-packages/accelerate/test_utils/scripts/test_script.py' failed with returncode 1

The combined stderr from workers follows:
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/test_utils/scripts/test_script.py", line 408, in <module>
    main()
  File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/test_utils/scripts/test_script.py", line 381, in main
    process_execution_check()
  File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/test_utils/scripts/test_script.py", line 58, in process_execution_check
    idxs = accelerator.gather(idx)
  File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/accelerator.py", line 1815, in gather
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/test_utils/scripts/test_script.py", line 408, in <module>
    return gather(tensor)
  File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/utils/operations.py", line 228, in gather
    main()
  File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/test_utils/scripts/test_script.py", line 381, in main
    return _gpu_gather(tensor)
  File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/utils/operations.py", line 208, in _gpu_gather
    return recursively_apply(_gpu_gather_one, tensor, error_on_other_type=True)
  File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/utils/operations.py", line 101, in recursively_apply
    process_execution_check()
  File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/test_utils/scripts/test_script.py", line 58, in process_execution_check
    return func(data, *args, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/utils/operations.py", line 205, in _gpu_gather_one
    idxs = accelerator.gather(idx)
  File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/accelerator.py", line 1815, in gather
    torch.distributed.all_gather(output_tensors, tensor)
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2060, in all_gather
    return gather(tensor)
work = default_pg.allgather([tensor_list], [tensor])
  File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/utils/operations.py", line 228, in gather
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:47, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
    return _gpu_gather(tensor)
  File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/utils/operations.py", line 208, in _gpu_gather
    return recursively_apply(_gpu_gather_one, tensor, error_on_other_type=True)
  File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/utils/operations.py", line 101, in recursively_apply
    return func(data, *args, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/utils/operations.py", line 205, in _gpu_gather_one
    torch.distributed.all_gather(output_tensors, tensor)
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2060, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:47, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 227) of binary: /opt/anaconda3/bin/python
Traceback (most recent call last):
  File "/opt/anaconda3/bin/accelerate-launch", line 8, in <module>
    sys.exit(main())
  File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/commands/launch.py", line 929, in main
    launch_command(args)
  File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/commands/launch.py", line 914, in launch_command
    multi_gpu_launcher(args)
  File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/commands/launch.py", line 603, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/opt/anaconda3/lib/python3.7/site-packages/accelerate/test_utils/scripts/test_script.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-05-24_10:51:51
  host      : v-dev-multi-11153949-86c58f645-87vgg
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 228)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-05-24_10:51:51
  host      : v-dev-multi-11153949-86c58f645-87vgg
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 227)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction

Download train_text_to_image.py from https://github.com/huggingface/diffusers/blob/716286f19ddd9eb417113e064b538706884c8e73/examples/text_to_image/train_text_to_image.py and run `accelerate launch train_text_to_image.py --args`.
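
For completeness, the piece of `accelerate test` that fails here is just a cross-GPU gather of each rank's process index. A minimal sketch of my own (not the exact test_script.py code) that exercises the same `accelerator.gather()` -> `torch.distributed.all_gather` path:

```python
# repro_gather.py -- minimal sketch (not the exact test_script.py) that hits the same
# accelerator.gather() -> torch.distributed.all_gather path shown in the traceback above.
# Launch with: accelerate launch repro_gather.py
import torch
from accelerate import Accelerator

accelerator = Accelerator()
# each rank contributes its own process index, much like process_execution_check() does
idx = torch.tensor([accelerator.process_index], device=accelerator.device)
idxs = accelerator.gather(idx)  # NCCL all_gather across the configured GPUs
accelerator.print("gathered process indices:", idxs)
```

If this short script already fails with the same ncclUnhandledCudaError, the problem is in the NCCL/CUDA setup rather than in the training script itself.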

Expected behavior

Training proceeds normally with a single GPU, but the error above occurs when two GPUs are used.

DragonDRLI avatar May 24 '23 09:05 DragonDRLI

What kind of GPU setup are you using?

muellerzr avatar May 24 '23 10:05 muellerzr

Thank you for the prompt response. The GPUs used are Tesla T4s.

DragonDRLI avatar May 24 '23 11:05 DragonDRLI

@DragonDRLI can you try specifying "gpu_ids" as "all" in your config?

Open ~/.cache/huggingface/accelerate/default_config.yaml (e.g. with vim) and set:

gpu_ids: all

(Notice no quotes)

muellerzr avatar May 24 '23 15:05 muellerzr
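
As a quick sanity check outside of Accelerate (a sketch of my own, not something the CLI prints), it can help to confirm that both T4s are visible to PyTorch and whether peer-to-peer access between them is reported, since the NCCL failure happens at the CUDA level:

```python
# Sanity-check sketch (not part of accelerate): list the GPUs PyTorch can see and
# report whether peer-to-peer access between GPU 0 and GPU 1 is available. NCCL can
# fall back to other transports, but the output is useful context for the error above.
import torch

print("visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
if torch.cuda.device_count() >= 2:
    print("peer access 0 <-> 1:", torch.cuda.can_device_access_peer(0, 1))
```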

Thank you for your prompt response! However, that doesn't solve the problem for me. Are there any other possible solutions? @muellerzr

DragonDRLI avatar May 25 '23 03:05 DragonDRLI

@DragonDRLI can you perhaps try upgrading your torch version? (Doubtful, but I'm having some trouble recreating this.)

E.g.: pip install light-the-torch; ltt install torch torchvision -U

muellerzr avatar May 26 '23 14:05 muellerzr

@muellerzr I am facing the same issue on one of my servers: it runs fine on one GPU, but the error occurs with multiple GPUs. Server details: NVIDIA-SMI 450.51.06, Driver Version: 450.51.06, CUDA Version: 11.0, A100.

On another server (another machine), Accelerate works fine with both multi-GPU and single GPU. Its details: NVIDIA-SMI 510.47.03, Driver Version: 510.47.03, CUDA Version: 11.6, Tesla V100.

The environment is the same on both servers.

System Info (output of `accelerate env`):

  • Accelerate version: 0.20.3
  • Platform: Linux-5.4.0-42-generic-x86_64-with-glibc2.27
  • Python version: 3.11.3
  • Numpy version: 1.24.3
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • PyTorch XPU available: False
  • System RAM: 1007.70 GB
  • GPU type: A100-SXM4-40GB
  • Accelerate default config:
    • compute_environment: LOCAL_MACHINE
    • distributed_type: MULTI_GPU
    • mixed_precision: no
    • use_cpu: False
    • num_processes: 2
    • machine_rank: 0
    • num_machines: 1
    • gpu_ids: 6,7
    • rdzv_backend: static
    • same_network: True
    • main_training_function: main
    • downcast_bf16: no
    • tpu_use_cluster: False
    • tpu_use_sudo: False
    • tpu_env: []

Config :

In which compute environment are you running? This machine
Which type of machine are you using? multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
Do you want to use DeepSpeed? [yes/NO]: NO
Do you want to use FullyShardedDataParallel? [yes/NO]: NO
Do you want to use Megatron-LM ? [yes/NO]: NO
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:4,5
Do you wish to use FP16 or BF16 (mixed precision)? no

Command : accelerate test Logs : Running: accelerate-launch /home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py stdout: Distributed environment: DistributedType.MULTI_GPU Backend: nccl stdout: Num processes: 2 stdout: Process index: 1 stdout: Local process index: 1 stdout: Device: cuda:1 stdout: stdout: Mixed precision type: no stdout: stdout: Initialization stdout: Testing, testing. 1, 2, 3. stdout: Distributed environment: DistributedType.MULTI_GPU Backend: nccl stdout: Num processes: 2 stdout: Process index: 0 stdout: Local process index: 0 stdout: Device: cuda:0 stdout: stdout: Mixed precision type: no stdout: stdout: stdout: Test process execution stderr: Traceback (most recent call last): stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py", line 553, in stderr: Traceback (most recent call last): stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py", line 553, in stderr: main() stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py", line 519, in main stderr: main() stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py", line 519, in main stderr: process_execution_check() stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py", line 65, in process_execution_check stderr: process_execution_check() stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py", line 65, in process_execution_check stderr: with accelerator.main_process_first(): stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/contextlib.py", line 144, in exit stderr: with accelerator.main_process_first(): stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/contextlib.py", line 137, in enter stderr: next(self.gen) stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/accelerator.py", line 788, in main_process_first stderr: return next(self.gen) stderr: ^^^^^^^^^^^^^^ stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/accelerator.py", line 788, in main_process_first stderr: with self.state.main_process_first(): stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/contextlib.py", line 144, in exit stderr: with self.state.main_process_first(): stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/contextlib.py", line 137, in enter stderr: next(self.gen) stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/state.py", line 864, in main_process_first stderr: return next(self.gen) stderr: ^^^^^^^^^^^^^^ stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/state.py", line 864, in main_process_first stderr: with PartialState().main_process_first(): stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/contextlib.py", line 144, in exit stderr: with 
PartialState().main_process_first(): stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/contextlib.py", line 137, in enter stderr: next(self.gen) stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/state.py", line 439, in main_process_first stderr: return next(self.gen) stderr: ^^^^^^^^^^^^^^ stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/state.py", line 439, in main_process_first stderr: yield from self._goes_first(self.is_main_process) stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/state.py", line 340, in _goes_first stderr: yield from self._goes_first(self.is_main_process) stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/state.py", line 335, in _goes_first stderr: self.wait_for_everyone() stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/state.py", line 329, in wait_for_everyone stderr: self.wait_for_everyone() stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/state.py", line 329, in wait_for_everyone stderr: torch.distributed.barrier() stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3328, in barrier stderr: torch.distributed.barrier() stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3328, in barrier stderr: work = default_pg.barrier(opts=opts) stderr: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ stderr: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3 stderr: ncclInternalError: Internal check failed. stderr: Last error: stderr: Cuda failure 'API call is not supported in the installed CUDA driver' stderr: work = default_pg.barrier(opts=opts) stderr: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ stderr: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3 stderr: ncclInternalError: Internal check failed. 
stderr: Last error: stderr: Cuda failure 'API call is not supported in the installed CUDA driver' stderr: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 261882) of binary: /home/student/2022/ai22mtech12005/miniconda3/envs/climb/bin/python stderr: Traceback (most recent call last): stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/bin/accelerate-launch", line 8, in stderr: sys.exit(main()) stderr: ^^^^^^ stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/commands/launch.py", line 947, in main stderr: launch_command(args) stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/commands/launch.py", line 932, in launch_command stderr: multi_gpu_launcher(args) stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/commands/launch.py", line 627, in multi_gpu_launcher stderr: distrib_run.run(args) stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run stderr: elastic_launch( stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in call stderr: return launch_agent(self._config, self._entrypoint, list(args)) stderr: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent stderr: raise ChildFailedError( stderr: torch.distributed.elastic.multiprocessing.errors.ChildFailedError: stderr: ============================================================ stderr: /home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py FAILED stderr: ------------------------------------------------------------ stderr: Failures: stderr: [1]: stderr: time : 2023-06-15_01:04:01 stderr: host : dgx-a100-02.cse.iith.ac.in stderr: rank : 1 (local_rank: 1) stderr: exitcode : 1 (pid: 261883) stderr: error_file: <N/A> stderr: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html stderr: ------------------------------------------------------------ stderr: Root Cause (first observed failure): stderr: [0]: stderr: time : 2023-06-15_01:04:01 stderr: host : dgx-a100-02.cse.iith.ac.in stderr: rank : 0 (local_rank: 0) stderr: exitcode : 1 (pid: 261882) stderr: error_file: <N/A> stderr: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html stderr: ============================================================ Traceback (most recent call last): File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/bin/accelerate", line 8, in sys.exit(main()) ^^^^^^ File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main args.func(args) File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/commands/test.py", line 54, in test_command result = execute_subprocess_async(cmd, env=os.environ.copy()) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/test_utils/testing.py", line 383, in execute_subprocess_async raise 
RuntimeError( RuntimeError: 'accelerate-launch /home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py' failed with returncode 1

The combined stderr from workers follows: Traceback (most recent call last): File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py", line 553, in Traceback (most recent call last): File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py", line 553, in main() File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py", line 519, in main main() File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py", line 519, in main process_execution_check() File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py", line 65, in process_execution_check process_execution_check() File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py", line 65, in process_execution_check with accelerator.main_process_first(): File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/contextlib.py", line 144, in exit with accelerator.main_process_first(): File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/contextlib.py", line 137, in enter next(self.gen) File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/accelerator.py", line 788, in main_process_first return next(self.gen) ^^^^^^^^^^^^^^ File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/accelerator.py", line 788, in main_process_first with self.state.main_process_first(): File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/contextlib.py", line 144, in exit with self.state.main_process_first(): File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/contextlib.py", line 137, in enter next(self.gen) File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/state.py", line 864, in main_process_first return next(self.gen) ^^^^^^^^^^^^^^ File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/state.py", line 864, in main_process_first with PartialState().main_process_first(): File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/contextlib.py", line 144, in exit with PartialState().main_process_first(): File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/contextlib.py", line 137, in enter next(self.gen) File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/state.py", line 439, in main_process_first return next(self.gen) ^^^^^^^^^^^^^^ File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/state.py", line 439, in main_process_first yield from self._goes_first(self.is_main_process) File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/state.py", line 340, in _goes_first yield from self._goes_first(self.is_main_process) File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/state.py", line 335, in _goes_first self.wait_for_everyone() File 
"/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/state.py", line 329, in wait_for_everyone self.wait_for_everyone() File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/state.py", line 329, in wait_for_everyone torch.distributed.barrier() File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3328, in barrier torch.distributed.barrier() File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3328, in barrier work = default_pg.barrier(opts=opts) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3 ncclInternalError: Internal check failed. Last error: Cuda failure 'API call is not supported in the installed CUDA driver' work = default_pg.barrier(opts=opts) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3 ncclInternalError: Internal check failed. Last error: Cuda failure 'API call is not supported in the installed CUDA driver' ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 261882) of binary: /home/student/2022/ai22mtech12005/miniconda3/envs/climb/bin/python Traceback (most recent call last): File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/bin/accelerate-launch", line 8, in sys.exit(main()) ^^^^^^ File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/commands/launch.py", line 947, in main launch_command(args) File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/commands/launch.py", line 932, in launch_command multi_gpu_launcher(args) File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/commands/launch.py", line 627, in multi_gpu_launcher distrib_run.run(args) File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run elastic_launch( File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in call return launch_agent(self._config, self._entrypoint, list(args)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py FAILED

Failures:
[1]:
  time      : 2023-06-15_01:04:01
  host      : dgx-a100
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 261883)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time      : 2023-06-15_01:04:01
  host      : dgx-a100
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 261882)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

LokeshJatangi avatar Jun 14 '23 19:06 LokeshJatangi

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jul 09 '23 15:07 github-actions[bot]

[LokeshJatangi's comment and logs above, quoted in full]

Hi there~ I encountered exactly the same problem. Have you solved it yet? If you happen to know a possible solution, please give me some hints. Thank you so much!

Lanxin1011 avatar Aug 29 '23 09:08 Lanxin1011

@Lanxin1011 I still haven't resolved the issue, but to my knowledge the solution is updating the NVIDIA driver version.

LokeshJatangi avatar Aug 29 '23 09:08 LokeshJatangi
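
That suggestion matches the error text on the failing A100 box ("Cuda failure 'API call is not supported in the installed CUDA driver'"): its driver reports CUDA 11.0 while the installed wheel is torch 2.0.1+cu117, which can leave newer CUDA APIs unavailable. A small version-comparison sketch of my own (not from the thread) to run on each server:

```python
# Version-comparison sketch (not from the thread): print the CUDA runtime and NCCL
# versions this torch build was compiled against, to compare with the driver's CUDA
# version reported by nvidia-smi (11.0 on the failing server, 11.6 on the working one).
import torch

print("torch version     :", torch.__version__)
print("built against CUDA:", torch.version.cuda)         # e.g. '11.7' for 2.0.1+cu117
print("bundled NCCL      :", torch.cuda.nccl.version())  # e.g. (2, 14, 3)
print("GPU 0             :", torch.cuda.get_device_name(0))
```

If the driver's CUDA version is older than what the torch build expects, upgrading the driver is the usual fix.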

I think I have a similar problem.

This is my environment:

  • Accelerate version: 0.27.2
  • Platform: Linux-5.4.0-96-generic-x86_64-with-glibc2.31
  • Python version: 3.9.5
  • Numpy version: 1.26.2
  • PyTorch version (GPU?): 2.2.0+cu121 (True)
  • PyTorch XPU available: False
  • PyTorch NPU available: False
  • System RAM: 1007.76 GB
  • GPU type: NVIDIA A40
  • Accelerate default config:
    • compute_environment: LOCAL_MACHINE
    • distributed_type: FSDP
    • mixed_precision: bf16
    • use_cpu: False
    • debug: False
    • num_processes: 3
    • machine_rank: 0
    • num_machines: 1
    • rdzv_backend: static
    • same_network: True
    • main_training_function: main
    • fsdp_config: {'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch': 'BACKWARD_PRE', 'fsdp_cpu_ram_efficient_loading': True, 'fsdp_forward_prefetch': False, 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 'FULL_SHARD', 'fsdp_state_dict_type': 'SHARDED_STATE_DICT', 'fsdp_sync_module_states': True, 'fsdp_transformer_layer_cls_to_wrap': 'LlamaDecoderLayer', 'fsdp_use_orig_params': True}
    • downcast_bf16: no
    • tpu_use_cluster: False
    • tpu_use_sudo: False
    • tpu_env:

and the `accelerate test` log is:

Running: accelerate-launch /usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py stderr: Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. stdout: Initialization stdout: Testing, testing. 1, 2, 3. stdout: Distributed environment: FSDP Backend: nccl stdout: Num processes: 3 stdout: Process index: 0 stdout: Local process index: 0 stdout: Device: cuda:0 stdout: stdout: Mixed precision type: bf16 stdout: stdout: Distributed environment: FSDP Backend: nccl stdout: Num processes: 3 stdout: Process index: 1 stdout: Local process index: 1 stdout: Device: cuda:1 stdout: stdout: Mixed precision type: bf16 stdout: stdout: Distributed environment: FSDP Backend: nccl stdout: Num processes: 3 stdout: Process index: 2 stdout: Local process index: 2 stdout: Device: cuda:2 stdout: stdout: Mixed precision type: bf16 stdout: stdout: stdout: Test process execution stderr: Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. stdout: stdout: Test split between processes as a list stdout: stdout: Test split between processes as a dict stdout: stdout: Test split between processes as a tensor stdout: stdout: Test random number generator synchronization stdout: All rng are properly synched. stdout: stdout: DataLoader integration test stdout: 1 0 2 tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, stdout: 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, stdout: 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, stdout: 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, stdout: 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, stdout: 90, 91, 92, 93, 94, 95], device='cuda:0') <class 'accelerate.data_loader.DataLoaderShard'> stdout: tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, stdout: 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, stdout: 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, stdout: 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, stdout: 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, stdout: 90, 91, 92, 93, 94, 95], device='cuda:2') <class 'accelerate.data_loader.DataLoaderShard'> stdout: tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, stdout: 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, stdout: 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, stdout: 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, stdout: 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, stdout: 90, 91, 92, 93, 94, 95], device='cuda:1') <class 'accelerate.data_loader.DataLoaderShard'> stderr: Traceback (most recent call last): stderr: File "/usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py", line 708, in stderr: Traceback (most recent call last): stderr: File "/usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py", line 708, in stderr: Traceback (most recent call last): stderr: File "/usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py", line 708, in stderr: main() stderr: File 
"/usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py", line 687, in main stderr: main() stderr: File "/usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py", line 687, in main stderr: dl_preparation_check() stderr: File "/usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py", line 196, in dl_preparation_check stderr: dl = prepare_data_loader( stderr: File "/usr/local/lib/python3.9/dist-packages/accelerate/data_loader.py", line 858, in prepare_data_loader stderr: dl_preparation_check() stderr: File "/usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py", line 196, in dl_preparation_check stderr: main() stderr: dl = prepare_data_loader( File "/usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py", line 687, in main stderr: stderr: File "/usr/local/lib/python3.9/dist-packages/accelerate/data_loader.py", line 858, in prepare_data_loader stderr: raise ValueError( stderr: ValueError: To use a DataLoader in split_batches mode, the batch size (8) needs to be a round multiple of the number of processes (3). stderr: dl_preparation_check() stderr: File "/usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py", line 196, in dl_preparation_check stderr: raise ValueError( stderr: ValueError: To use a DataLoader in split_batches mode, the batch size (8) needs to be a round multiple of the number of processes (3). stderr: dl = prepare_data_loader( stderr: File "/usr/local/lib/python3.9/dist-packages/accelerate/data_loader.py", line 858, in prepare_data_loader stderr: raise ValueError( stderr: ValueError: To use a DataLoader in split_batches mode, the batch size (8) needs to be a round multiple of the number of processes (3). 
stderr: [2024-03-14 13:39:21,081] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 27324) of binary: /usr/bin/python3.9 stderr: Traceback (most recent call last): stderr: File "/usr/local/bin/accelerate-launch", line 8, in stderr: sys.exit(main()) stderr: File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 1029, in main stderr: launch_command(args) stderr: File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 1010, in launch_command stderr: multi_gpu_launcher(args) stderr: File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 672, in multi_gpu_launcher stderr: distrib_run.run(args) stderr: File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 803, in run stderr: elastic_launch( stderr: File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 135, in call stderr: return launch_agent(self._config, self._entrypoint, list(args)) stderr: File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent stderr: raise ChildFailedError( stderr: torch.distributed.elastic.multiprocessing.errors.ChildFailedError: stderr: ============================================================ stderr: /usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py FAILED stderr: ------------------------------------------------------------ stderr: Failures: stderr: [1]: stderr: time : 2024-03-14_13:39:21 stderr: host : 919c8ff8c821 stderr: rank : 1 (local_rank: 1) stderr: exitcode : 1 (pid: 27325) stderr: error_file: <N/A> stderr: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html stderr: [2]: stderr: time : 2024-03-14_13:39:21 stderr: host : 919c8ff8c821 stderr: rank : 2 (local_rank: 2) stderr: exitcode : 1 (pid: 27326) stderr: error_file: <N/A> stderr: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html stderr: ------------------------------------------------------------ stderr: Root Cause (first observed failure): stderr: [0]: stderr: time : 2024-03-14_13:39:21 stderr: host : 919c8ff8c821 stderr: rank : 0 (local_rank: 0) stderr: exitcode : 1 (pid: 27324) stderr: error_file: <N/A> stderr: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html`

Can someone help me?

sungwo101 avatar Mar 14 '24 05:03 sungwo101
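
Note that this last failure is different from the NCCL errors earlier in the thread: `accelerate test` raises a ValueError because its DataLoader check uses a batch size of 8, which cannot be split evenly across 3 processes. A small illustration of the constraint stated in the error message (my own sketch, not accelerate's actual code):

```python
# Illustration of the condition described in the ValueError above (not accelerate's code):
# in split_batches mode, the global batch size must divide evenly across the processes.
batch_size = 8
num_processes = 3

if batch_size % num_processes != 0:
    raise ValueError(
        f"To use a DataLoader in split_batches mode, the batch size ({batch_size}) "
        f"needs to be a round multiple of the number of processes ({num_processes})."
    )
```

Running the test with 2 or 4 processes (so that 8 divides evenly) would satisfy this particular check.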