Error occurs when running `accelerate test` for multi-GPU training
System Info
$ accelerate env
Copy-and-paste the text below in your GitHub issue
- `Accelerate` version: 0.18.0
- Platform: Linux-4.19.91-x86_64-with-debian-buster-sid
- Python version: 3.7.3
- Numpy version: 1.21.6
- PyTorch version (GPU?): 1.11.0+cu113 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: no
- use_cpu: False
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- gpu_ids: 0,1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
And the accelerate config is:
In which compute environment are you running? This machine
Which type of machine are you using? multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
Do you want to use DeepSpeed? [yes/NO]: NO
Do you want to use FullyShardedDataParallel? [yes/NO]: NO
Do you want to use Megatron-LM ? [yes/NO]: NO
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1
Do you wish to use FP16 or BF16 (mixed precision)? no
The logs are:
Running: accelerate-launch /opt/anaconda3/lib/python3.7/site-packages/accelerate/test_utils/scripts/test_script.py
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: MULTI_GPU Backend: nccl
stdout: Num processes: 2
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout:
stdout: Mixed precision type: no
stdout:
stdout:
stdout: **Test process execution**
stdout: Distributed environment: MULTI_GPU Backend: nccl
stdout: Num processes: 2
stdout: Process index: 1
stdout: Local process index: 1
stdout: Device: cuda:1
stdout:
stdout: Mixed precision type: no
stdout:
stderr: Traceback (most recent call last):
stderr: File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/test_utils/scripts/test_script.py", line 408, in <module>
stderr: main()
stderr: File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/test_utils/scripts/test_script.py", line 381, in main
stderr: process_execution_check()
stderr: File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/test_utils/scripts/test_script.py", line 58, in process_execution_check
stderr: idxs = accelerator.gather(idx)
stderr: File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/accelerator.py", line 1815, in gather
stderr: Traceback (most recent call last):
stderr: File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/test_utils/scripts/test_script.py", line 408, in <module>
stderr: return gather(tensor)
stderr: File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/utils/operations.py", line 228, in gather
stderr: main()
stderr: File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/test_utils/scripts/test_script.py", line 381, in main
stderr: return _gpu_gather(tensor)
stderr: File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/utils/operations.py", line 208, in _gpu_gather
stderr: return recursively_apply(_gpu_gather_one, tensor, error_on_other_type=True)
stderr: File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/utils/operations.py", line 101, in recursively_apply
stderr: process_execution_check()
stderr: File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/test_utils/scripts/test_script.py", line 58, in process_execution_check
stderr: return func(data, *args, **kwargs)
stderr: File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/utils/operations.py", line 205, in _gpu_gather_one
stderr: idxs = accelerator.gather(idx)
stderr: File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/accelerator.py", line 1815, in gather
stderr: torch.distributed.all_gather(output_tensors, tensor)
stderr: File "/opt/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2060, in all_gather
stderr: return gather(tensor)
stderr: work = default_pg.allgather([tensor_list], [tensor])
stderr: File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/utils/operations.py", line 228, in gather
stderr: RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:47, unhandled cuda error, NCCL version 21.0.3
stderr: ncclUnhandledCudaError: Call to CUDA function failed.
stderr: return _gpu_gather(tensor)
stderr: File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/utils/operations.py", line 208, in _gpu_gather
stderr: return recursively_apply(_gpu_gather_one, tensor, error_on_other_type=True)
stderr: File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/utils/operations.py", line 101, in recursively_apply
stderr: return func(data, *args, **kwargs)
stderr: File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/utils/operations.py", line 205, in _gpu_gather_one
stderr: torch.distributed.all_gather(output_tensors, tensor)
stderr: File "/opt/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2060, in all_gather
stderr: work = default_pg.allgather([tensor_list], [tensor])
stderr: RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:47, unhandled cuda error, NCCL version 21.0.3
stderr: ncclUnhandledCudaError: Call to CUDA function failed.
stderr: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 227) of binary: /opt/anaconda3/bin/python
stderr: Traceback (most recent call last):
stderr: File "/opt/anaconda3/bin/accelerate-launch", line 8, in <module>
stderr: sys.exit(main())
stderr: File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/commands/launch.py", line 929, in main
stderr: launch_command(args)
stderr: File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/commands/launch.py", line 914, in launch_command
stderr: multi_gpu_launcher(args)
stderr: File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/commands/launch.py", line 603, in multi_gpu_launcher
stderr: distrib_run.run(args)
stderr: File "/opt/anaconda3/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
stderr: )(*cmd_args)
stderr: File "/opt/anaconda3/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
stderr: return launch_agent(self._config, self._entrypoint, list(args))
stderr: File "/opt/anaconda3/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
stderr: failures=result.failures,
stderr: torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
stderr: ============================================================
stderr: /opt/anaconda3/lib/python3.7/site-packages/accelerate/test_utils/scripts/test_script.py FAILED
stderr: ------------------------------------------------------------
stderr: Failures:
stderr: [1]:
stderr: time : 2023-05-24_10:51:51
stderr: host : v-dev-multi-11153949-86c58f645-87vgg
stderr: rank : 1 (local_rank: 1)
stderr: exitcode : 1 (pid: 228)
stderr: error_file: <N/A>
stderr: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: ------------------------------------------------------------
stderr: Root Cause (first observed failure):
stderr: [0]:
stderr: time : 2023-05-24_10:51:51
stderr: host : v-dev-multi-11153949-86c58f645-87vgg
stderr: rank : 0 (local_rank: 0)
stderr: exitcode : 1 (pid: 227)
stderr: error_file: <N/A>
stderr: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: ============================================================
Traceback (most recent call last):
File "/opt/anaconda3/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/commands/test.py", line 54, in test_command
result = execute_subprocess_async(cmd, env=os.environ.copy())
File "/opt/anaconda3/lib/python3.7/site-packages/accelerate/test_utils/testing.py", line 360, in execute_subprocess_async
f"'{cmd_str}' failed with returncode {result.returncode}\n\n"
RuntimeError: 'accelerate-launch /opt/anaconda3/lib/python3.7/site-packages/accelerate/test_utils/scripts/test_script.py' failed with returncode 1
The combined stderr from workers follows (identical to the worker stderr shown above).
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [X] My own task or dataset (give details below)
Reproduction
Download train_text_to_image.py from https://github.com/huggingface/diffusers/blob/716286f19ddd9eb417113e064b538706884c8e73/examples/text_to_image/train_text_to_image.py and run `accelerate launch train_text_to_image.py --args`.
Expected behavior
Training proceeds normally with a single GPU, but with two GPUs the error above occurs.
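The failing call in the traceback is `accelerator.gather(idx)`, which ends up in `torch.distributed.all_gather`. A minimal standalone check outside Accelerate (a sketch; the script name and the `torchrun` invocation are just suggestions) can help tell whether the NCCL error comes from the CUDA/NCCL stack rather than from Accelerate itself:

```python
# nccl_allgather_check.py - hypothetical standalone smoke test for the collective that fails above.
# Run with: torchrun --nproc_per_node=2 nccl_allgather_check.py
import os

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")      # same backend accelerate uses for MULTI_GPU
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)
    idx = torch.tensor([dist.get_rank()], device=f"cuda:{local_rank}")
    out = [torch.zeros_like(idx) for _ in range(dist.get_world_size())]
    dist.all_gather(out, idx)                    # mirrors the all_gather in the traceback
    print(f"rank {dist.get_rank()} gathered {[t.item() for t in out]}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If this fails with the same ncclUnhandledCudaError, the problem is likely in the driver/CUDA/NCCL setup rather than in Accelerate.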
What kind of GPU setup are you using?
Thank you for your prompt response! The GPU used is a Tesla T4.
@DragonDRLI can you try specifying "gpu_ids" as "all" in your config?
Check vim ~/.cache/huggingface/accelerate/default_config.yaml
and do:
gpu_ids: all
(Notice no quotes)
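For reference, a quick way to double-check what the config file actually contains after editing (a sketch; it assumes the default config path mentioned above and that PyYAML is installed):

```python
# Hypothetical helper to print the fields relevant to GPU selection from the default config.
import os

import yaml  # PyYAML

cfg_path = os.path.expanduser("~/.cache/huggingface/accelerate/default_config.yaml")
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

print("gpu_ids         :", cfg.get("gpu_ids"))
print("num_processes   :", cfg.get("num_processes"))
print("distributed_type:", cfg.get("distributed_type"))
```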
Thank you for your prompt response! However, that doesn't solve my problem. Are there any other possible solutions? @muellerzr
@DragonDRLI can you try perhaps upgrading your torch version? (Doubtful, but I'm having some issues recreating this.)
E.g.: pip install light-the-torch; ltt install torch torchvision -U
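Independent of the upgrade, it can help to record exactly which CUDA/NCCL stack PyTorch sees; a small sketch (nothing Accelerate-specific):

```python
# Print the versions and devices involved in the failing NCCL all_gather (a sketch).
import torch

print("torch        :", torch.__version__)
print("CUDA runtime :", torch.version.cuda)            # runtime torch was built against
print("cuDNN        :", torch.backends.cudnn.version())
print("NCCL         :", torch.cuda.nccl.version())
print("visible GPUs :", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} ->", torch.cuda.get_device_name(i))
if torch.cuda.device_count() >= 2:
    # Peer-access problems between the two training GPUs can also surface as NCCL errors.
    print("peer access 0<->1:", torch.cuda.can_device_access_peer(0, 1))
```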
@muellerzr I am also facing the same issue on one of my servers. On one GPU it runs fine, but with multiple GPUs the issue occurs. Server details: NVIDIA-SMI 450.51.06, Driver Version: 450.51.06, CUDA Version: 11.0, A100.
But on another server (another machine), Accelerate works fine with both multi-GPU and single GPU. Its details: NVIDIA-SMI 510.47.03, Driver Version: 510.47.03, CUDA Version: 11.6, Tesla V100.
The environment is the same on both servers.
System Info: Copy-and-paste the text below in your GitHub issue
- `Accelerate` version: 0.20.3
- Platform: Linux-5.4.0-42-generic-x86_64-with-glibc2.27
- Python version: 3.11.3
- Numpy version: 1.24.3
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- PyTorch XPU available: False
- System RAM: 1007.70 GB
- GPU type: A100-SXM4-40GB
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: no
- use_cpu: False
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- gpu_ids: 6,7
- rdzv_backend: static
- same_network: True
- main_training_function: main
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
Config:
In which compute environment are you running? This machine
Which type of machine are you using? multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
Do you want to use DeepSpeed? [yes/NO]: NO
Do you want to use FullyShardedDataParallel? [yes/NO]: NO
Do you want to use Megatron-LM ? [yes/NO]: NO
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:4,5
Do you wish to use FP16 or BF16 (mixed precision)? no
Command: `accelerate test`
Logs:
Running: accelerate-launch /home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py
stdout: Distributed environment: DistributedType.MULTI_GPU Backend: nccl
stdout: Num processes: 2
stdout: Process index: 1
stdout: Local process index: 1
stdout: Device: cuda:1
stdout:
stdout: Mixed precision type: no
stdout:
stdout: Initialization
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: DistributedType.MULTI_GPU Backend: nccl
stdout: Num processes: 2
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout:
stdout: Mixed precision type: no
stdout:
stdout:
stdout: Test process execution
stderr: Traceback (most recent call last):
stderr: File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py", line 553, in
The combined stderr from workers follows:
Traceback (most recent call last):
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py", line 553, in
Traceback (most recent call last):
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py", line 553, in
main()
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py", line 519, in main
main()
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py", line 519, in main
process_execution_check()
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py", line 65, in process_execution_check
process_execution_check()
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py", line 65, in process_execution_check
with accelerator.main_process_first():
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/contextlib.py", line 144, in exit
with accelerator.main_process_first():
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/contextlib.py", line 137, in enter
next(self.gen)
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/accelerator.py", line 788, in main_process_first
return next(self.gen)
^^^^^^^^^^^^^^
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/accelerator.py", line 788, in main_process_first
with self.state.main_process_first():
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/contextlib.py", line 144, in exit
with self.state.main_process_first():
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/contextlib.py", line 137, in enter
next(self.gen)
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/state.py", line 864, in main_process_first
return next(self.gen)
^^^^^^^^^^^^^^
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/state.py", line 864, in main_process_first
with PartialState().main_process_first():
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/contextlib.py", line 144, in exit
with PartialState().main_process_first():
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/contextlib.py", line 137, in enter
next(self.gen)
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/state.py", line 439, in main_process_first
return next(self.gen)
^^^^^^^^^^^^^^
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/state.py", line 439, in main_process_first
yield from self._goes_first(self.is_main_process)
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/state.py", line 340, in _goes_first
yield from self._goes_first(self.is_main_process)
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/state.py", line 335, in _goes_first
self.wait_for_everyone()
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/state.py", line 329, in wait_for_everyone
self.wait_for_everyone()
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/state.py", line 329, in wait_for_everyone
torch.distributed.barrier()
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3328, in barrier
torch.distributed.barrier()
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3328, in barrier
work = default_pg.barrier(opts=opts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Cuda failure 'API call is not supported in the installed CUDA driver'
work = default_pg.barrier(opts=opts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Cuda failure 'API call is not supported in the installed CUDA driver'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 261882) of binary: /home/student/2022/ai22mtech12005/miniconda3/envs/climb/bin/python
Traceback (most recent call last):
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/bin/accelerate-launch", line 8, in
sys.exit(main())
^^^^^^
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/commands/launch.py", line 947, in main
launch_command(args)
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/commands/launch.py", line 932, in launch_command
multi_gpu_launcher(args)
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/commands/launch.py", line 627, in multi_gpu_launcher
distrib_run.run(args)
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/home/student/2022/ai22mtech12005/miniconda3/envs/climb/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py FAILED
Failures:
[1]:
time : 2023-06-15_01:04:01
host : dgx-a100
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 261883)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2023-06-15_01:04:01
host : dgx-a100
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 261882)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
> @muellerzr Even I am facing the same issue on one of my servers. […] (full report quoted above)
Hi there~ I encountered exactly the same problem. Have you solved it yet? If you happen to know a possible solution, please give me some hints. Thank you so much!
@Lanxin1011 I still haven't resolved the issue. But to my knowledge, the solution is updating the NVIDIA driver version.
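For context, the `Cuda failure 'API call is not supported in the installed CUDA driver'` message above usually points at a driver that is older than the CUDA runtime bundled with PyTorch. A small sketch to compare the two (it assumes `nvidia-smi` is on the PATH):

```python
# Compare the installed NVIDIA driver with the CUDA runtime torch was built against (a sketch).
import subprocess

import torch

driver = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    text=True,
).strip().splitlines()[0]

print("NVIDIA driver      :", driver)
print("torch CUDA runtime :", torch.version.cuda)   # e.g. 11.7 for 2.0.1+cu117
print("torch NCCL version :", torch.cuda.nccl.version())
```

If the driver predates what the installed CUDA wheels expect, updating the driver (as suggested above) or installing a PyTorch build that matches the older driver are the usual ways out.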
I think I have a similar problem.
This is my env:
- `Accelerate` version: 0.27.2
- Platform: Linux-5.4.0-96-generic-x86_64-with-glibc2.31
- Python version: 3.9.5
- Numpy version: 1.26.2
- PyTorch version (GPU?): 2.2.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 1007.76 GB
- GPU type: NVIDIA A40
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: FSDP
- mixed_precision: bf16
- use_cpu: False
- debug: False
- num_processes: 3
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- fsdp_config: {'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch': 'BACKWARD_PRE', 'fsdp_cpu_ram_efficient_loading': True, 'fsdp_forward_prefetch': False, 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 'FULL_SHARD', 'fsdp_state_dict_type': 'SHARDED_STATE_DICT', 'fsdp_sync_module_states': True, 'fsdp_transformer_layer_cls_to_wrap': 'LlamaDecoderLayer', 'fsdp_use_orig_params': True}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env:
and the `accelerate test` log is:
Running: accelerate-launch /usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py
stderr: Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
stdout: Initialization
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: FSDP Backend: nccl
stdout: Num processes: 3
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout:
stdout: Mixed precision type: bf16
stdout:
stdout: Distributed environment: FSDP Backend: nccl
stdout: Num processes: 3
stdout: Process index: 1
stdout: Local process index: 1
stdout: Device: cuda:1
stdout:
stdout: Mixed precision type: bf16
stdout:
stdout: Distributed environment: FSDP Backend: nccl
stdout: Num processes: 3
stdout: Process index: 2
stdout: Local process index: 2
stdout: Device: cuda:2
stdout:
stdout: Mixed precision type: bf16
stdout:
stdout:
stdout: Test process execution
stderr: Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
stdout:
stdout: Test split between processes as a list
stdout:
stdout: Test split between processes as a dict
stdout:
stdout: Test split between processes as a tensor
stdout:
stdout: Test random number generator synchronization
stdout: All rng are properly synched.
stdout:
stdout: DataLoader integration test
stdout: 1 0 2 tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
stdout: 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
stdout: 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
stdout: 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71,
stdout: 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
stdout: 90, 91, 92, 93, 94, 95], device='cuda:0') <class 'accelerate.data_loader.DataLoaderShard'>
stdout: tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
stdout: 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
stdout: 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
stdout: 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71,
stdout: 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
stdout: 90, 91, 92, 93, 94, 95], device='cuda:2') <class 'accelerate.data_loader.DataLoaderShard'>
stdout: tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
stdout: 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
stdout: 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
stdout: 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71,
stdout: 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
stdout: 90, 91, 92, 93, 94, 95], device='cuda:1') <class 'accelerate.data_loader.DataLoaderShard'>
stderr: Traceback (most recent call last):
stderr: File "/usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py", line 708, in <module>
stderr: Traceback (most recent call last):
stderr: File "/usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py", line 708, in <module>
stderr: Traceback (most recent call last):
stderr: File "/usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py", line 708, in <module>
stderr: main()
stderr: File "/usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py", line 687, in main
stderr: main()
stderr: File "/usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py", line 687, in main
stderr: dl_preparation_check()
stderr: File "/usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py", line 196, in dl_preparation_check
stderr: dl = prepare_data_loader(
stderr: File "/usr/local/lib/python3.9/dist-packages/accelerate/data_loader.py", line 858, in prepare_data_loader
stderr: dl_preparation_check()
stderr: File "/usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py", line 196, in dl_preparation_check
stderr: main()
stderr: dl = prepare_data_loader( File "/usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py", line 687, in main
stderr:
stderr: File "/usr/local/lib/python3.9/dist-packages/accelerate/data_loader.py", line 858, in prepare_data_loader
stderr: raise ValueError(
stderr: ValueError: To use a `DataLoader` in `split_batches` mode, the batch size (8) needs to be a round multiple of the number of processes (3).
stderr: dl_preparation_check()
stderr: File "/usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py", line 196, in dl_preparation_check
stderr: raise ValueError(
stderr: ValueError: To use a `DataLoader` in `split_batches` mode, the batch size (8) needs to be a round multiple of the number of processes (3).
stderr: dl = prepare_data_loader(
stderr: File "/usr/local/lib/python3.9/dist-packages/accelerate/data_loader.py", line 858, in prepare_data_loader
stderr: raise ValueError(
stderr: ValueError: To use a `DataLoader` in `split_batches` mode, the batch size (8) needs to be a round multiple of the number of processes (3).
stderr: [2024-03-14 13:39:21,081] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 27324) of binary: /usr/bin/python3.9
stderr: Traceback (most recent call last):
stderr: File "/usr/local/bin/accelerate-launch", line 8, in <module>
stderr: sys.exit(main())
stderr: File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 1029, in main
stderr: launch_command(args)
stderr: File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 1010, in launch_command
stderr: multi_gpu_launcher(args)
stderr: File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 672, in multi_gpu_launcher
stderr: distrib_run.run(args)
stderr: File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 803, in run
stderr: elastic_launch(
stderr: File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
stderr: return launch_agent(self._config, self._entrypoint, list(args))
stderr: File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
stderr: raise ChildFailedError(
stderr: torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
stderr: ============================================================
stderr: /usr/local/lib/python3.9/dist-packages/accelerate/test_utils/scripts/test_script.py FAILED
stderr: ------------------------------------------------------------
stderr: Failures:
stderr: [1]:
stderr: time : 2024-03-14_13:39:21
stderr: host : 919c8ff8c821
stderr: rank : 1 (local_rank: 1)
stderr: exitcode : 1 (pid: 27325)
stderr: error_file: <N/A>
stderr: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: [2]:
stderr: time : 2024-03-14_13:39:21
stderr: host : 919c8ff8c821
stderr: rank : 2 (local_rank: 2)
stderr: exitcode : 1 (pid: 27326)
stderr: error_file: <N/A>
stderr: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: ------------------------------------------------------------
stderr: Root Cause (first observed failure):
stderr: [0]:
stderr: time : 2024-03-14_13:39:21
stderr: host : 919c8ff8c821
stderr: rank : 0 (local_rank: 0)
stderr: exitcode : 1 (pid: 27324)
stderr: error_file: <N/A>
stderr: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: ============================================================
Can someone help me?
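For what it's worth, the failure in this last log is not an NCCL error: `accelerate test` stops in `dl_preparation_check` because, with `split_batches`, one dataloader batch is divided across all processes, so the batch size must be divisible by the number of processes. A minimal sketch of the arithmetic behind the error message (the batch size of 8 comes from the message itself):

```python
# Illustrative only: the divisibility rule described by the ValueError above.
batch_size = 8                      # batch size reported in the error message
for num_processes in (2, 3, 4):
    if batch_size % num_processes == 0:
        print(f"{num_processes} processes: ok, {batch_size // num_processes} samples per process")
    else:
        print(f"{num_processes} processes: fails, {batch_size} is not a round multiple of {num_processes}")
```

So with the 3-GPU config above, this particular check is expected to fail; running the test with 2 or 4 processes avoids that ValueError, though it is separate from the NCCL problems discussed earlier in the thread.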