`accelerate test` failing on an 8-GPU Docker instance
System Info
Host machine:
- Docker version 20.10.22, build 3a2c30b
- 8x A100, 80GB
Dockerfile:
FROM nvidia/cuda:11.6.2-devel-ubuntu20.04
RUN apt update --fix-missing
RUN apt install -y vim git
RUN apt install -y python3 python3-dev python3-pip
RUN ln -sf python3 /usr/bin/python
RUN pip3 install --no-cache --upgrade pip setuptools
RUN pip3 install accelerate
COPY accelerate_config.yaml /root/.cache/huggingface/accelerate/default_config.yaml
Accelerate config:
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
Accelerate env:
- `Accelerate` version: 0.15.0
- Platform: Linux-5.4.0-1078-kvm-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.24.0
- PyTorch version (GPU?): 1.13.1+cu117 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: no
- use_cpu: False
- dynamo_backend: NO
- num_processes: 8
- machine_rank: 0
- num_machines: 1
- gpu_ids: all
- main_process_ip: None
- main_process_port: None
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {}
- fsdp_config: {}
- megatron_lm_config: {}
- downcast_bf16: no
- tpu_name: None
- tpu_zone: None
- command_file: None
- commands: None
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [ ] My own task or dataset (give details below)
Reproduction/Expectation/Core Issue
- Run the above Docker instance with 8 GPUs passed through
- Run `accelerate test`

I get the error `stderr: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 35215) of binary: /usr/bin/python3`.
This test works fine and has no errors on the bare metal machine.
Do you all have any ideas what might be going on?
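In case it helps anyone narrow this down, here is a minimal hand-rolled sketch (the file name `smoke_test.py` is just an example, and this is not the official test script) that only exercises process launch, device placement, and a single barrier. If even this fails with `exitcode: -7`, the problem is in process startup or inter-process communication rather than in the training checks that `accelerate test` runs.

```python
# smoke_test.py -- a minimal sketch, not the official `accelerate test` script.
# Launch with:  accelerate launch smoke_test.py
from accelerate import Accelerator

def main():
    accelerator = Accelerator()
    # Each of the 8 processes should report its own rank and CUDA device.
    print(
        f"process {accelerator.process_index}/{accelerator.num_processes} "
        f"(local {accelerator.local_process_index}) on {accelerator.device}"
    )
    accelerator.wait_for_everyone()  # simple barrier across all processes

if __name__ == "__main__":
    main()
```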
What about CPU-only config?
I also tried a CPU-only config, but `accelerate test` gave me assertion errors (both in Docker and on bare metal):
stderr: Traceback (most recent call last):
stderr: File "/usr/local/lib/python3.8/dist-packages/accelerate/test_utils/scripts/test_script.py", line 359, in <module>
stderr: main()
stderr: File "/usr/local/lib/python3.8/dist-packages/accelerate/test_utils/scripts/test_script.py", line 350, in main
stderr: training_check()
stderr: File "/usr/local/lib/python3.8/dist-packages/accelerate/test_utils/scripts/test_script.py", line 271, in training_check
stderr: accelerator.backward(loss)
stderr: File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1314, in backward
stderr: self.scaler.scale(loss).backward(**kwargs)
stderr: File "/usr/local/lib/python3.8/dist-packages/torch/cuda/amp/grad_scaler.py", line 164, in scale
stderr: assert outputs.is_cuda or outputs.device.type == 'xla'
stderr: AssertionError
The environment:
- `Accelerate` version: 0.15.0
- Platform: Linux-5.4.0-1078-kvm-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.24.0
- PyTorch version (GPU?): 1.13.1+cu117 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_CPU
- mixed_precision: no
- use_cpu: True
- dynamo_backend: NO
- num_processes: 4
- machine_rank: 0
- num_machines: 1
- gpu_ids: None
- main_process_ip: None
- main_process_port: None
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {}
- fsdp_config: {}
- megatron_lm_config: {}
- downcast_bf16: no
- tpu_name: None
- tpu_zone: None
- command_file: None
- commands: None
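For context on the `AssertionError` in that traceback: PyTorch's CUDA `GradScaler` refuses to scale tensors that are not on a CUDA (or XLA) device, so if a scaler ends up enabled while the loss lives on the CPU, `scale()` fails with exactly that assertion. A minimal sketch, independent of Accelerate, that reproduces it on a machine where CUDA is available:

```python
# Minimal reproduction sketch of the assertion above (outside Accelerate).
# On a machine where CUDA is available, GradScaler is enabled by default,
# and scaling a CPU loss trips the `outputs.is_cuda` assertion.
import torch
from torch.cuda.amp import GradScaler

scaler = GradScaler()                          # enabled because CUDA is available
loss = torch.tensor(1.0, requires_grad=True)   # lives on the CPU
scaler.scale(loss)                             # AssertionError, as in the traceback
```

So the CPU-only failure looks like the run still went through the scaler code path even though `mixed_precision: 'no'` was set; that seems to be a separate problem from the Docker `exitcode: -7` one.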
`accelerate test` also failed for me on a machine with two A100 (80GB) GPUs. Here is my output:
Running: accelerate-launch --config_file=None /home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: MULTI_GPU Backend: nccl
stdout: Num processes: 2
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout: Mixed precision type: no
stdout:
stdout:
stdout: **Test random number generator synchronization**
stdout: Distributed environment: MULTI_GPU Backend: nccl
stdout: Num processes: 2
stdout: Process index: 1
stdout: Local process index: 1
stdout: Device: cuda:1
stdout: Mixed precision type: no
stdout:
stderr: [E ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1808182 milliseconds before timing out.
stderr: [E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1808116 milliseconds before timing out.
stderr: Traceback (most recent call last):
stderr: File "/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py", line 359, in <module>
stderr: main()
stderr: File "/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py", line 336, in main
stderr: rng_sync_check()
stderr: File "/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py", line 44, in rng_sync_check
stderr: synchronize_rng_states(["torch"])
stderr: File "/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/accelerate/utils/random.py", line 88, in synchronize_rng_states
stderr: synchronize_rng_state(RNGType(rng_type), generator=generator)
stderr: File "/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/accelerate/utils/random.py", line 77, in synchronize_rng_state
stderr: torch.set_rng_state(rng_state)
stderr: File "/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/torch/random.py", line 18, in set_rng_state
stderr: default_generator.set_state(new_state)
stderr: RuntimeError: Invalid mt19937 state
stderr: Traceback (most recent call last):
stderr: File "/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py", line 359, in <module>
stderr: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
stderr: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
stderr: main()
stderr: File "/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py", line 336, in main
stderr: rng_sync_check()
stderr: File "/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py", line 45, in rng_sync_check
stderr: assert are_the_same_tensors(torch.get_rng_state()), "RNG states improperly synchronized on CPU."
stderr: AssertionError: RNG states improperly synchronized on CPU.
stderr: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 859423) of binary: /home/user/anaconda3/envs/task_temp/bin/python
Test is a success! You are ready for your distributed training!
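For what it's worth, the RNG check that fails here is conceptually just "rank 0 broadcasts its torch RNG state and every rank adopts it". A rough sketch of that mechanism (simplified from what the test script actually does) shows why a timed-out or garbled broadcast surfaces as `Invalid mt19937 state` on one rank and the mismatch assertion on the other:

```python
# Rough sketch of the RNG synchronization being tested (simplified; the real
# implementation lives in accelerate/utils/random.py). If the NCCL broadcast
# hangs or delivers a corrupted buffer, set_rng_state() rejects it with
# "Invalid mt19937 state", and the ranks end up with different states.
import torch
import torch.distributed as dist

def sync_torch_rng_state(device):
    state = torch.get_rng_state().to(device)  # CPU RNG state as a byte tensor
    dist.broadcast(state, src=0)              # rank 0's state wins
    torch.set_rng_state(state.cpu())          # every rank adopts the same state
```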
In my case (`RNG states improperly synchronized on CPU`), PCI ACS was the problem; disabling it in the BIOS solved it. See https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#pci-access-control-services-acs
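If you want to confirm it really is GPU-to-GPU communication (and not Accelerate) before touching the BIOS, a tiny standalone NCCL check can reproduce the hang. This is only a sketch; the file name `nccl_check.py` is an example, launched with `torchrun --nproc_per_node=2 nccl_check.py`:

```python
# nccl_check.py -- minimal NCCL sanity check (hypothetical file name).
# Launch with:  torchrun --nproc_per_node=2 nccl_check.py
# If inter-GPU traffic is blocked (e.g. by PCI ACS), the all_reduce below
# hangs until the NCCL watchdog timeout, matching the errors above.
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    x = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(x)  # should print the number of ranks on every process
    print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```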
I do not understand what happens when running the `accelerate test` command on my A100 (80GB) GPUs.
Copy and paste the text below into your GitHub issue
- `Accelerate` version: 0.18.0
- Platform: Linux-5.4.0-132-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.23.5
- PyTorch version (GPU?): 1.9.1+cu111 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: DEEPSPEED
- mixed_precision: fp16
- use_cpu: False
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {'gradient_accumulation_steps': 2, 'gradient_clipping': 1.0, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero_stage': 2}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
- dynamo_config: {'dynamo_backend': 'INDUCTOR', 'dynamo_mode': 'default', 'dynamo_use_dynamic': True, 'dynamo_use_fullgraph': False}
It doesn't show an error, but it also doesn't report a successful test; see this [screenshot](https://github.com/Mohammed20201991/DataSets/blob/main/issue_with%20accelreate%20training.JPG).
Any updates on this issue? I have the same problem running `accelerate test` on my machine with 4 Tesla V100s: `stderr: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0`. It works fine if I select only 2 GPUs in the multi-GPU config, but 3 or 4 raises the error.
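That pattern (2 GPUs fine, 3 or 4 failing) often points at the peer-to-peer topology rather than at Accelerate itself. As a quick, non-conclusive check, you can ask PyTorch which GPU pairs report direct peer access; a mismatch between the working and failing GPU subsets is a useful hint (this is only a sketch, and NCCL can still fall back to other transport paths):

```python
# Sketch: print the pairwise peer-access matrix for all visible GPUs.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'available' if ok else 'NOT available'}")
```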
I can run with a small amount of data, but large-scale data causes errors: `Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.`
I am running into the same problem. How did you solve it?