`accelerate test` failing on an 8-GPU Docker instance
System Info
Host machine:
- Docker version 20.10.22, build 3a2c30b
- 8x A100, 80GB
Docker file:
FROM nvidia/cuda:11.6.2-devel-ubuntu20.04
RUN apt update --fix-missing
RUN apt install -y vim git
RUN apt install -y python3 python3-dev python3-pip
RUN ln -sf python3 /usr/bin/python
RUN pip3 install --no-cache --upgrade pip setuptools
RUN pip3 install accelerate
COPY accelerate_config.yaml /root/.cache/huggingface/accelerate/default_config.yaml
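For reference, the image above would be built and launched roughly like this (the exact `docker run` invocation isn't given in the report; the image tag and the flags below are assumptions):

```bash
# Build the image from the Dockerfile above (tag name is arbitrary).
docker build -t accelerate-test .

# Start it with all GPUs passed through. --ipc=host shares the host's /dev/shm
# with the container, which multi-process PyTorch workloads generally need.
docker run --rm -it --gpus all --ipc=host accelerate-test bash
```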
Accelerate config:
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
Accelerate env:
- `Accelerate` version: 0.15.0
- Platform: Linux-5.4.0-1078-kvm-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.24.0
- PyTorch version (GPU?): 1.13.1+cu117 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: no
- use_cpu: False
- dynamo_backend: NO
- num_processes: 8
- machine_rank: 0
- num_machines: 1
- gpu_ids: all
- main_process_ip: None
- main_process_port: None
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {}
- fsdp_config: {}
- megatron_lm_config: {}
- downcast_bf16: no
- tpu_name: None
- tpu_zone: None
- command_file: None
- commands: None
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [ ] My own task or dataset (give details below)
Reproduction/Expectation/Core Issue
- Run the above Docker instance with 8 GPUs passed through
- Run `accelerate test`

I get the error:
stderr: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 35215) of binary: /usr/bin/python3
The same test runs without errors on the bare-metal machine.
Do you all have any ideas what might be going on?
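Not a definitive diagnosis, but exit code -7 from torchelastic means the worker process was killed by signal 7 (SIGBUS), and inside a container that most often points at shared-memory exhaustion rather than at Accelerate or CUDA. Two quick checks inside the container (both commands are generic, nothing here is specific to this setup):

```bash
# Signal 7 on x86-64 Linux is SIGBUS.
kill -l 7

# Docker gives containers a 64 MB /dev/shm by default unless --shm-size or
# --ipc=host is used; check what this container actually got.
df -h /dev/shm
```

If /dev/shm turns out to be the default 64 MB, relaunching the container with `--ipc=host` or a larger `--shm-size` (as in the run command sketched above) is the usual first thing to try.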
What about CPU-only config?
I also tried a CPU-only config, but `accelerate test` gave me assertion errors (both in Docker and on bare metal):
stderr: Traceback (most recent call last):
stderr: File "/usr/local/lib/python3.8/dist-packages/accelerate/test_utils/scripts/test_script.py", line 359, in <module>
stderr: main()
stderr: File "/usr/local/lib/python3.8/dist-packages/accelerate/test_utils/scripts/test_script.py", line 350, in main
stderr: training_check()
stderr: File "/usr/local/lib/python3.8/dist-packages/accelerate/test_utils/scripts/test_script.py", line 271, in training_check
stderr: accelerator.backward(loss)
stderr: File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1314, in backward
stderr: self.scaler.scale(loss).backward(**kwargs)
stderr: File "/usr/local/lib/python3.8/dist-packages/torch/cuda/amp/grad_scaler.py", line 164, in scale
stderr: assert outputs.is_cuda or outputs.device.type == 'xla'
stderr: AssertionError
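The traceback shows the failure inside `torch.cuda.amp.GradScaler.scale`, which only accepts CUDA (or XLA) tensors, so a scaler is apparently being constructed even though the run is CPU-only. On a machine where CUDA is available, the same assertion can be reproduced outside Accelerate with a one-liner (purely illustrative):

```bash
# GradScaler.scale() asserts its input lives on a CUDA (or XLA) device, so a
# plain CPU tensor trips the same AssertionError as in the traceback above.
python -c "import torch; torch.cuda.amp.GradScaler().scale(torch.ones(1))"
```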
The environment:
- `Accelerate` version: 0.15.0
- Platform: Linux-5.4.0-1078-kvm-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.24.0
- PyTorch version (GPU?): 1.13.1+cu117 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_CPU
- mixed_precision: no
- use_cpu: True
- dynamo_backend: NO
- num_processes: 4
- machine_rank: 0
- num_machines: 1
- gpu_ids: None
- main_process_ip: None
- main_process_port: None
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {}
- fsdp_config: {}
- megatron_lm_config: {}
- downcast_bf16: no
- tpu_name: None
- tpu_zone: None
- command_file: None
- commands: None
`accelerate test` also failed for me on a machine with two A100 (80GB) GPUs.
Here is my output:
Running: accelerate-launch --config_file=None /home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: MULTI_GPU Backend: nccl
stdout: Num processes: 2
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout: Mixed precision type: no
stdout:
stdout:
stdout: **Test random number generator synchronization**
stdout: Distributed environment: MULTI_GPU Backend: nccl
stdout: Num processes: 2
stdout: Process index: 1
stdout: Local process index: 1
stdout: Device: cuda:1
stdout: Mixed precision type: no
stdout:
stderr: [E ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1808182 milliseconds before timing out.
stderr: [E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1808116 milliseconds before timing out.
stderr: Traceback (most recent call last):
stderr: File "/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py", line 359, in <module>
stderr: main()
stderr: File "/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py", line 336, in main
stderr: rng_sync_check()
stderr: File "/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py", line 44, in rng_sync_check
stderr: synchronize_rng_states(["torch"])
stderr: File "/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/accelerate/utils/random.py", line 88, in synchronize_rng_states
stderr: synchronize_rng_state(RNGType(rng_type), generator=generator)
stderr: File "/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/accelerate/utils/random.py", line 77, in synchronize_rng_state
stderr: torch.set_rng_state(rng_state)
stderr: File "/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/torch/random.py", line 18, in set_rng_state
stderr: default_generator.set_state(new_state)
stderr: RuntimeError: Invalid mt19937 state
stderr: Traceback (most recent call last):
stderr: File "/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py", line 359, in <module>
stderr: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
stderr: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
stderr: main()
stderr: File "/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py", line 336, in main
stderr: rng_sync_check()
stderr: File "/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py", line 45, in rng_sync_check
stderr: assert are_the_same_tensors(torch.get_rng_state()), "RNG states improperly synchronized on CPU."
stderr: AssertionError: RNG states improperly synchronized on CPU.
stderr: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 859423) of binary: /home/user/anaconda3/envs/task_temp/bin/python
Test is a success! You are ready for your distributed training!
In my case (RNG states improperly synchronized on CPU), PCI ACS was the culprit; disabling it in the BIOS solved the problem. See https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#pci-access-control-services-acs
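In case it helps others with the same symptom: whether ACS is currently enabled can be checked from the running system before rebooting into the BIOS. A minimal sketch (the `lspci` check follows the NCCL troubleshooting guide linked above; the second command just re-runs the failing test with NCCL logging turned on):

```bash
# Any bridge reporting "SrcValid+" here has ACS enabled, which can redirect
# GPU peer-to-peer traffic through the CPU and break or stall NCCL.
sudo lspci -vvv | grep -i ACSCtl

# Re-run the failing test with verbose NCCL logs to see which transport is used.
NCCL_DEBUG=INFO accelerate test
```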
I do not understand what happens when I run the `accelerate test` command on A100 (80GB) GPUs.
My environment:
- `Accelerate` version: 0.18.0
- Platform: Linux-5.4.0-132-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.23.5
- PyTorch version (GPU?): 1.9.1+cu111 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: DEEPSPEED
- mixed_precision: fp16
- use_cpu: False
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {'gradient_accumulation_steps': 2, 'gradient_clipping': 1.0, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero_stage': 2}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
- dynamo_config: {'dynamo_backend': 'INDUCTOR', 'dynamo_mode': 'default', 'dynamo_use_dynamic': True, 'dynamo_use_fullgraph': False}
It doesn't show an error, but it also never reports a successful test; see this [screenshot](https://github.com/Mohammed20201991/DataSets/blob/main/issue_with%20accelreate%20training.JPG).
Any updates on this issue? I have the same problem running `accelerate test` on my machine with 4 Tesla V100s:
stderr: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0
It works fine if I select only 2 GPUs in the multi-GPU config, but 3 or 4 raises the error.
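Not a fix, but when the failure depends on how many GPUs are selected, it is worth checking the inter-GPU topology and ruling out peer-to-peer issues first. A hedged sketch (both commands are standard; `NCCL_P2P_DISABLE=1` only forces NCCL onto non-P2P transports for the duration of the test):

```bash
# Show how the GPUs are connected (NVLink, PCIe switch, across CPU sockets, ...).
nvidia-smi topo -m

# If the test passes with P2P disabled, the problem is likely in the P2P path
# (IOMMU/ACS, PCIe topology) rather than in Accelerate itself.
NCCL_P2P_DISABLE=1 accelerate test
```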
I can run with a small amount of data, but large-scale data causes errors:
Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
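A watchdog timeout like this usually means one rank never reached the collective (for example because of uneven data across processes), rather than NCCL itself failing. One way to narrow it down, assuming a recent PyTorch and with `train.py` standing in for your own script:

```bash
# Log every NCCL call and have torch.distributed flag mismatched collectives;
# this usually identifies the stuck rank/operation well before the 30-minute
# NCCL watchdog fires.
NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL accelerate launch train.py
```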
I ran into the same problem. How did you solve it?