
`accelerate test` failing on an 8-GPU Docker instance

Open · chaselambda opened this issue 2 years ago · 6 comments

System Info

Host machine:
- Docker version 20.10.22, build 3a2c30b
- 8x A100, 80GB

Dockerfile:

FROM nvidia/cuda:11.6.2-devel-ubuntu20.04

RUN apt update --fix-missing
RUN apt install -y vim git
RUN apt install -y python3 python3-dev python3-pip
RUN ln -sf python3 /usr/bin/python

RUN pip3 install --no-cache --upgrade pip setuptools
RUN pip3 install accelerate

COPY accelerate_config.yaml /root/.cache/huggingface/accelerate/default_config.yaml
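
Assuming the Dockerfile above and accelerate_config.yaml sit in the same build context, the image can be built with something like this (the tag name is illustrative):

docker build -t accelerate-8gpu-test .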

Accelerate config:

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
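
For reference, this is the file that `accelerate config` normally writes to ~/.cache/huggingface/accelerate/default_config.yaml, which is why the Dockerfile copies it there. A minimal sketch of the equivalent steps without baking the file into the image, assuming it is saved as accelerate_config.yaml:

accelerate config                                      # interactive; writes the default config file shown above
accelerate test --config_file accelerate_config.yaml   # or rely on the default location instead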

Accelerate env:

- `Accelerate` version: 0.15.0
- Platform: Linux-5.4.0-1078-kvm-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.24.0
- PyTorch version (GPU?): 1.13.1+cu117 (True)
- `Accelerate` default config:
	- compute_environment: LOCAL_MACHINE
	- distributed_type: MULTI_GPU
	- mixed_precision: no
	- use_cpu: False
	- dynamo_backend: NO
	- num_processes: 8
	- machine_rank: 0
	- num_machines: 1
	- gpu_ids: all
	- main_process_ip: None
	- main_process_port: None
	- rdzv_backend: static
	- same_network: True
	- main_training_function: main
	- deepspeed_config: {}
	- fsdp_config: {}
	- megatron_lm_config: {}
	- downcast_bf16: no
	- tpu_name: None
	- tpu_zone: None
	- command_file: None
	- commands: None

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [ ] My own task or dataset (give details below)

Reproduction/Expectation/Core Issue

  • Run the above Docker image with all 8 GPUs passed through (a sketch of the assumed invocation follows this list)
  • Run accelerate test
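
A minimal sketch of the container invocation assumed for these steps (image tag and flags are illustrative). As an aside, the exitcode -7 reported below corresponds to SIGBUS, which inside Docker frequently comes from the default 64 MB /dev/shm being too small for multi-process PyTorch, so --ipc=host or a larger --shm-size is often worth trying:

docker run --rm -it --gpus all --ipc=host accelerate-8gpu-test   # or e.g. --shm-size=16g instead of --ipc=host
accelerate test                                                  # run inside the container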

I get the following error:

stderr: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 35215) of binary: /usr/bin/python3

The test runs fine with no errors on the bare-metal machine.

Do you all have any ideas what might be going on?

What about a CPU-only config?

I also tried a CPU-only config, but accelerate test gave me assertion errors (both in Docker and on bare metal):

stderr: Traceback (most recent call last):
stderr:   File "/usr/local/lib/python3.8/dist-packages/accelerate/test_utils/scripts/test_script.py", line 359, in <module>
stderr:     main()
stderr:   File "/usr/local/lib/python3.8/dist-packages/accelerate/test_utils/scripts/test_script.py", line 350, in main
stderr:     training_check()
stderr:   File "/usr/local/lib/python3.8/dist-packages/accelerate/test_utils/scripts/test_script.py", line 271, in training_check
stderr:     accelerator.backward(loss)
stderr:   File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1314, in backward
stderr:     self.scaler.scale(loss).backward(**kwargs)
stderr:   File "/usr/local/lib/python3.8/dist-packages/torch/cuda/amp/grad_scaler.py", line 164, in scale
stderr:     assert outputs.is_cuda or outputs.device.type == 'xla'
stderr: AssertionError

The environment:

- `Accelerate` version: 0.15.0
- Platform: Linux-5.4.0-1078-kvm-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.24.0
- PyTorch version (GPU?): 1.13.1+cu117 (True)
- `Accelerate` default config:
	- compute_environment: LOCAL_MACHINE
	- distributed_type: MULTI_CPU
	- mixed_precision: no
	- use_cpu: True
	- dynamo_backend: NO
	- num_processes: 4
	- machine_rank: 0
	- num_machines: 1
	- gpu_ids: None
	- main_process_ip: None
	- main_process_port: None
	- rdzv_backend: static
	- same_network: True
	- main_training_function: main
	- deepspeed_config: {}
	- fsdp_config: {}
	- megatron_lm_config: {}
	- downcast_bf16: no
	- tpu_name: None
	- tpu_zone: None
	- command_file: None
	- commands: None

chaselambda · Dec 20 '22 19:12

accelerate test also failed for me on a machine with two A100 (80GB) GPUs. Here is my output:

Running:  accelerate-launch --config_file=None /home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: MULTI_GPU  Backend: nccl
stdout: Num processes: 2
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout: Mixed precision type: no
stdout: 
stdout: 
stdout: **Test random number generator synchronization**
stdout: Distributed environment: MULTI_GPU  Backend: nccl
stdout: Num processes: 2
stdout: Process index: 1
stdout: Local process index: 1
stdout: Device: cuda:1
stdout: Mixed precision type: no
stdout: 
stderr: [E ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1808182 milliseconds before timing out.
stderr: [E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1808116 milliseconds before timing out.
stderr: Traceback (most recent call last):
stderr:   File "/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py", line 359, in <module>
stderr:     main()
stderr:   File "/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py", line 336, in main
stderr:     rng_sync_check()
stderr:   File "/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py", line 44, in rng_sync_check
stderr:     synchronize_rng_states(["torch"])
stderr:   File "/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/accelerate/utils/random.py", line 88, in synchronize_rng_states
stderr:     synchronize_rng_state(RNGType(rng_type), generator=generator)
stderr:   File "/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/accelerate/utils/random.py", line 77, in synchronize_rng_state
stderr:     torch.set_rng_state(rng_state)
stderr:   File "/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/torch/random.py", line 18, in set_rng_state
stderr:     default_generator.set_state(new_state)
stderr: RuntimeError: Invalid mt19937 state
stderr: Traceback (most recent call last):
stderr:   File "/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py", line 359, in <module>
stderr: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
stderr: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
stderr:     main()
stderr:   File "/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py", line 336, in main
stderr:     rng_sync_check()
stderr:   File "/home/user/anaconda3/envs/task_temp/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py", line 45, in rng_sync_check
stderr:     assert are_the_same_tensors(torch.get_rng_state()), "RNG states improperly synchronized on CPU."
stderr: AssertionError: RNG states improperly synchronized on CPU.
stderr: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 859423) of binary: /home/user/anaconda3/envs/task_temp/bin/python
Test is a success! You are ready for your distributed training!

youngwoo-yoon · Jan 02 '23 06:01

In my case (RNG states improperly synchronized on CPU), PCI ACS was the problem; disabling it in the BIOS solved it. See https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#pci-access-control-services-acs
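
Not an exhaustive check, but a quick way to see whether ACS is enabled before rebooting into the BIOS, roughly following that page:

sudo lspci -vvv | grep -i ACSCtl   # lines showing "SrcValid+" indicate ACS is active on that device

The same page also describes disabling ACS at runtime with setpci when no BIOS option is available.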

youngwoo-yoon · Mar 02 '23 05:03

I do not understand what happens when I run the accelerate test command on A100 (80GB) GPUs.

Copy and paste the text below into your GitHub issue

- `Accelerate` version: 0.18.0
- Platform: Linux-5.4.0-132-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.23.5
- PyTorch version (GPU?): 1.9.1+cu111 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: fp16
        - use_cpu: False
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {'gradient_accumulation_steps': 2, 'gradient_clipping': 1.0, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero_stage': 2}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []
        - dynamo_config: {'dynamo_backend': 'INDUCTOR', 'dynamo_mode': 'default', 'dynamo_use_dynamic': True, 'dynamo_use_fullgraph': False}

It doesn't show an error, but it also never reports that the test succeeded; see [this screenshot](https://github.com/Mohammed20201991/DataSets/blob/main/issue_with%20accelreate%20training.JPG).

Mohammed20201991 · Mar 27 '23 15:03

Any updates on this issue? I have the same problem running accelerate test on my machine with 4 Tesla V100s:

stderr: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0

It works fine if I select only 2 GPUs in the multi-GPU config, but selecting 3 or 4 raises the error.
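
A diagnostic sketch rather than a confirmed fix: since 2 GPUs work but 3 or 4 do not, it may be worth checking whether the peer-to-peer transport is involved by rerunning with P2P disabled (NCCL_P2P_DISABLE is a standard NCCL environment variable):

NCCL_P2P_DISABLE=1 accelerate test   # if this passes, suspect the PCIe P2P path (e.g. the ACS setting mentioned above)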

andrefz · May 09 '23 14:05

I can run with a small amount of data, but large-scale data can cause errors:

Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
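
A hedged first step for timeouts that only appear at scale is to surface the underlying NCCL error rather than just the watchdog message; both environment variables below are standard, and the script name is illustrative:

NCCL_DEBUG=INFO NCCL_ASYNC_ERROR_HANDLING=1 accelerate launch train.py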

965694547 · May 24 '23 06:05

> I can run with a small amount of data, but large-scale data can cause errors:
>
> Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.

I am running into the same problem as well. How did you solve it?

JerryDaHeLian · Mar 20 '24 00:03