
Geministudents opened this issue · 21 comments

A problem when running "accelerate test"

Running:  accelerate-launch /venv/L_E_T/lib/python3.8/site-packages/accelerate/test_utils/scripts/test_script.py
stdout: Wandb import failed
stdout: Wandb import failed
stdout: Wandb import failed
stdout: Wandb import failed
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: MEGATRON_LM  Backend: nccl
stdout: Num processes: 2
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout: 
stdout: Mixed precision type: fp16
stdout: 
stdout: 
stdout: **Test process execution**
stdout: 
stdout: **Test split between processes as a list**
stdout: 
stdout: **Test split between processes as a dict**
stderr: Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
stdout: Distributed environment: MEGATRON_LM  Backend: nccl
stdout: Num processes: 2
stdout: Process index: 1
stdout: Local process index: 1
stdout: Device: cuda:1
stdout: 
stdout: Mixed precision type: fp16
stdout: 
stdout: 
stdout: **Test split between processes as a tensor**
stderr: Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
stderr: Traceback (most recent call last):
stderr:   File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/test_utils/scripts/test_script.py", line 626, in <module>
stderr:     main()
stderr:   File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/test_utils/scripts/test_script.py", line 588, in main
stderr:     process_execution_check()
stderr:   File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/test_utils/scripts/test_script.py", line 85, in process_execution_check
stderr:     assert text.startswith("Currently in the main process\n"), "Main process was not first"
stderr: AssertionError: Main process was not first
stdout: 
stdout: **Test random number generator synchronization**
stderr: WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 12484 closing signal SIGTERM
stderr: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 12485) of binary: /venv/L_E_T/bin/python
stderr: Traceback (most recent call last):
stderr:   File "/venv/L_E_T/bin/accelerate-launch", line 8, in <module>
stderr:     sys.exit(main())
stderr:   File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1000, in main
stderr:     launch_command(args)
stderr:   File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/commands/launch.py", line 983, in launch_command
stderr:     multi_gpu_launcher(args)
stderr:   File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
stderr:     distrib_run.run(args)
stderr:   File "/venv/L_E_T/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
stderr:     elastic_launch(
stderr:   File "/venv/L_E_T/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
stderr:     return launch_agent(self._config, self._entrypoint, list(args))
stderr:   File "/venv/L_E_T/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
stderr:     raise ChildFailedError(
stderr: torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
stderr: ============================================================
stderr: /venv/L_E_T/lib/python3.8/site-packages/accelerate/test_utils/scripts/test_script.py FAILED
stderr: ------------------------------------------------------------
stderr: Failures:
stderr:   <NO_OTHER_FAILURES>
stderr: ------------------------------------------------------------
stderr: Root Cause (first observed failure):
stderr: [0]:
stderr:   time      : 2023-10-26_16:34:39
stderr:   host      : gpu-8-36
stderr:   rank      : 1 (local_rank: 1)
stderr:   exitcode  : 1 (pid: 12485)
stderr:   error_file: <N/A>
stderr:   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: ============================================================
Traceback (most recent call last):
  File "/venv/L_E_T/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/commands/test.py", line 54, in test_command
    result = execute_subprocess_async(cmd, env=os.environ.copy())
  File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/test_utils/testing.py", line 407, in execute_subprocess_async
    raise RuntimeError(
RuntimeError: 'accelerate-launch /venv/L_E_T/lib/python3.8/site-packages/accelerate/test_utils/scripts/test_script.py' failed with returncode 1

The combined stderr from workers follows:
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Traceback (most recent call last):
  File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/test_utils/scripts/test_script.py", line 626, in <module>
    main()
  File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/test_utils/scripts/test_script.py", line 588, in main
    process_execution_check()
  File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/test_utils/scripts/test_script.py", line 85, in process_execution_check
    assert text.startswith("Currently in the main process\n"), "Main process was not first"
AssertionError: Main process was not first
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 12484 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 12485) of binary: /venv/L_E_T/bin/python
Traceback (most recent call last):
  File "/venv/L_E_T/bin/accelerate-launch", line 8, in <module>
    sys.exit(main())
  File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1000, in main
    launch_command(args)
  File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/commands/launch.py", line 983, in launch_command
    multi_gpu_launcher(args)
  File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
    distrib_run.run(args)
  File "/venv/L_E_T/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/venv/L_E_T/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/venv/L_E_T/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/venv/L_E_T/lib/python3.8/site-packages/accelerate/test_utils/scripts/test_script.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-10-26_16:34:39
  host      : gpu-8-36
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 12485)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
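
For context, the assertion that fails comes from process_execution_check() in Accelerate's bundled test_script.py. The sketch below shows the kind of pattern that check exercises, under the assumption that each rank appends a line to a shared file inside accelerator.main_process_first() and the file is then expected to begin with the main process's line; it is not the exact test code, and the file name is illustrative.

```python
# Minimal sketch of the pattern behind the failing assertion (an assumption
# about what process_execution_check() exercises, not the exact test_script.py
# code). Each rank appends a line to a shared file inside main_process_first();
# the resulting file is expected to start with the main process's line.
from accelerate import Accelerator

accelerator = Accelerator()
log_path = "main_process_first_check.txt"  # hypothetical path, for illustration only

with accelerator.main_process_first():
    with open(log_path, "a") as f:
        if accelerator.is_main_process:
            f.write("Currently in the main process\n")
        else:
            f.write(f"Now on process {accelerator.process_index}\n")

accelerator.wait_for_everyone()

with open(log_path) as f:
    text = f.read()

# This mirrors the assertion that fails in the log above (raised on rank 1).
assert text.startswith("Currently in the main process\n"), "Main process was not first"
```

If this pattern fails, the ordering guarantee of main_process_first() was not respected on this setup, which points at process synchronization rather than at any training code.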

config.yaml:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MEGATRON_LM
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
megatron_lm_config:
  megatron_lm_gradient_clipping: 1.0
  megatron_lm_num_micro_batches: 2
  megatron_lm_pp_degree: 2
  megatron_lm_recompute_activations: true
  megatron_lm_sequence_parallelism: true
  megatron_lm_tp_degree: 2
  megatron_lm_use_distributed_optimizer: true
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
main_process_port: 1234
use_cpu: false
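
As a narrowing-down step (not a confirmed fix), it may be worth checking whether accelerate test passes on the same machine with a plain multi-GPU config, since test_script.py is a generic check and the MEGATRON_LM setting adds tensor/pipeline parallelism on top of it. A minimal config of that kind, assuming two GPUs on a single node, might look like:

```yaml
# Hypothetical plain multi-GPU config for comparison, run e.g. with:
#   accelerate test --config_file multi_gpu_config.yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
main_process_port: 1234
use_cpu: false
```

If the plain MULTI_GPU run passes, the problem is more likely specific to the Megatron-LM integration than to the GPUs, NCCL, or the old 3.10.0 kernel flagged in the warnings.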

Hoping for some help.

Geministudents · Oct 26 '23 08:10