
stderr: AssertionError: Main process was not first

Geministudents opened this issue 2 years ago • 24 comments

A problem when running "accelerate test"

Running:  accelerate-launch /venv/L_E_T/lib/python3.8/site-packages/accelerate/test_utils/scripts/test_script.py
stdout: Wandb import failed
stdout: Wandb import failed
stdout: Wandb import failed
stdout: Wandb import failed
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: MEGATRON_LM  Backend: nccl
stdout: Num processes: 2
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout: 
stdout: Mixed precision type: fp16
stdout: 
stdout: 
stdout: **Test process execution**
stdout: 
stdout: **Test split between processes as a list**
stdout: 
stdout: **Test split between processes as a dict**
stderr: Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
stdout: Distributed environment: MEGATRON_LM  Backend: nccl
stdout: Num processes: 2
stdout: Process index: 1
stdout: Local process index: 1
stdout: Device: cuda:1
stdout: 
stdout: Mixed precision type: fp16
stdout: 
stdout: 
stdout: **Test split between processes as a tensor**
stderr: Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
stderr: Traceback (most recent call last):
stderr:   File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/test_utils/scripts/test_script.py", line 626, in <module>
stderr:     main()
stderr:   File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/test_utils/scripts/test_script.py", line 588, in main
stderr:     process_execution_check()
stderr:   File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/test_utils/scripts/test_script.py", line 85, in process_execution_check
stderr:     assert text.startswith("Currently in the main process\n"), "Main process was not first"
stderr: AssertionError: Main process was not first
stdout: 
stdout: **Test random number generator synchronization**
stderr: WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 12484 closing signal SIGTERM
stderr: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 12485) of binary: /venv/L_E_T/bin/python
stderr: Traceback (most recent call last):
stderr:   File "/venv/L_E_T/bin/accelerate-launch", line 8, in <module>
stderr:     sys.exit(main())
stderr:   File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1000, in main
stderr:     launch_command(args)
stderr:   File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/commands/launch.py", line 983, in launch_command
stderr:     multi_gpu_launcher(args)
stderr:   File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
stderr:     distrib_run.run(args)
stderr:   File "/venv/L_E_T/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
stderr:     elastic_launch(
stderr:   File "/venv/L_E_T/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
stderr:     return launch_agent(self._config, self._entrypoint, list(args))
stderr:   File "/venv/L_E_T/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
stderr:     raise ChildFailedError(
stderr: torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
stderr: ============================================================
stderr: /venv/L_E_T/lib/python3.8/site-packages/accelerate/test_utils/scripts/test_script.py FAILED
stderr: ------------------------------------------------------------
stderr: Failures:
stderr:   <NO_OTHER_FAILURES>
stderr: ------------------------------------------------------------
stderr: Root Cause (first observed failure):
stderr: [0]:
stderr:   time      : 2023-10-26_16:34:39
stderr:   host      : gpu-8-36
stderr:   rank      : 1 (local_rank: 1)
stderr:   exitcode  : 1 (pid: 12485)
stderr:   error_file: <N/A>
stderr:   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: ============================================================
Traceback (most recent call last):
  File "/venv/L_E_T/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/commands/test.py", line 54, in test_command
    result = execute_subprocess_async(cmd, env=os.environ.copy())
  File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/test_utils/testing.py", line 407, in execute_subprocess_async
    raise RuntimeError(
RuntimeError: 'accelerate-launch /venv/L_E_T/lib/python3.8/site-packages/accelerate/test_utils/scripts/test_script.py' failed with returncode 1

The combined stderr from workers follows:
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Traceback (most recent call last):
  File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/test_utils/scripts/test_script.py", line 626, in <module>
    main()
  File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/test_utils/scripts/test_script.py", line 588, in main
    process_execution_check()
  File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/test_utils/scripts/test_script.py", line 85, in process_execution_check
    assert text.startswith("Currently in the main process\n"), "Main process was not first"
AssertionError: Main process was not first
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 12484 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 12485) of binary: /venv/L_E_T/bin/python
Traceback (most recent call last):
  File "/venv/L_E_T/bin/accelerate-launch", line 8, in <module>
    sys.exit(main())
  File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1000, in main
    launch_command(args)
  File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/commands/launch.py", line 983, in launch_command
    multi_gpu_launcher(args)
  File "/venv/L_E_T/lib/python3.8/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
    distrib_run.run(args)
  File "/venv/L_E_T/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/venv/L_E_T/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/venv/L_E_T/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/venv/L_E_T/lib/python3.8/site-packages/accelerate/test_utils/scripts/test_script.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-10-26_16:34:39
  host      : gpu-8-36
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 12485)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

config.yaml:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MEGATRON_LM
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
megatron_lm_config:
  megatron_lm_gradient_clipping: 1.0
  megatron_lm_num_micro_batches: 2
  megatron_lm_pp_degree: 2
  megatron_lm_recompute_activations: true
  megatron_lm_sequence_parallelism: true
  megatron_lm_tp_degree: 2
  megatron_lm_use_distributed_optimizer: true
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
main_process_port: 1234
use_cpu: false

Hoping for help.

Geministudents avatar Oct 26 '23 08:10 Geministudents

I have the same problem. It seems to be a bug in the with accelerator.main_process_first(): block.

MangoFF avatar Oct 26 '23 11:10 MangoFF

@MangoFF what's your kernel version? As the error states, it's a known issue: if your OS kernel is below a certain version, features like main_process_first simply will not work. It's a limitation of how PyTorch interacts with the OS. I recommend a newer Linux kernel.

muellerzr avatar Oct 26 '23 11:10 muellerzr
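
For context, here is a minimal, illustrative sketch of the ordering pattern that accelerator.main_process_first() is meant to guarantee and that the failing assertion checks. This is not the exact code from test_script.py; the file name and messages below are placeholders.

```python
# Hypothetical sketch, not the actual accelerate test. It only illustrates
# the ordering guarantee that "Main process was not first" refers to.
from accelerate import Accelerator

accelerator = Accelerator()

# Code inside this context runs on the main process first; the other ranks
# wait at a barrier and only enter the block afterwards.
with accelerator.main_process_first():
    with open("order.txt", "a") as f:  # placeholder file name
        if accelerator.is_main_process:
            f.write("Currently in the main process\n")
        else:
            f.write(f"Now on process {accelerator.process_index}\n")

accelerator.wait_for_everyone()

# A check like the failing one would then read order.txt and assert that the
# main process's line comes first. On old kernels the underlying barrier can
# misbehave, another rank may write first, and the assertion fails.
```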

Thanks for your reply. My Ubuntu version: PRETTY_NAME="Ubuntu 22.04.2 LTS" NAME="Ubuntu" VERSION_ID="22.04" VERSION="22.04.2 LTS (Jammy Jellyfish)"

MangoFF avatar Oct 27 '23 01:10 MangoFF

When I changed the kernel on Ubuntu 22.04.2 LTS to 5.15.0-60-generic, it works. Thanks!

MangoFF avatar Oct 27 '23 02:10 MangoFF

Thanks for verifying @MangoFF.

I'll keep this issue open as a reference for others. If you run into any of the following:

  • Processes hang
  • Processes execute in the wrong order
  • accelerate test fails because the main process was not first

and you find Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. in your stack trace, then you need to upgrade your Linux kernel.

On our end:

If you experience this problem and upgrading your Linux kernel fixes it, it would be great for us to know a few things:

  1. The output of accelerate env
  2. What code was hanging
  3. Reacting to this comment with a 👍

If this is a wide enough issue, we will migrate this to a full RuntimeError instead of a regular warning.

muellerzr avatar Oct 27 '23 13:10 muellerzr
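
For readers wondering where the warning comes from: accelerate compares the running kernel version against 5.5.0 at startup. Below is a rough approximation of such a check (the real implementation in accelerate may differ; this is only meant to show how the version in the warning is obtained).

```python
# Hypothetical approximation of a kernel-version check like the one behind
# the warning; accelerate's actual implementation may differ.
import platform

from packaging import version

RECOMMENDED_MIN_KERNEL = "5.5.0"


def kernel_below_minimum(minimum: str = RECOMMENDED_MIN_KERNEL) -> bool:
    """Return True on Linux when the running kernel is older than `minimum`."""
    if platform.system() != "Linux":
        return False
    # platform.release() looks like "3.10.0-1160.el7.x86_64"; keep the numeric part.
    kernel = platform.release().split("-")[0]
    return version.parse(kernel) < version.parse(minimum)


if kernel_below_minimum():
    print(
        f"Detected kernel version {platform.release().split('-')[0]}, which is below "
        f"the recommended minimum of {RECOMMENDED_MIN_KERNEL}; this can cause the "
        "process to hang."
    )
```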

@muellerzr I am also getting this error:

stderr: Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.

Why do we have to upgrade the Linux kernel? Was there a previous version of accelerate where this worked without the upgrade? I cannot upgrade our servers, so what previous version of accelerate should we be using?

cmosguy avatar Oct 30 '23 01:10 cmosguy

Thanks for verifying @MangoFF. [...] If this is a wide enough issue, we will migrate this to a full RuntimeError instead of a regular warning.

Thank you for your reply. You have solved a big problem for everyone.

Geministudents avatar Oct 30 '23 03:10 Geministudents

Facing the same issue.

thak123 avatar Nov 13 '23 09:11 thak123

@muellerzr I am also getting this error: [...] I cannot upgrade our servers, so what previous version of accelerate should we be using?

Have you resolved this problem?

lash-1997 avatar Dec 13 '23 01:12 lash-1997

Facing the same issue!

ShashwatNigam99 avatar Jan 28 '24 05:01 ShashwatNigam99

Same issue here.

fwyc0573 avatar Feb 08 '24 05:02 fwyc0573

There is no prior version of accelerate that avoids this, because the issue is in torch itself. You could potentially try downgrading PyTorch.

muellerzr avatar Feb 08 '24 13:02 muellerzr

Is it a torch issue? I am using 2.2.

Why do you think it is a torch issue? Detected kernel version 4.14.105, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.

lucasjinreal avatar Feb 20 '24 02:02 lucasjinreal

I am on SageMaker, and on the same instance I sometimes get this warning!

naarkhoo avatar Feb 22 '24 12:02 naarkhoo

Thanks for verifying @MangoFF. [...] If this is a wide enough issue, we will migrate this to a full RuntimeError instead of a regular warning.

The Docker image is Ubuntu 22.04, running in a k8s environment, and the host kernel is EL8 4.18. In this case, can we only solve it by upgrading the host's kernel?

Trangle avatar Mar 04 '24 09:03 Trangle

There is no prior version of accelerate because this is a torch issue itself. You can try downgrading PyTorch potentially

Did you solve the problem? Which PyTorch version is compatible?

bmthanh avatar Mar 05 '24 18:03 bmthanh

Adding to the kernel version issue:

"Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher."

I am also on a shared server where I cannot upgrade the kernel currently. I think the issue is coming from the Trainer module? Downgrading to transformers v4.28 unfortunately didn't resolve this. Was anyone able to find a workaround?

dhuvik avatar Mar 07 '24 01:03 dhuvik

As mentioned, none of this stems from us; it's PyTorch specifically. The warning exists to let you know that one of the most common causes of this PyTorch bug we've found is the kernel version. Upgrading the kernel is the only solution we know of; otherwise you run the risk of processes hanging and failing to communicate.

muellerzr avatar Mar 07 '24 01:03 muellerzr

How do I update the Linux kernel?

Gangjiang1 avatar Apr 12 '24 06:04 Gangjiang1

How can the following problem be solved? [screenshot: 92b6d233337b7fdb048e77c5717494a]

chenchen333-dev avatar Apr 15 '24 01:04 chenchen333-dev

Facing the same issue in SageMaker, and the process is hanging: WARNING:accelerate.utils.other:Detected kernel version 4.14.336, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.

  • Accelerate version: 0.29.3
  • Platform: Linux-4.14.336-257.568.amzn2.x86_64-x86_64-with-glibc2.35
  • accelerate bash location: /opt/conda/bin/accelerate
  • Python version: 3.10.6
  • Numpy version: 1.26.4
  • PyTorch version (GPU?): 2.3.0+cu121 (True)
  • PyTorch XPU available: False
  • PyTorch NPU available: False
  • PyTorch MLU available: False
  • System RAM: 186.91 GB
  • GPU type: NVIDIA A10G
  • Accelerate default config: Not found

Any alternate fixes other than kernel upgrade?

harish-ganesh avatar May 03 '24 16:05 harish-ganesh

Has anyone found anything on this? It would be helpful to share here rather than each of us figuring it out separately.

aajinkya1203 avatar Sep 07 '24 07:09 aajinkya1203

Adding to the kernel version issue: [...] Was anyone able to find a workaround?

Hi, have you solved this problem? I am also on a shared server where I cannot upgrade the kernel.

Seperendity avatar Dec 19 '24 02:12 Seperendity

Changing the accelerate package version to 0.34.2 worked for me.

WWWsy03 avatar Feb 20 '25 10:02 WWWsy03