DeepSpeed

[ERROR] [launch.py:321:sigkill_handler] exits with return code = -11

Open shag1802 opened this issue 1 year ago • 4 comments

I am trying to fine-tune an LLM by running a fine-tuning script (https://github.com/PKU-YuanGroup/Video-LLaVA/blob/main/scripts/v1_5/finetune.sh). I am using zero2_offload.json. Shortly after I start the script, it terminates on its own with return code = -11.
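For readers hitting the same message: a negative return code usually means the worker process was killed by a signal rather than exiting normally. The sketch below decodes such a code, assuming the launcher reports the child's return code using the Python `subprocess` convention (negative value = signal number). The helper name `describe_return_code` is made up for illustration and is not part of DeepSpeed.

```python
import signal


def describe_return_code(code: int) -> str:
    """Explain a subprocess-style return code (hypothetical helper)."""
    if code < 0:
        # By the subprocess convention, -N means "terminated by signal N".
        signum = -code
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_return_code(-11))
```

On Linux, signal 11 is SIGSEGV, so -11 points to a segmentation fault in native code (for example a CUDA, NCCL, or compiled extension library) rather than a Python exception.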

This is the finetune_script I am using: [image]

This is the error I am getting: [image]

ds_report output: [image]

**System info**

  • OS: Ubuntu 22.04.4 LTS
  • GPU count and types: 4x Tesla T4
  • Python version: 3.10.14

**Additional context** I am running on AWS. I have also checked issue #4002, but the error still persists.

I am getting this output when I use df -h: [image]

Please help me resolve this error.

shag1802 avatar Jun 21 '24 11:06 shag1802

@shag1802 - can you share your shm size if using docker at all?
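One way to check this from inside the container is sketched below. It assumes shared memory is mounted at `/dev/shm`, as is usual on Linux; the helper name `shm_usage_gib` is made up for illustration.

```python
import os
import shutil


def shm_usage_gib(path: str = "/dev/shm") -> dict:
    """Return total/used/free space of the given mount, in GiB."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {
        "total": usage.total / gib,
        "used": usage.used / gib,
        "free": usage.free / gib,
    }


# Only report if the mount exists (it may not on non-Linux systems).
if os.path.isdir("/dev/shm"):
    info = shm_usage_gib()
    print(f"/dev/shm: total={info['total']:.2f} GiB, free={info['free']:.2f} GiB")
```

If the total turns out to be very small (Docker defaults to 64 MB), the container can be started with a larger value via `docker run --shm-size=...`.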

loadams avatar Jul 22 '24 22:07 loadams

I got the same error

TengfeiSong000 avatar Aug 05 '24 01:08 TengfeiSong000

I got the same error, and my shm size is 90 GB.

Williamleejx avatar Aug 08 '25 07:08 Williamleejx

@loadams When I try to run the CIFAR-10 pipeline parallelism example provided here: https://github.com/deepspeedai/DeepSpeedExamples/tree/master/training/pipeline_parallelism with the exact code from the examples, I get this error:

[2025-09-23 20:59:36,347] [ERROR] [launch.py:341:sigkill_handler] ['/usr/bin/python', '-u', 'train.py', '--local_rank=1', '--deepspeed_config=ds_config.json', '-p', '2', '--steps=200'] exits with return code = -11

I ran this on a 2x A100-SXM4-80GB node, with code and config exactly as in the DeepSpeed CIFAR-10 example. (I am using Docker, as I'm renting GPUs via RunPod.)

Edit: I was able to fix this by setting the following environment variables:

export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=eth0
export TORCH_NCCL_BLOCKING_WAIT=1
export NCCL_ASYNC_ERROR_HANDLING=1
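
If the job is started from a Python wrapper instead of a shell, the same variables can be passed through the child's environment. This is only a sketch of that idea: the helper name `build_launch_env` is made up, `eth0` is just the interface name from my setup (check yours with `ip addr`), and the launch command in the comment is an assumption based on the log line above.

```python
import os

# Same values as the shell exports above.
NCCL_WORKAROUND_ENV = {
    "NCCL_IB_DISABLE": "1",            # do not use the InfiniBand transport
    "NCCL_SOCKET_IFNAME": "eth0",      # interface NCCL should use for sockets
    "TORCH_NCCL_BLOCKING_WAIT": "1",
    "NCCL_ASYNC_ERROR_HANDLING": "1",
}


def build_launch_env(base=None, ifname="eth0"):
    """Return a copy of the environment with the NCCL workaround applied."""
    env = dict(os.environ if base is None else base)
    env.update(NCCL_WORKAROUND_ENV)
    env["NCCL_SOCKET_IFNAME"] = ifname
    return env


env = build_launch_env()
# Assumed launch command, not executed here:
# subprocess.run(
#     ["deepspeed", "train.py", "--deepspeed_config=ds_config.json",
#      "-p", "2", "--steps=200"],
#     env=env, check=True)
```

The variables have to be set before the worker processes start, since NCCL reads them when the communicator is initialized.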

benearnthof avatar Sep 23 '25 21:09 benearnthof