[ERROR] [launch.py:321:sigkill_handler] exits with return code = -11
I am trying to fine-tune an LLM by running the fine-tuning script at https://github.com/PKU-YuanGroup/Video-LLaVA/blob/main/scripts/v1_5/finetune.sh, using zero2_offload.json as the DeepSpeed config. After launching, the script terminates automatically with return code = -11 (signal 11, i.e. SIGSEGV, a segmentation fault).
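For reference, here is a minimal sketch of what a ZeRO-2 CPU-offload config like zero2_offload.json typically looks like. The values below are illustrative assumptions, not the actual file from the Video-LLaVA repo:

```bash
# Sketch of a ZeRO-2 config with optimizer CPU offload, written out as a file.
# All values are illustrative; the real zero2_offload.json in the repo may differ.
cat > zero2_offload.json <<'EOF'
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "fp16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
EOF
```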
This is the finetune script I am using:
This is the error I am getting:
ds_report output:
**System info**
- OS: Ubuntu 22.04.4 LTS
- GPU count and types: 4x Tesla T4
- Python version: 3.10.14
Additional context: I am running on an AWS cloud instance. I have also checked issue #4002, but the error still persists.
This is the output I get when I run `df -h`:
Please help me resolve this error.
@shag1802 - can you share your shm size if using docker at all?
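For anyone checking this in Docker: return code -11 is a segfault, and one common trigger is NCCL's shared-memory transport running out of space, since Docker's default `/dev/shm` is only 64 MB. A quick way to check the current size and relaunch with a larger one (the image name and size below are placeholders):

```bash
# Inspect the shared-memory mount inside the container
df -h /dev/shm

# Relaunch with a larger shm segment (image name and size are placeholders)
docker run --gpus all --shm-size=16g my-training-image:latest
```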
I got the same error.
I got the same error, and my shm size is 90 GB.
@loadams When I try to execute the CIFAR-10 pipeline parallelism example provided here: https://github.com/deepspeedai/DeepSpeedExamples/tree/master/training/pipeline_parallelism, with the exact code from the examples, I get the error:
[2025-09-23 20:59:36,347] [ERROR] [launch.py:341:sigkill_handler] ['/usr/bin/python', '-u', 'train.py', '--local_rank=1', '--deepspeed_config=ds_config.json', '-p', '2', '--steps=200'] exits with return code = -11
I ran this on a 2x A100-SXM4-80GB node, with code and config exactly as in the DeepSpeed CIFAR-10 example. (I am using Docker, as I'm renting GPUs via RunPod.)
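For completeness, this is the launch command, reconstructed from the argv in the sigkill log above (it matches the example's documented usage):

```bash
# Launch the pipeline-parallelism example across both GPUs
# (reconstructed from the argv shown in the error above)
deepspeed train.py --deepspeed_config=ds_config.json -p 2 --steps=200
```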
Edit: I was able to fix this by setting the following environment variables:
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=eth0
export TORCH_NCCL_BLOCKING_WAIT=1
export NCCL_ASYNC_ERROR_HANDLING=1
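For context on why this helps: `NCCL_IB_DISABLE=1` turns off the InfiniBand transport so NCCL falls back to plain TCP sockets, `NCCL_SOCKET_IFNAME=eth0` pins it to a specific network interface, and the last two variables make collective failures surface as visible exceptions rather than silent hangs or crashes. A transport that misbehaves inside a container is therefore a plausible cause of the original segfault.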