
[BUG] training [ERROR] [launch.py:434:sigkill_handler] exits with return code = -9

Open le153234 opened this issue 2 years ago • 16 comments

[2023-04-14 13:11:27,879] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 13266 [2023-04-14 13:11:27,885] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python3', '-u', 'main.py', '--local_rank=0', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', 'openai/webgpt_comparisons', 'stanfordnlp/SHP', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-125m', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '8', '--max_seq_len', '512', '--learning_rate', '9.65e-6', '--weight_decay', '0.1', '--num_train_epochs', '2', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '2', '--deepspeed', '--output_dir', '/content/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/125m'] exits with return code = -9
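A negative return code from the DeepSpeed launcher is the number of the POSIX signal that killed the worker process; this is a general Unix convention, not something specific to this log. A minimal sketch decoding it (signal numbers assume Linux, where 9 is SIGKILL, commonly the kernel OOM killer, and 7 is SIGBUS):

```python
import signal

def describe_exit(code: int) -> str:
    """Map a launcher return code to a human-readable description.

    Negative codes mean the subprocess was killed by that signal;
    non-negative codes are ordinary exit statuses.
    """
    if code < 0:
        return signal.Signals(-code).name
    return f"exit status {code}"

print(describe_exit(-9))  # SIGKILL
```

Under this reading, `return code = -9` means the training process was force-killed from outside (often by the OOM killer when host RAM is exhausted), which is why no Python traceback appears in the log.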

le153234 avatar Apr 14 '23 13:04 le153234

@le153234 there should be an output log at /content/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/125m/training.log - can you share the contents of that file?

mrwyattii avatar Apr 14 '23 16:04 mrwyattii

Attached the output log: 125m-training.log

le153234 avatar Apr 14 '23 22:04 le153234

125m-training.log

le153234 avatar Apr 14 '23 22:04 le153234

I got the same error with GPT-J 6B

puyuanOT avatar Apr 15 '23 01:04 puyuanOT

I got the same error, but with return code=-7

stainswei avatar Apr 20 '23 08:04 stainswei

I'm getting the same error code. I'm trying to run the demo setup with cuda:11.6, torch=1.12, cudnn=8.4.0, python=3.8.

MickeyJson avatar Apr 21 '23 02:04 MickeyJson

Same error here, with no detailed error message.

wsl cat /proc/version Linux version 5.15.90.1-microsoft-standard-WSL2 (oe-user@oe-host) (x86_64-msft-linux-gcc (GCC) 9.3.0, GNU ld (GNU Binutils) 2.34.0.20200220) #1 SMP Fri Jan 27 02:56:13 UTC 2023

ds_report

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meets the required dependencies to JIT install the op.

JIT compiled ops require ninja.
ninja .................. [OKAY]

op name ................ installed .. compatible

async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/home/dev/.local/lib/python3.8/site-packages/torch']
torch version .................... 2.0.0+cu117
deepspeed install path ........... ['/home/dev/.local/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.9.1, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7

afeilulu avatar Apr 22 '23 01:04 afeilulu

any updates?

iFocusing avatar Apr 26 '23 06:04 iFocusing

Hi,

Same issue for me; there is no detailed information about the error in my output. Is there any reference link for this error, or any update on this issue?

dineshreddy221 avatar May 06 '23 16:05 dineshreddy221

same here

hongyix avatar May 24 '23 03:05 hongyix

Same error

yingying123321 avatar Jun 18 '23 00:06 yingying123321

same error

Khachdallak02 avatar Jul 20 '23 07:07 Khachdallak02

same error

RanchiZhao avatar Aug 21 '23 07:08 RanchiZhao

same error

naginoa avatar Sep 26 '23 08:09 naginoa

same error

dsn01 avatar Jul 10 '24 07:07 dsn01