
[ERROR] [launch.py:324:sigkill_handler] exits with return code = -9

faint32 opened this issue 1 year ago · 20 comments (status: Open)

[2023-04-09 13:43:32,793] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 11961
[2023-04-09 13:43:32,907] [ERROR] [launch.py:324:sigkill_handler] ['/home/seali/anaconda3/bin/python', '-u', 'examples/finetune.py', '--local_rank=0', '--model_name_or_path', 'gpt2', '--dataset_path', '/home/seali/LMFlow-main/data/alpaca/train', '--output_dir', '/home/seali/LMFlow-main/output_models/finetune', '--overwrite_output_dir', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--block_size', '512', '--per_device_train_batch_size', '1', '--deepspeed', 'configs/ds_config_zero3.json', '--bf16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--use_ram_optimized_load', 'False', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = -9

Return code = -9. Thanks for the help!

faint32 avatar Apr 09 '23 05:04 faint32

Same error. I get it when running scripts/run_finetune.sh with the LLaMA-7B model on an RTX 3090; there is no other error information. My environment is CUDA 11.7 and DeepSpeed 0.8.3. It seems like an error in DeepSpeed.

gugugu-469 avatar Apr 09 '23 12:04 gugugu-469

Same error. I get it when running scripts/run_finetune.sh with the LLaMA-7B model on an RTX 3090; there is no other error information. My environment is CUDA 11.7 and DeepSpeed 0.8.3. It seems like an error in DeepSpeed.

Me too: RTX 3060, 8 GB RAM, run_finetune.sh, CUDA 11.7.

faint32 avatar Apr 09 '23 14:04 faint32

Same error. I get it when running scripts/run_finetune.sh with the LLaMA-7B model on an RTX 3090; there is no other error information. My environment is CUDA 11.7 and DeepSpeed 0.8.3. It seems like an error in DeepSpeed.

Me too: RTX 3060, 8 GB RAM, run_finetune.sh, CUDA 11.7.

I also tried installing CUDA with conda (conda install cuda), but new errors occurred, such as /usr/bin/ld: cannot find -lxxxxx. After I solved those new errors, the old error still remained, and now I have no idea. Interestingly, though, I can run scripts/run_finetune_lora.sh successfully.

gugugu-469 avatar Apr 09 '23 14:04 gugugu-469

So interesting... I think the reason may be that the CPU or RAM is not big enough.

faint32 avatar Apr 09 '23 14:04 faint32

So interesting... I think the reason may be that the CPU or RAM is not big enough.

I think I can change the model to a smaller one and have another try.

gugugu-469 avatar Apr 09 '23 14:04 gugugu-469

Thanks for your interest in LMFlow! It is highly probable that the process was killed by the operating system because it ran out of RAM, i.e. CPU memory. This is a protection mechanism automatically triggered by Linux, since the OS has to protect itself from being killed (the OS lives in RAM as well). You may move to a server with larger RAM or try smaller models. Thanks 😄

research4pan avatar Apr 09 '23 16:04 research4pan
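
For reference, a minimal sketch of how to confirm that the Linux OOM killer sent the -9 (SIGKILL) and how much RAM is actually available. This is an illustrative snippet, not part of LMFlow; it assumes a standard Linux host, and reading the kernel log may require elevated privileges.

```python
# Illustrative sketch (not part of LMFlow): check for OOM-killer activity.
# Assumes Linux; `dmesg` may need elevated privileges on some systems.
import subprocess

# Search the kernel ring buffer for OOM-killer entries (same output as `dmesg`).
log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
hits = [line for line in log.splitlines()
        if "Out of memory" in line or "oom-killer" in line]
print("\n".join(hits) if hits else "No OOM-killer entries found")

# Report total and currently available RAM from /proc/meminfo.
with open("/proc/meminfo") as f:
    for line in f:
        if line.startswith(("MemTotal", "MemAvailable")):
            print(line.strip())
```

If the kill shows up there, the remedy is as suggested above: more RAM, a smaller model, or a less RAM-hungry DeepSpeed configuration.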

Thanks for your interest in LMFlow! It is highly probable that the process was killed by the operating system because it ran out of RAM, i.e. CPU memory. This is a protection mechanism automatically triggered by Linux, since the OS has to protect itself from being killed (the OS lives in RAM as well). You may move to a server with larger RAM or try smaller models. Thanks 😄

I solved the problem with larger RAM, and I am now facing a new error, the same as #177.

gugugu-469 avatar Apr 10 '23 03:04 gugugu-469

Thanks for providing more detailed information! I am wondering whether you are using multiple GPUs or a single GPU? You may try the approaches suggested for that issue to see if they work for you. Thanks 🙏

research4pan avatar Apr 10 '23 14:04 research4pan

My config: model size: 7B; dataset: 10k rows of instructions; device: 8x A100-80G, with 50 CPU cores and 500 GB of RAM.

Also KILLED. Damn...so sad

qiguanqiang avatar Apr 16 '23 15:04 qiguanqiang

My config: model size: 7B; dataset: 10k rows of instructions; device: 8x A100-80G, with 50 CPU cores and 500 GB of RAM.

Also KILLED. Damn...so sad

Thanks for your interest in LMFlow! This is quite strange, since this setting should be sufficient even for a 60B model. Could you please check whether there are any other users on that server? Also, it would be nice if you could provide the detailed error messages and commands so that we can check them for you. Thanks 😄

research4pan avatar Apr 17 '23 19:04 research4pan

We have 500 GB of memory in total and use <200 GB when fine-tuning a 7B model. For GPT-2, the memory used is much less. Hope this information helps.

research4pan avatar Apr 19 '23 19:04 research4pan

We have 500 GB of memory in total and use <200 GB when fine-tuning a 7B model. For GPT-2, the memory used is much less. Hope this information helps.

Wow, <200 GB? What is the size of the training dataset?

Dandelionym avatar Apr 20 '23 00:04 Dandelionym

My config: model size: 7B; dataset: 10k rows of instructions; device: 8x A100-80G, with 50 CPU cores and 500 GB of RAM. Also KILLED. Damn...so sad

Thanks for your interest in LMFlow! This is quite strange, since this setting should be sufficient even for a 60B model. Could you please check whether there are any other users on that server? Also, it would be nice if you could provide the detailed error messages and commands so that we can check them for you. Thanks 😄

Thanks for your attention. It's the same error message as this issue. I noticed that Grafana reported the RAM usage rushing to 400 GB, and then it crashed. I think it is weird, too. Grafana didn't manage to catch the memory usage peak; I'm not sure whether that's due to its second-level recording interval.

qiguanqiang avatar Apr 20 '23 03:04 qiguanqiang

Could you check the DeepSpeed config? Using the ZeRO-2 config instead of the ZeRO-3 config (configs/ds_config_zero3.json) may help. The high demand for RAM may be caused by DeepSpeed's ZeRO-Offload strategy, which moves some states to CPU memory.

shizhediao avatar Apr 20 '23 06:04 shizhediao
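
For illustration, here is a minimal sketch of a ZeRO-2 DeepSpeed config with CPU offload left out, written as a Python dict and dumped to a placeholder JSON path. This is not LMFlow's shipped config file; the key names follow the DeepSpeed documentation, and the output file name is hypothetical.

```python
# Hypothetical sketch of a ZeRO-2 config without CPU offload. Omitting the
# "offload_optimizer" block keeps optimizer states on the GPUs, trading GPU
# memory for lower CPU-RAM pressure. The output path is a placeholder; pass
# the resulting file to the training script via --deepspeed.
import json

ds_config_zero2 = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
        # No "offload_optimizer" entry here, so nothing is moved to CPU memory.
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

with open("ds_config_zero2_no_offload.json", "w") as f:
    json.dump(ds_config_zero2, f, indent=2)
```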

Could you check the DeepSpeed config? Using the ZeRO-2 config instead of the ZeRO-3 config (configs/ds_config_zero3.json) may help. The high demand for RAM may be caused by DeepSpeed's ZeRO-Offload strategy, which moves some states to CPU memory.

Thank you. However, it doesn't seem to make a difference: I commented out everything related to offload in the ZeRO-2 config, but it still crashed (7B model) when there was not enough RAM. My device has 64 GB of RAM and I'm using a 3090 Ti for fine-tuning with LoRA.

Dandelionym avatar Apr 20 '23 07:04 Dandelionym

64 GB of RAM is limited. Using a device with more than 200 GB would be better when running a 7B model.

shizhediao avatar May 15 '23 00:05 shizhediao

@shizhediao Hello, how much memory does a 34G model need?

xiaohangguo avatar Aug 31 '23 02:08 xiaohangguo

@shizhediao Hello, how much memory does a 34G model need?

How much is needed with ZeRO-2 and with ZeRO-3, respectively? Do you have any experience with this? The dataset is on the order of 20k samples, or so. What setups has the community run successfully?

xiaohangguo avatar Aug 31 '23 03:08 xiaohangguo

A 34B model? How many parameters does it have? I don't have concrete experience; every model is a bit different. You can estimate it with https://deepspeed.readthedocs.io/en/latest/memory.html

shizhediao avatar Sep 04 '23 21:09 shizhediao
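
For convenience, a short sketch of the memory estimators documented on that page. The model name and GPU counts below are placeholders; substitute your own model and hardware layout.

```python
# Sketch using DeepSpeed's documented memory estimators (see the
# "Memory Requirements" page linked above). "gpt2" and the GPU counts are
# placeholders -- replace them with your own model and cluster layout.
from transformers import AutoModelForCausalLM
from deepspeed.runtime.zero.stage_1_and_2 import estimate_zero2_model_states_mem_needs_all_live
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Each call prints the estimated per-GPU and per-CPU memory needed for the
# model states (parameters, gradients, optimizer states), with and without
# offloading, for the given hardware layout.
estimate_zero2_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=1)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=1)
```

Note that these estimates cover model states only; activations, the dataset, and framework overhead come on top of the printed numbers.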

OK, I'll give it a try.

A 34B model? How many parameters does it have? I don't have concrete experience; every model is a bit different. You can estimate it with https://deepspeed.readthedocs.io/en/latest/memory.html

xiaohangguo avatar Sep 05 '23 04:09 xiaohangguo