LMFlow
[ERROR] [launch.py:324:sigkill_handler] exits with return code = -9
[2023-04-09 13:43:32,793] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 11961
[2023-04-09 13:43:32,907] [ERROR] [launch.py:324:sigkill_handler] ['/home/seali/anaconda3/bin/python', '-u', 'examples/finetune.py', '--local_rank=0', '--model_name_or_path', 'gpt2', '--dataset_path', '/home/seali/LMFlow-main/data/alpaca/train', '--output_dir', '/home/seali/LMFlow-main/output_models/finetune', '--overwrite_output_dir', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--block_size', '512', '--per_device_train_batch_size', '1', '--deepspeed', 'configs/ds_config_zero3.json', '--bf16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--use_ram_optimized_load', 'False', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = -9
Exit code = -9. Thanks for the help!
Same error here, on an RTX 3090, when running scripts/run_finetune.sh with the LLaMA-7B model; there is no other error information. My environment is CUDA 11.7 and deepspeed 0.8.3. It looks like an error in deepspeed.
Me too: RTX 3060, 8 GB RAM, run_finetune.sh, CUDA 11.7.
I also tried conda install cuda, but a new error occurred, something like /usr/bin/ld: cannot find -lxxxxx. After I solved the new errors, the old error still existed, and now I have no idea. Interestingly, though, I can run scripts/run_finetune_lora.sh successfully.
So interesting... I think the reason may be that the CPU or RAM is not big enough.
I think I can change the model to a smaller one and have another try.
Thanks for your interest in LMFlow! It is highly probable that the process was killed by the operating system due to out of RAM, i.e. CPU memory. This is a protection mechanism automatically triggered by Linux, since the OS has to protect itself from being killed (OS lives in RAM as well). You may move to a server with larger RAM or try smaller models. Thanks 😄
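A minimal way to confirm the OOM hypothesis, as a sketch (assuming psutil is installed; it is not part of LMFlow): log available CPU RAM in a second terminal while the fine-tuning script runs, and check the kernel log afterwards (e.g. via dmesg) for "Out of memory: Killed process" entries.

```python
# Sketch: poll system RAM while training runs in another process,
# so the memory level right before the -9 kill is visible in the log.
# Requires `pip install psutil`; not part of LMFlow itself.
import time
import psutil

def log_memory(interval_s: float = 5.0) -> None:
    """Print total/available RAM and usage percentage every interval_s seconds."""
    while True:
        mem = psutil.virtual_memory()
        print(f"total={mem.total / 2**30:.1f} GiB  "
              f"available={mem.available / 2**30:.1f} GiB  "
              f"used={mem.percent:.1f}%",
              flush=True)
        time.sleep(interval_s)

if __name__ == "__main__":
    log_memory()
```

If the available figure collapses toward zero right before the process dies, the -9 is the Linux OOM killer rather than a bug in the training code.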
I solved the problem by moving to a machine with larger RAM, and I am now facing a new error, the same as #177.
Thanks for providing more detailed information! I am wondering if you are using multiple GPUs or a single GPU? You may try the approaches suggested for that issue to see if they work for you. Thanks 🙏
My config: model size 7B; dataset: 10k rows of instructions; device: 8x A100-80G, with 50 CPU cores and 500 GB of RAM.
Also KILLED. Damn... so sad.
Thanks for your interest in LMFlow! This is quite strange, since this setting should be sufficient even for a 60b model. Could you please check if there are any other users on that server? Also, it would be nice if you could provide detailed error messages and commands so that we may check that for you. Thanks 😄
We have 500 GB of memory in total and use <200 GB of it when fine-tuning a 7B model. For GPT-2, the memory used is much less. Hope this information helps.
Wow, <200 GB? What is the size of the training data set?
Thanks for your attention. It's the same error message as this issue. I noticed that Grafana reported the RAM rushing to 400 GB, and then it crashed. I think it is weird, too. Grafana didn't manage to catch the memory usage peak; I don't know if that's due to the second-level recording interval.
Could you check the DeepSpeed config? Using a ZeRO-2 config instead of the ZeRO-3 config may help. The high demand for RAM may be caused by the DeepSpeed ZeRO-Offload strategy, which moves some states to the CPU.
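For illustration, here is a sketch of a ZeRO-2 config with the CPU-offload sections left out. The file name ds_config_zero2_no_offload.json is made up for this example, and the "auto" values assume the Hugging Face Trainer DeepSpeed integration that the --deepspeed flag above appears to go through; your actual config may differ.

```python
# Sketch: write a ZeRO stage-2 DeepSpeed config with no CPU offload,
# so optimizer states stay in GPU memory instead of spilling into RAM.
import json

ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        # No "offload_optimizer" block here: nothing is moved to CPU RAM.
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    # "auto" lets the Hugging Face Trainer fill these from its own arguments.
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

with open("ds_config_zero2_no_offload.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

Passing the resulting file to --deepspeed in place of configs/ds_config_zero3.json makes it easy to compare RAM usage between the two stages.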
Thank you. However, it doesn't make sense: I commented out everything related to offload in zero2.json, but it still crashed (7B model) when there was not enough RAM. My device has 64 GB of RAM and I'm using a 3090 Ti for fine-tuning with LoRA.
64 GB of RAM is limited. Using a device with more than 200 GB would be better when running a 7B model.
@shizhediao Hi, how much RAM does a 34G model need?
How much do ZeRO-2 and ZeRO-3 each need? Do you have any experience with this? The dataset is on the order of 20k rows, more or less. How much memory has the community succeeded with?
A 34B model? What is its parameter count? I don't have specific experience; every model is a bit different. You can estimate it with https://deepspeed.readthedocs.io/en/latest/memory.html
OK, I'll give it a try.
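For reference, the DeepSpeed memory docs linked above provide helper functions that print the expected per-GPU and per-CPU memory for ZeRO-3 given a loaded model. A sketch (gpt2 is used only because it appears in the command earlier in this thread; swap in your own checkpoint, and note the model is fully loaded into CPU RAM just to run the estimate):

```python
# Sketch: estimate ZeRO-3 memory needs per GPU and per CPU for a given model,
# following the example in the DeepSpeed memory documentation.
from transformers import AutoModelForCausalLM
from deepspeed.runtime.zero.stage3 import (
    estimate_zero3_model_states_mem_needs_all_live,
)

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Prints memory needed for params, gradients and optimizer states under
# ZeRO-3, with and without offload, for the given GPU/node layout.
estimate_zero3_model_states_mem_needs_all_live(
    model, num_gpus_per_node=8, num_nodes=1
)
```

DeepSpeed also provides a corresponding ZeRO-2 estimator; see the same documentation page for its import path.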