
ds_train_finetune.sh shows no errors at all, then the run abruptly ends with [sigkill_handler] Killing subprocess

Open · hexiaojin1314 opened this issue 2 years ago · 6 comments

Is there an existing issue for this?

- [X] I have searched the existing issues

Current Behavior

Running ds_train_finetune.sh:

```
root@074e2a33256d:~/ChatGLM-6B/ptuning# sh ds_train_finetune.sh
[2023-05-16 07:01:05,716] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-16 07:01:09,857] [INFO] [runner.py:541:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=54002 --enable_each_rank_log=None main.py --deepspeed deepspeed.json --do_train --train_file /root/ChatGLM-6B/ptuning/AdvertiseGen/train.json --test_file /root/ChatGLM-6B/ptuning/AdvertiseGen/dev.json --prompt_column content --response_column summary --overwrite_cache --model_name_or_path /root/ChatGLM-6B/chatglm-6b --output_dir /root/ChatGLM-6B/ptuning/output/adgen-chatglm-6b-ft-1e-4 --overwrite_output_dir --max_source_length 64 --max_target_length 64 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 1 --predict_with_generate --max_steps 3000 --logging_steps 10 --save_steps 300 --learning_rate 1e-4 --fp16
[2023-05-16 07:01:12,088] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-05-16 07:01:12,088] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-05-16 07:01:12,088] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-05-16 07:01:12,088] [INFO] [launch.py:247:main] dist_world_size=4
[2023-05-16 07:01:12,089] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-05-16 07:01:14,659] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
05/16/2023 07:01:14 - WARNING - main - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: True
05/16/2023 07:01:14 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
05/16/2023 07:01:14 - WARNING - main - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: True
05/16/2023 07:01:15 - INFO - main - Training/evaluation parameters Seq2SeqTrainingArguments(_n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=False, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=deepspeed.json, disable_tqdm=False, do_eval=False, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=no, fp16=True, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, generation_config=None, generation_max_length=None, generation_num_beams=None, gradient_accumulation_steps=1, gradient_checkpointing=False, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=<HUB_TOKEN>, ignore_data_skip=False, include_inputs_for_metrics=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=0.0001, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=/root/ChatGLM-6B/ptuning/output/adgen-chatglm-6b-ft-1e-4/runs/May16_07-01-14_074e2a33256d, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=10, logging_strategy=steps, lr_scheduler_type=linear, max_grad_norm=1.0, max_steps=3000, metric_for_best_model=None, mp_parameters=, no_cuda=False, num_train_epochs=1, optim=adamw_hf, optim_args=None, output_dir=/root/ChatGLM-6B/ptuning/output/adgen-chatglm-6b-ft-1e-4, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=1, per_device_train_batch_size=1, predict_with_generate=True, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=<PUSH_TO_HUB_TOKEN>, ray_scope=last, remove_unused_columns=True, report_to=[], resume_from_checkpoint=None, run_name=/root/ChatGLM-6B/ptuning/output/adgen-chatglm-6b-ft-1e-4, save_on_each_node=False, save_safetensors=False, save_steps=300, save_strategy=steps, save_total_limit=None, seed=42, sharded_ddp=[], skip_memory_metrics=True, sortish_sampler=False, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.0, warmup_steps=0, weight_decay=0.0, xpu_backend=None,)
05/16/2023 07:01:15 - WARNING - main - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: True
05/16/2023 07:01:55 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-23dc05be1ca54122/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|██████████| 2/2 [00:00<00:00, 359.59it/s]
[WARNING|configuration_auto.py:925] 2023-05-16 07:01:55,058 >> Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
05/16/2023 07:01:55 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-23dc05be1ca54122/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
[WARNING|tokenization_auto.py:675] 2023-05-16 07:01:55,062 >> Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
0%|          | 0/2 [00:00<?, ?it/s]
05/16/2023 07:01:55 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-23dc05be1ca54122/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|██████████| 2/2 [00:00<00:00, 248.46it/s]
[WARNING|configuration_auto.py:925] 2023-05-16 07:01:55,073 >> Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
100%|██████████| 2/2 [00:00<00:00, 245.85it/s]
[INFO|configuration_utils.py:666] 2023-05-16 07:01:55,077 >> loading configuration file /root/ChatGLM-6B/chatglm-6b/config.json
[WARNING|configuration_auto.py:925] 2023-05-16 07:01:55,077 >> Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
[WARNING|tokenization_auto.py:675] 2023-05-16 07:01:55,080 >> Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
[INFO|configuration_utils.py:666] 2023-05-16 07:01:55,082 >> loading configuration file /root/ChatGLM-6B/chatglm-6b/config.json
[INFO|configuration_utils.py:720] 2023-05-16 07:01:55,084 >> Model config ChatGLMConfig {
  "_name_or_path": "/root/ChatGLM-6B/chatglm-6b",
  "architectures": [ "ChatGLMModel" ],
  "auto_map": {
    "AutoConfig": "configuration_chatglm.ChatGLMConfig",
    "AutoModel": "modeling_chatglm.ChatGLMForConditionalGeneration",
    "AutoModelForSeq2SeqLM": "modeling_chatglm.ChatGLMForConditionalGeneration"
  },
  "bos_token_id": 130004,
  "eos_token_id": 130005,
  "gmask_token_id": 130001,
  "hidden_size": 4096,
  "inner_hidden_size": 16384,
  "layernorm_epsilon": 1e-05,
  "mask_token_id": 130000,
  "max_sequence_length": 2048,
  "model_type": "chatglm",
  "num_attention_heads": 32,
  "num_layers": 28,
  "pad_token_id": 3,
  "position_encoding_2d": true,
  "pre_seq_len": null,
  "prefix_projection": false,
  "quantization_bit": 0,
  "torch_dtype": "float16",
  "transformers_version": "4.28.0",
  "use_cache": true,
  "vocab_size": 130528
}

[WARNING|tokenization_auto.py:675] 2023-05-16 07:01:55,085 >> Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
[INFO|tokenization_utils_base.py:1807] 2023-05-16 07:01:55,090 >> loading file ice_text.model
[INFO|tokenization_utils_base.py:1807] 2023-05-16 07:01:55,090 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:1807] 2023-05-16 07:01:55,091 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:1807] 2023-05-16 07:01:55,091 >> loading file tokenizer_config.json
05/16/2023 07:01:55 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-23dc05be1ca54122/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|██████████| 2/2 [00:00<00:00, 263.81it/s]
[WARNING|configuration_auto.py:925] 2023-05-16 07:01:55,127 >> Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
[WARNING|tokenization_auto.py:675] 2023-05-16 07:01:55,134 >> Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
[WARNING|auto_factory.py:456] 2023-05-16 07:01:55,381 >> Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
[WARNING|auto_factory.py:456] 2023-05-16 07:01:55,411 >> Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
[INFO|modeling_utils.py:2531] 2023-05-16 07:01:55,432 >> loading weights file /root/ChatGLM-6B/chatglm-6b/pytorch_model.bin.index.json
[WARNING|auto_factory.py:456] 2023-05-16 07:01:55,433 >> Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
[INFO|configuration_utils.py:575] 2023-05-16 07:01:55,433 >> Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 130004,
  "eos_token_id": 130005,
  "pad_token_id": 3,
  "transformers_version": "4.28.0"
}

[WARNING|auto_factory.py:456] 2023-05-16 07:01:55,469 >> Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Loading checkpoint shards: 100%|██████████| 8/8 [00:12<00:00, 1.60s/it]
Loading checkpoint shards: 100%|██████████| 8/8 [00:13<00:00, 1.63s/it]
[INFO|modeling_utils.py:3190] 2023-05-16 07:02:08,610 >> All model checkpoint weights were used when initializing ChatGLMForConditionalGeneration.
[INFO|modeling_utils.py:3198] 2023-05-16 07:02:08,611 >> All the weights of ChatGLMForConditionalGeneration were initialized from the model checkpoint at /root/ChatGLM-6B/chatglm-6b. If your task is similar to the task the model of the checkpoint was trained on, you can already use ChatGLMForConditionalGeneration for predictions without further training.
[INFO|modeling_utils.py:2839] 2023-05-16 07:02:08,781 >> Generation config file not found, using a generation config created from the model config.
Loading checkpoint shards: 100%|██████████| 8/8 [00:13<00:00, 1.66s/it]
Loading checkpoint shards: 100%|██████████| 8/8 [00:13<00:00, 1.65s/it]
[2023-05-16 07:03:39,302] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 28625
[2023-05-16 07:03:39,312] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 28626
[2023-05-16 07:03:39,314] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 28627
[2023-05-16 07:03:39,316] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 28628
[2023-05-16 07:03:39,316] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python3', '-u', 'main.py', '--local_rank=3', '--deepspeed', 'deepspeed.json', '--do_train', '--train_file', '/root/ChatGLM-6B/ptuning/AdvertiseGen/train.json', '--test_file', '/root/ChatGLM-6B/ptuning/AdvertiseGen/dev.json', '--prompt_column', 'content', '--response_column', 'summary', '--overwrite_cache', '--model_name_or_path', '/root/ChatGLM-6B/chatglm-6b', '--output_dir', '/root/ChatGLM-6B/ptuning/output/adgen-chatglm-6b-ft-1e-4', '--overwrite_output_dir', '--max_source_length', '64', '--max_target_length', '64', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--predict_with_generate', '--max_steps', '3000', '--logging_steps', '10', '--save_steps', '300', '--learning_rate', '1e-4', '--fp16'] exits with return code = -7
```
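One detail worth noting about the exit status: the launcher follows the subprocess convention in which a negative return code is the number of the signal that killed the worker. Signal 7 on Linux is SIGBUS, not the SIGKILL (9) that the kernel OOM killer sends, and SIGBUS from a training job typically points at a full memory-mapped tmpfs such as /dev/shm. A quick way to look up the signal names (generic shell commands, not from the original report):

```bash
# A negative return code from the launcher is the killing signal's number:
kill -l 7   # prints BUS  (SIGBUS: e.g. a write to an mmap backed by a full /dev/shm)
kill -l 9   # prints KILL (SIGKILL: what the kernel OOM killer sends)
```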

Expected Behavior

Prepare the dataset, install DeepSpeed, and have the fine-tuning run to completion.

Steps To Reproduce

1. Ubuntu environment
2. Install deepspeed
3. Fine-tune ChatGLM-6B

Environment

OS: Ubuntu 20.04
Python: 3.8
Transformers: 4.28.0
PyTorch: 1.12
CUDA Support: True

Anything else?

4 GPUs with 11 GB each. When the run was killed, I saw that free CPU memory had dropped very low, while GPU memory usage stayed minimal (nvidia-smi output below; a way to confirm what killed the processes is sketched after it).

```
[root@master-local ~]# nvidia-smi
Tue May 16 17:19:55 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:18:00.0 Off |                  N/A |
| 24%   40C    P0    63W / 250W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:3B:00.0 Off |                  N/A |
| 24%   36C    P0    54W / 250W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:86:00.0 Off |                  N/A |
| 19%   40C    P0    66W / 250W |      0MiB / 11019MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:AF:00.0 Off |                  N/A |
| 14%   47C    P0    68W / 250W |      0MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
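Whether the kernel OOM killer (or something else) terminated the workers can be checked on the host right after a crash; a minimal sketch, assuming a standard Linux setup (these commands are not from the original report):

```bash
# Any OOM-killer activity shows up in the kernel log:
dmesg -T | grep -iE 'out of memory|oom-killer|killed process'

# Watch host RAM live while ds_train_finetune.sh runs; the peak comes during
# "Loading checkpoint shards", before training even starts:
watch -n 1 free -h
```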

hexiaojin1314 · May 16 '23 07:05

It might just be that the hardware isn't up to it.

hexiaojin1314 · May 16 '23 12:05

"Killing" or "killed" usually means running out of memory (at least in every case I've seen).
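That would fit the logs above: both runs die right after "Loading checkpoint shards" completes. With full fine-tuning, each of the four launched ranks first materializes its own fp16 copy of the roughly 6.2B-parameter model in host RAM before DeepSpeed partitions anything, so a back-of-the-envelope peak (my estimate, not a figure from this thread) is:

```bash
# ~6.2e9 parameters x 2 bytes (fp16) x 4 ranks, in GB:
awk 'BEGIN { print 6.2 * 2 * 4 " GB" }'   # -> 49.6 GB of host RAM just to hold the weights
```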

runzhi214 · May 19 '23 01:05

Exactly the same problem here: the job gets killed out of nowhere, with no other error message. torch 2.0, transformers 4.29.2. The only point I can think to suspect is that this runs inside an Ubuntu Docker container that ops provisioned on a GPU server. nvidia-smi shows four 3090s, and VRAM usage during the run sits at 726 MB, so it isn't memory filling up either (a shared-memory check relevant to this setup is sketched after the log below).

Run output:

```
[INFO|tokenization_utils_base.py:1808] 2023-05-20 11:34:29,663 >> loading file ice_text.model
[INFO|tokenization_utils_base.py:1808] 2023-05-20 11:34:29,663 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:1808] 2023-05-20 11:34:29,663 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:1808] 2023-05-20 11:34:29,663 >> loading file tokenizer_config.json
[INFO|modeling_utils.py:2513] 2023-05-20 11:34:30,013 >> loading weights file ../model/0417/pytorch_model.bin.index.json
[INFO|configuration_utils.py:577] 2023-05-20 11:34:30,014 >> Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 130004,
  "eos_token_id": 130005,
  "pad_token_id": 3,
  "transformers_version": "4.29.2"
}

Loading checkpoint shards:  12%|█▎        | 1/8 [00:01<00:10, 1.49s/it]
05/20/2023 11:34:31 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-6deeb7fde503f793/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|██████████| 2/2 [00:00<00:00, 422.47it/s]
Loading checkpoint shards: 100%|██████████| 8/8 [00:08<00:00, 1.01s/it]
Loading checkpoint shards: 100%|██████████| 8/8 [00:12<00:00, 1.55s/it]
Loading checkpoint shards: 100%|██████████| 8/8 [00:12<00:00, 1.54s/it]
[INFO|modeling_utils.py:3185] 2023-05-20 11:34:42,520 >> All model checkpoint weights were used when initializing ChatGLMForConditionalGeneration.
[INFO|modeling_utils.py:3193] 2023-05-20 11:34:42,520 >> All the weights of ChatGLMForConditionalGeneration were initialized from the model checkpoint at ../model/0417. If your task is similar to the task the model of the checkpoint was trained on, you can already use ChatGLMForConditionalGeneration for predictions without further training.
[INFO|modeling_utils.py:2821] 2023-05-20 11:34:42,588 >> Generation config file not found, using a generation config created from the model config.
Loading checkpoint shards: 100%|██████████| 8/8 [00:12<00:00, 1.58s/it]
[2023-05-20 11:42:18,856] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 18729
[2023-05-20 11:42:18,857] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 18730
[2023-05-20 11:42:23,613] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 18731
[2023-05-20 11:42:23,615] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 18732
[2023-05-20 11:42:23,617] [ERROR] [launch.py:434:sigkill_handler] ['/root/anaconda3/envs/py39/bin/python', '-u', 'main.py', '--local_rank=3', '--deepspeed', 'deepspeed.json', '--do_train', '--train_file', 'data/AdvertiseGen/train.json', '--test_file', 'data/AdvertiseGen/dev.json', '--prompt_column', 'content', '--response_column', 'summary', '--overwrite_cache', '--model_name_or_path', '../model/0417', '--output_dir', './output/adgen-chatglm-6b-fullft-demo-1e-4', '--overwrite_output_dir', '--max_source_length', '64', '--max_target_length', '64', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--predict_with_generate', '--max_steps', '5000', '--logging_steps', '10', '--save_steps', '1000', '--learning_rate', '1e-4', '--fp16'] exits with return code = -7
```
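Given the Docker setup and the same return code = -7 (SIGBUS, see the note above), one thing worth ruling out inside the container is the size of /dev/shm: NCCL and the PyTorch DataLoader both allocate shared memory there, and Docker caps it at 64 MB by default. A minimal check (a suggestion, not something verified in this thread):

```bash
# Inside the container: how large is the shared-memory tmpfs?
df -h /dev/shm   # the Docker default of 64M is far too small for 4-GPU NCCL traffic
```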

nuoma · May 20 '23 11:05

One case is a plain memory problem, the other a shared-memory problem. I've run into both; adjusting them fixed it (see the sketch below).
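The comment doesn't say exactly what was adjusted; assuming the two causes named above (host memory and shared memory), a plausible adjustment is to give the container a larger /dev/shm. The flags below are standard Docker options, and `<your-image>` is a placeholder:

```bash
# Recreate the container with a larger shared-memory segment and relaxed limits:
docker run --gpus all --shm-size=16g \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    -it <your-image> bash

# If the container cannot be recreated, remount the tmpfs
# (requires a privileged container):
mount -o remount,size=16G /dev/shm
```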

Rickey-CS · Jun 13 '23 08:06

@TommyWongww Could you share how to adjust it, please?

caerusy · Jul 20 '23 06:07

> "Killing" or "killed" usually means running out of memory (at least in every case I've seen).

Do you mean CPU memory?

lhyscau · Jan 24 '24 13:01