ds_train_finetune.sh shows no errors at all, then abruptly ends with sigkill_handler] Killing subprocess
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
Running ds_train_finetune.sh:
root@074e2a33256d:~/ChatGLM-6B/ptuning# sh ds_train_finetune.sh
[2023-05-16 07:01:05,716] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-16 07:01:09,857] [INFO] [runner.py:541:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=54002 --enable_each_rank_log=None main.py --deepspeed deepspeed.json --do_train --train_file /root/ChatGLM-6B/ptuning/AdvertiseGen/train.json --test_file /root/ChatGLM-6B/ptuning/AdvertiseGen/dev.json --prompt_column content --response_column summary --overwrite_cache --model_name_or_path /root/ChatGLM-6B/chatglm-6b --output_dir /root/ChatGLM-6B/ptuning/output/adgen-chatglm-6b-ft-1e-4 --overwrite_output_dir --max_source_length 64 --max_target_length 64 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 1 --predict_with_generate --max_steps 3000 --logging_steps 10 --save_steps 300 --learning_rate 1e-4 --fp16
[2023-05-16 07:01:12,088] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-05-16 07:01:12,088] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-05-16 07:01:12,088] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-05-16 07:01:12,088] [INFO] [launch.py:247:main] dist_world_size=4
[2023-05-16 07:01:12,089] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-05-16 07:01:14,659] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
05/16/2023 07:01:14 - WARNING - main - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: True
05/16/2023 07:01:14 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
05/16/2023 07:01:14 - WARNING - main - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: True
05/16/2023 07:01:15 - INFO - main - Training/evaluation parameters Seq2SeqTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=deepspeed.json,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_config=None,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.0001,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/root/ChatGLM-6B/ptuning/output/adgen-chatglm-6b-ft-1e-4/runs/May16_07-01-14_074e2a33256d,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=10,
logging_strategy=steps,
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_steps=3000,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=1,
optim=adamw_hf,
optim_args=None,
output_dir=/root/ChatGLM-6B/ptuning/output/adgen-chatglm-6b-ft-1e-4,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=1,
per_device_train_batch_size=1,
predict_with_generate=True,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=[],
resume_from_checkpoint=None,
run_name=/root/ChatGLM-6B/ptuning/output/adgen-chatglm-6b-ft-1e-4,
save_on_each_node=False,
save_safetensors=False,
save_steps=300,
save_strategy=steps,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
sortish_sampler=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
05/16/2023 07:01:15 - WARNING - main - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: True
05/16/2023 07:01:55 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-23dc05be1ca54122/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 359.59it/s]
[WARNING|configuration_auto.py:925] 2023-05-16 07:01:55,058 >> Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
05/16/2023 07:01:55 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-23dc05be1ca54122/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
[WARNING|tokenization_auto.py:675] 2023-05-16 07:01:55,062 >> Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
0%| | 0/2 [00:00<?, ?it/s]05/16/2023 07:01:55 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-23dc05be1ca54122/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 248.46it/s]
[WARNING|configuration_auto.py:925] 2023-05-16 07:01:55,073 >> Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 245.85it/s]
[INFO|configuration_utils.py:666] 2023-05-16 07:01:55,077 >> loading configuration file /root/ChatGLM-6B/chatglm-6b/config.json
[WARNING|configuration_auto.py:925] 2023-05-16 07:01:55,077 >> Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
[WARNING|tokenization_auto.py:675] 2023-05-16 07:01:55,080 >> Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
[INFO|configuration_utils.py:666] 2023-05-16 07:01:55,082 >> loading configuration file /root/ChatGLM-6B/chatglm-6b/config.json
[INFO|configuration_utils.py:720] 2023-05-16 07:01:55,084 >> Model config ChatGLMConfig {
"_name_or_path": "/root/ChatGLM-6B/chatglm-6b",
"architectures": [
"ChatGLMModel"
],
"auto_map": {
"AutoConfig": "configuration_chatglm.ChatGLMConfig",
"AutoModel": "modeling_chatglm.ChatGLMForConditionalGeneration",
"AutoModelForSeq2SeqLM": "modeling_chatglm.ChatGLMForConditionalGeneration"
},
"bos_token_id": 130004,
"eos_token_id": 130005,
"gmask_token_id": 130001,
"hidden_size": 4096,
"inner_hidden_size": 16384,
"layernorm_epsilon": 1e-05,
"mask_token_id": 130000,
"max_sequence_length": 2048,
"model_type": "chatglm",
"num_attention_heads": 32,
"num_layers": 28,
"pad_token_id": 3,
"position_encoding_2d": true,
"pre_seq_len": null,
"prefix_projection": false,
"quantization_bit": 0,
"torch_dtype": "float16",
"transformers_version": "4.28.0",
"use_cache": true,
"vocab_size": 130528
}
[WARNING|tokenization_auto.py:675] 2023-05-16 07:01:55,085 >> Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
[INFO|tokenization_utils_base.py:1807] 2023-05-16 07:01:55,090 >> loading file ice_text.model
[INFO|tokenization_utils_base.py:1807] 2023-05-16 07:01:55,090 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:1807] 2023-05-16 07:01:55,091 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:1807] 2023-05-16 07:01:55,091 >> loading file tokenizer_config.json
05/16/2023 07:01:55 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-23dc05be1ca54122/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 263.81it/s]
[WARNING|configuration_auto.py:925] 2023-05-16 07:01:55,127 >> Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
[WARNING|tokenization_auto.py:675] 2023-05-16 07:01:55,134 >> Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
[WARNING|auto_factory.py:456] 2023-05-16 07:01:55,381 >> Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
[WARNING|auto_factory.py:456] 2023-05-16 07:01:55,411 >> Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
[INFO|modeling_utils.py:2531] 2023-05-16 07:01:55,432 >> loading weights file /root/ChatGLM-6B/chatglm-6b/pytorch_model.bin.index.json
[WARNING|auto_factory.py:456] 2023-05-16 07:01:55,433 >> Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
[INFO|configuration_utils.py:575] 2023-05-16 07:01:55,433 >> Generate config GenerationConfig {
"_from_model_config": true,
"bos_token_id": 130004,
"eos_token_id": 130005,
"pad_token_id": 3,
"transformers_version": "4.28.0"
}
[WARNING|auto_factory.py:456] 2023-05-16 07:01:55,469 >> Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:12<00:00, 1.60s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:13<00:00, 1.63s/it]
[INFO|modeling_utils.py:3190] 2023-05-16 07:02:08,610 >> All model checkpoint weights were used when initializing ChatGLMForConditionalGeneration.
[INFO|modeling_utils.py:3198] 2023-05-16 07:02:08,611 >> All the weights of ChatGLMForConditionalGeneration were initialized from the model checkpoint at /root/ChatGLM-6B/chatglm-6b.
If your task is similar to the task the model of the checkpoint was trained on, you can already use ChatGLMForConditionalGeneration for predictions without further training.
[INFO|modeling_utils.py:2839] 2023-05-16 07:02:08,781 >> Generation config file not found, using a generation config created from the model config.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:13<00:00, 1.66s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:13<00:00, 1.65s/it]
[2023-05-16 07:03:39,302] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 28625
[2023-05-16 07:03:39,312] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 28626
[2023-05-16 07:03:39,314] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 28627
[2023-05-16 07:03:39,316] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 28628
[2023-05-16 07:03:39,316] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python3', '-u', 'main.py', '--local_rank=3', '--deepspeed', 'deepspeed.json', '--do_train', '--train_file', '/root/ChatGLM-6B/ptuning/AdvertiseGen/train.json', '--test_file', '/root/ChatGLM-6B/ptuning/AdvertiseGen/dev.json', '--prompt_column', 'content', '--response_column', 'summary', '--overwrite_cache', '--model_name_or_path', '/root/ChatGLM-6B/chatglm-6b', '--output_dir', '/root/ChatGLM-6B/ptuning/output/adgen-chatglm-6b-ft-1e-4', '--overwrite_output_dir', '--max_source_length', '64', '--max_target_length', '64', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--predict_with_generate', '--max_steps', '3000', '--logging_steps', '10', '--save_steps', '300', '--learning_rate', '1e-4', '--fp16'] exits with return code = -7
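For reference: the launcher reports a negative return code when a worker dies from a signal, so return code = -7 means the process was killed by signal 7, which is SIGBUS on Linux. A quick way to confirm the mapping:

kill -l 7   # in bash: prints BUS, i.e. SIGBUS

SIGBUS while loading checkpoint shards is often a sign of exhausted shared memory (/dev/shm) or host RAM rather than a Python-level error, which would also explain why there is no traceback.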
Expected Behavior
Prepare the dataset, install DeepSpeed, and run fine-tuning successfully.
Steps To Reproduce
1. Ubuntu environment
2. Install deepspeed
3. Run ChatGLM-6B fine-tuning
Environment
OS: Ubuntu 20.04
Python: 3.8
Transformers: 4.28.0
PyTorch: 1.12
CUDA Support: True
Anything else?
4 GPUs with 11 GB each. When the run aborts, I can see free CPU memory dropping very low, while GPU memory usage stays very low (see the monitoring sketch after the nvidia-smi output below).
[root@master-local ~]# nvidia-smi
Tue May 16 17:19:55 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03 Driver Version: 470.161.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:18:00.0 Off | N/A |
| 24% 40C P0 63W / 250W | 0MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... Off | 00000000:3B:00.0 Off | N/A |
| 24% 36C P0 54W / 250W | 0MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce ... Off | 00000000:86:00.0 Off | N/A |
| 19% 40C P0 66W / 250W | 0MiB / 11019MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce ... Off | 00000000:AF:00.0 Off | N/A |
| 14% 47C P0 68W / 250W | 0MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
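To see whether host RAM is what runs out, a simple option is to watch it from a second shell while the shards load. These are generic Linux commands, not part of the repo's scripts:

# refresh host RAM/swap and the shared-memory mount once per second
watch -n 1 'free -h; df -h /dev/shm'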
It might just be that the hardware isn't up to it.
A "killing" or "killed" like this usually means insufficient memory (at least in every case I've seen).
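If it is the OOM killer, the kernel log will say so explicitly. A quick check with standard tools (nothing ChatGLM-specific):

# look for OOM-killer activity around the crash time
dmesg -T | grep -i -E 'out of memory|oom-killer|killed process'
# on systemd hosts the kernel messages are also in the journal
journalctl -k | grep -i oom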
Exactly the same problem: an inexplicable kill with no other error output. torch 2.0, transformers 4.29.2. The only thing I can think to suspect is that I'm running inside an Ubuntu Docker container that ops set up on a GPU server; nvidia-smi shows four 3090s, and VRAM usage during the run sits around 726 MiB, so it isn't VRAM filling up either.
Running it:
[INFO|tokenization_utils_base.py:1808] 2023-05-20 11:34:29,663 >> loading file ice_text.model
[INFO|tokenization_utils_base.py:1808] 2023-05-20 11:34:29,663 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:1808] 2023-05-20 11:34:29,663 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:1808] 2023-05-20 11:34:29,663 >> loading file tokenizer_config.json
[INFO|modeling_utils.py:2513] 2023-05-20 11:34:30,013 >> loading weights file ../model/0417/pytorch_model.bin.index.json
[INFO|configuration_utils.py:577] 2023-05-20 11:34:30,014 >> Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 130004,
  "eos_token_id": 130005,
  "pad_token_id": 3,
  "transformers_version": "4.29.2"
}
Loading checkpoint shards:  12%|█████████▎ | 1/8 [00:01<00:10, 1.49s/it]
05/20/2023 11:34:31 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-6deeb7fde503f793/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 422.47it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 8/8 [00:08<00:00, 1.01s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 8/8 [00:12<00:00, 1.55s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 8/8 [00:12<00:00, 1.54s/it]
[INFO|modeling_utils.py:3185] 2023-05-20 11:34:42,520 >> All model checkpoint weights were used when initializing ChatGLMForConditionalGeneration.
[INFO|modeling_utils.py:3193] 2023-05-20 11:34:42,520 >> All the weights of ChatGLMForConditionalGeneration were initialized from the model checkpoint at ../model/0417.
If your task is similar to the task the model of the checkpoint was trained on, you can already use ChatGLMForConditionalGeneration for predictions without further training.
[INFO|modeling_utils.py:2821] 2023-05-20 11:34:42,588 >> Generation config file not found, using a generation config created from the model config.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 8/8 [00:12<00:00, 1.58s/it]
[2023-05-20 11:42:18,856] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 18729
[2023-05-20 11:42:18,857] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 18730
[2023-05-20 11:42:23,613] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 18731
[2023-05-20 11:42:23,615] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 18732
[2023-05-20 11:42:23,617] [ERROR] [launch.py:434:sigkill_handler] ['/root/anaconda3/envs/py39/bin/python', '-u', 'main.py', '--local_rank=3', '--deepspeed', 'deepspeed.json', '--do_train', '--train_file', 'data/AdvertiseGen/train.json', '--test_file', 'data/AdvertiseGen/dev.json', '--prompt_column', 'content', '--response_column', 'summary', '--overwrite_cache', '--model_name_or_path', '../model/0417', '--output_dir', './output/adgen-chatglm-6b-fullft-demo-1e-4', '--overwrite_output_dir', '--max_source_length', '64', '--max_target_length', '64', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--predict_with_generate', '--max_steps', '5000', '--logging_steps', '10', '--save_steps', '1000', '--learning_rate', '1e-4', '--fp16'] exits with return code = -7
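Since this run is inside a Docker container, the container's shared-memory mount is also worth checking; Docker gives /dev/shm only 64 MiB by default, which NCCL and the DataLoader workers can exhaust. A generic check from inside the container:

df -h /dev/shm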
I've hit both: in one case it was a host-memory problem, in the other a shared-memory problem; after adjusting them it was resolved.
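For the shared-memory case, the usual adjustment is made when the container is created; the size below is only an example:

# recreate the container with a larger shared-memory segment
docker run --shm-size=16g ...
# or share the host's IPC namespace instead
docker run --ipc=host ...

For the plain host-memory case, the direction is more free RAM (or swap) on the host.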
@TommyWongww Could you share more detail on how you adjusted it?
A "killing" or "killed" like this usually means insufficient memory (at least in every case I've seen).
Do you mean CPU memory?