v1.1: DeepSpeed fine-tuning fails with RuntimeError: output tensor must have the same type as input tensor
### Is there an existing issue for this?
- [X] I have searched the existing issues
### Current Behavior
(pytorchzdy) [work@gpu-2 chat_generate]$ sh dp_train_glm.sh
[2023-05-17 14:37:02,196] [INFO] [runner.py:299:parse_resource_filter] removing 0 from gpu-2
[2023-05-17 14:37:02,197] [INFO] [runner.py:299:parse_resource_filter] removing 1 from gpu-2
[2023-05-17 14:37:02,197] [INFO] [runner.py:299:parse_resource_filter] removing 2 from gpu-2
[2023-05-17 14:37:02,197] [INFO] [runner.py:299:parse_resource_filter] removing 3 from gpu-2
[2023-05-17 14:37:02,197] [INFO] [runner.py:299:parse_resource_filter] removing 4 from gpu-2
[2023-05-17 14:37:02,197] [INFO] [runner.py:299:parse_resource_filter] removing 5 from gpu-2
[2023-05-17 14:37:14,982] [INFO] [runner.py:454:main] Using IP address of 192.168.10.82 for node gpu-2
[2023-05-17 14:37:14,982] [INFO] [runner.py:550:main] cmd = /home/work/.conda/envs/pytorchzdy/bin/python -u -m deepspeed.launcher.launch --world_info=eyJncHUtMiI6IFs2LCA3XX0= --master_addr=192.168.10.82 --master_port=29500 --enable_each_rank_log=None dp_finetune.py --deepspeed ./config/deepspeed/ds_glm.json --model chatglm --model_path ./chatglm-6b --data_path data/instinwild_ch.json --max_datasets_size 10000 --max_len 128 --lora_rank 0 --pre_seq_len 128 --logging_steps 10 --num_train_epochs 1 --learning_rate 2e-2 --output_dir ./output/chatglm-6b --gradient_accumulation_steps 1 --per_device_train_batch_size 2 --per_device_eval_batch_size 1 --predict_with_generate --max_steps 3000 --save_steps 1000 --grad_checkpointing
[2023-05-17 14:37:18,127] [INFO] [launch.py:142:main] WORLD INFO DICT: {'gpu-2': [6, 7]}
[2023-05-17 14:37:18,127] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-05-17 14:37:18,127] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'gpu-2': [0, 1]})
[2023-05-17 14:37:18,127] [INFO] [launch.py:162:main] dist_world_size=2
[2023-05-17 14:37:18,127] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=6,7
[2023-05-17 14:37:25,621] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[INFO] [05/17/2023 14:37:28] [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:1 to store for rank: 0
[INFO] [05/17/2023 14:37:28] [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:1 to store for rank: 1
[INFO] [05/17/2023 14:37:28] [torch.distributed.distributed_c10d] Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[INFO] [05/17/2023 14:37:28] [torch.distributed.distributed_c10d] Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[INFO] [05/17/2023 14:37:28] [main] Loading model, config and tokenizer ...
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
[INFO] [05/17/2023 14:37:28] [main] Loading model, config and tokenizer ...
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
[INFO] [05/17/2023 14:37:28] [main] Use P-Tuning v2 to fine-tune model
Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
[INFO] [05/17/2023 14:37:28] [main] Use P-Tuning v2 to fine-tune model
Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
[2023-05-17 14:37:41,891] [INFO] [partition_parameters.py:415:__exit__] finished initializing model with 6.74B parameters
Loading checkpoint shards: 0%| | 0/8 [00:00<?, ?it/s]/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:21<00:00, 2.69s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:21<00:00, 2.69s/it]
[INFO] [05/17/2023 14:38:03] [main] Loading dataset ...
[INFO] [05/17/2023 14:38:03] [dataset.data_loader] Building chaglm dataloaders
[INFO] [05/17/2023 14:38:03] [dataset.chat_dataset] Loading json data: data/instinwild_ch.json
[INFO] [05/17/2023 14:38:03] [main] Loading dataset ...
[INFO] [05/17/2023 14:38:03] [dataset.data_loader] Building chaglm dataloaders
[INFO] [05/17/2023 14:38:03] [dataset.chat_dataset] Loading json data: data/instinwild_ch.json
[WARNING] [05/17/2023 14:38:04] [datasets.builder] Found cached dataset json (/home/work/.cache/huggingface/datasets/json/default-a8d4b15460af874d/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 259.85it/s]
[INFO] [05/17/2023 14:38:04] [dataset.chat_dataset] Loaded 51504 examples.
[INFO] [05/17/2023 14:38:04] [dataset.chat_dataset] Limiting dataset to 10000 examples.
[INFO] [05/17/2023 14:38:04] [dataset.chat_dataset] Formatting ChatGLM inputs ...
[WARNING] [05/17/2023 14:38:04] [datasets.builder] Found cached dataset json (/home/work/.cache/huggingface/datasets/json/default-a8d4b15460af874d/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 266.51it/s]
[INFO] [05/17/2023 14:38:04] [dataset.chat_dataset] Loaded 51504 examples.
[INFO] [05/17/2023 14:38:04] [dataset.chat_dataset] Limiting dataset to 10000 examples.
[INFO] [05/17/2023 14:38:04] [dataset.chat_dataset] Formatting ChatGLM inputs ...
[INFO] [05/17/2023 14:38:06] [dataset.chat_dataset] Tokenizing inputs ...
Dataset: 0%| | 0/10000 [00:00<?, ?it/s][INFO] [05/17/2023 14:38:07] [dataset.chat_dataset] Tokenizing inputs ...
Dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:05<00:00, 1812.93it/s]
[input_ids]:
[5, 64286, 12, 64157, 68896, 64185, 66731, 79046, 64230, 69551, 63823, 4, 67342, 12, 130001, 130004, 5, 64276, 67446, 66731, 79046, 64230, 69551, 78511, 66136, 71738, 6, 83710, 63878, 63979, 83428, 63871, 71738, 6, 64651, 87699, 66265, 64748, 6, 66544, 66351, 64950, 63824, 81883, 63826, 71821, 84329, 63825, 70797, 63823, 66018, 6, 63878, 74017, 64351, 97701, 63944, 80799, 64288, 64786, 63944, 66057, 64258, 6, 64529, 91081, 6, 64024, 71377, 63835, 6, 64542, 77280, 6, 64287, 66715, 63841, 64987, 65878, 86089, 6, 63899, 80732, 66372, 64700, 6, 97827, 69514, 74470, 63823, 130005, 130005, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
[inputs] :
问:请讲解如何缓解上班族病的症状。
答: 一种有效的缓解上班族病的症状方法是出去散步,每天晚上可以花几个小时去散步,减少坐姿固定的时间,放松肩痛、腰痛和背部综合症的发作。另外,可以试着利用午休时间或者其他空余时间锻炼一下,比如慢跑,打太极拳等,帮助舒缓,运动释放时也可以练习深呼吸,这能帮助消除压力,更有利于解除病症。
[label_ids]:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 130004, 5, 64276, 67446, 66731, 79046, 64230, 69551, 78511, 66136, 71738, 6, 83710, 63878, 63979, 83428, 63871, 71738, 6, 64651, 87699, 66265, 64748, 6, 66544, 66351, 64950, 63824, 81883, 63826, 71821, 84329, 63825, 70797, 63823, 66018, 6, 63878, 74017, 64351, 97701, 63944, 80799, 64288, 64786, 63944, 66057, 64258, 6, 64529, 91081, 6, 64024, 71377, 63835, 6, 64542, 77280, 6, 64287, 66715, 63841, 64987, 65878, 86089, 6, 63899, 80732, 66372, 64700, 6, 97827, 69514, 74470, 63823, 130005, 130005, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100]
[labels] :
<image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100> 一种有效的缓解上班族病的症状方法是出去散步,每天晚上可以花几个小时去散步,减少坐姿固定的时间,放松肩痛、腰痛和背部综合症的发作。另外,可以试着利用午休时间或者其他空余时间锻炼一下,比如慢跑,打太极拳等,帮助舒缓,运动释放时也可以练习深呼吸,这能帮助消除压力,更有利于解除病症。<image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100>
[INFO] [05/17/2023 14:38:12] [dataset.data_loader] Loaded 9000 training examples, 1000 evaluation examples
[input_ids]:
[5, 64286, 12, 64157, 64201, 73848, 70522, 71039, 70022, 71529, 12, 7457, 63824, 11329, 63824, 4218, 63824, 3802, 63824, 49241, 63823, 4, 67342, 12, 130001, 130004, 11632, 63824, 49241, 63824, 3802, 63824, 11329, 63824, 4218, 130005, 130005, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
[inputs] :
问:请按字母顺序排列下列单词:apple、dog、tree、cat、banana。
答: apple、banana、cat、dog、tree
[label_ids]:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 130004, 11632, 63824, 49241, 63824, 3802, 63824, 11329, 63824, 4218, 130005, 130005, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100]
[labels ]:
<image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100> apple、banana、cat、dog、tree<image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100>
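The dumps above show standard causal-LM label masking: prompt tokens and padding are replaced with -100 in `label_ids` so only the response span contributes to the loss (the `<image_-100>` strings appear to be how the tokenizer renders the -100 ids). A rough sketch of equivalent logic, assuming the repo's `dataset.chat_dataset` does something like this (pad id 3 is taken from the dump):

```python
IGNORE_INDEX = -100  # positions the loss function skips
PAD_ID = 3           # pad token id visible in the input_ids dump above

def build_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    # Keep only the response tokens: mask the prompt prefix and the padding tail.
    labels = [IGNORE_INDEX] * prompt_len + input_ids[prompt_len:]
    return [IGNORE_INDEX if tok == PAD_ID else tok for tok in labels]
```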
[INFO] [05/17/2023 14:38:12] [main] Start to train ...
[INFO] [05/17/2023 14:38:12] [main] Training argments: Seq2SeqTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=./config/deepspeed/ds_glm.json,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_config=None,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.02,
length_column_name=length,
load_best_model_at_end=False,
local_rank=1,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=./output/chatglm-6b/runs/May17_14-37-25_gpu-2,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=10,
logging_strategy=steps,
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_steps=3000,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=1.0,
optim=adamw_hf,
optim_args=None,
output_dir=./output/chatglm-6b,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=1,
per_device_train_batch_size=2,
predict_with_generate=True,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard', 'wandb'],
resume_from_checkpoint=None,
run_name=./output/chatglm-6b,
save_on_each_node=False,
save_safetensors=False,
save_steps=1000,
save_strategy=steps,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
sortish_sampler=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
[INFO] [05/17/2023 14:38:12] [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:2 to store for rank: 1
Dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:05<00:00, 1778.27it/s]
[input_ids]:
[5, 64286, 12, 64157, 68896, 64185, 66731, 79046, 64230, 69551, 63823, 4, 67342, 12, 130001, 130004, 5, 64276, 67446, 66731, 79046, 64230, 69551, 78511, 66136, 71738, 6, 83710, 63878, 63979, 83428, 63871, 71738, 6, 64651, 87699, 66265, 64748, 6, 66544, 66351, 64950, 63824, 81883, 63826, 71821, 84329, 63825, 70797, 63823, 66018, 6, 63878, 74017, 64351, 97701, 63944, 80799, 64288, 64786, 63944, 66057, 64258, 6, 64529, 91081, 6, 64024, 71377, 63835, 6, 64542, 77280, 6, 64287, 66715, 63841, 64987, 65878, 86089, 6, 63899, 80732, 66372, 64700, 6, 97827, 69514, 74470, 63823, 130005, 130005, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
[inputs] :
问:请讲解如何缓解上班族病的症状。
答: 一种有效的缓解上班族病的症状方法是出去散步,每天晚上可以花几个小时去散步,减少坐姿固定的时间,放松肩痛、腰痛和背部综合症的发作。另外,可以试着利用午休时间或者其他空余时间锻炼一下,比如慢跑,打太极拳等,帮助舒缓,运动释放时也可以练习深呼吸,这能帮助消除压力,更有利于解除病症。
[label_ids]:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 130004, 5, 64276, 67446, 66731, 79046, 64230, 69551, 78511, 66136, 71738, 6, 83710, 63878, 63979, 83428, 63871, 71738, 6, 64651, 87699, 66265, 64748, 6, 66544, 66351, 64950, 63824, 81883, 63826, 71821, 84329, 63825, 70797, 63823, 66018, 6, 63878, 74017, 64351, 97701, 63944, 80799, 64288, 64786, 63944, 66057, 64258, 6, 64529, 91081, 6, 64024, 71377, 63835, 6, 64542, 77280, 6, 64287, 66715, 63841, 64987, 65878, 86089, 6, 63899, 80732, 66372, 64700, 6, 97827, 69514, 74470, 63823, 130005, 130005, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100]
[labels] :
<image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100> 一种有效的缓解上班族病的症状方法是出去散步,每天晚上可以花几个小时去散步,减少坐姿固定的时间,放松肩痛、腰痛和背部综合症的发作。另外,可以试着利用午休时间或者其他空余时间锻炼一下,比如慢跑,打太极拳等,帮助舒缓,运动释放时也可以练习深呼吸,这能帮助消除压力,更有利于解除病症。<image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100>
[INFO] [05/17/2023 14:38:12] [dataset.data_loader] Loaded 9000 training examples, 1000 evaluation examples
[input_ids]:
[5, 64286, 12, 64157, 64201, 73848, 70522, 71039, 70022, 71529, 12, 7457, 63824, 11329, 63824, 4218, 63824, 3802, 63824, 49241, 63823, 4, 67342, 12, 130001, 130004, 11632, 63824, 49241, 63824, 3802, 63824, 11329, 63824, 4218, 130005, 130005, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
[inputs] :
问:请按字母顺序排列下列单词:apple、dog、tree、cat、banana。
答: apple、banana、cat、dog、tree
[label_ids]:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 130004, 11632, 63824, 49241, 63824, 3802, 63824, 11329, 63824, 4218, 130005, 130005, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100]
[labels ]:
<image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100> apple、banana、cat、dog、tree<image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100><image_-100>
[INFO] [05/17/2023 14:38:12] [main] Start to train ...
[INFO] [05/17/2023 14:38:12] [main] Training argments: Seq2SeqTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=./config/deepspeed/ds_glm.json,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_config=None,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.02,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=./output/chatglm-6b/runs/May17_14-37-25_gpu-2,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=10,
logging_strategy=steps,
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_steps=3000,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=1.0,
optim=adamw_hf,
optim_args=None,
output_dir=./output/chatglm-6b,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=1,
per_device_train_batch_size=2,
predict_with_generate=True,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard', 'wandb'],
resume_from_checkpoint=None,
run_name=./output/chatglm-6b,
save_on_each_node=False,
save_safetensors=False,
save_steps=1000,
save_strategy=steps,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
sortish_sampler=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
[INFO] [05/17/2023 14:38:12] [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:2 to store for rank: 0
[INFO] [05/17/2023 14:38:12] [torch.distributed.distributed_c10d] Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 2 nodes.
[INFO] [05/17/2023 14:38:12] [torch.distributed.distributed_c10d] Rank 1: Completed store-based barrier for key:store_based_barrier_key:2 with 2 nodes.
Using /home/work/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/work/.cache/torch_extensions/py310_cu116/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.46748924255371094 seconds
Using /home/work/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/work/.cache/torch_extensions/py310_cu116/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.444105863571167 seconds
Using /home/work/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Emitting ninja build file /home/work/.cache/torch_extensions/py310_cu116/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.020650863647460938 seconds
Using /home/work/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Emitting ninja build file /home/work/.cache/torch_extensions/py310_cu116/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.01944255828857422 seconds
Parameter Offload: Total persistent parameters: 1499136 in 226 params
Using /home/work/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0006818771362304688 seconds
/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
Traceback (most recent call last):
File "/home/work/liuwc/chat_generate/dp_finetune.py", line 321, in
### Expected Behavior
Training runs normally.
### Steps To Reproduce
- P-Tuning v2 is used to fine-tune the model.
- The trainer is built as `trainer = Seq2SeqTrainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=valid_dataset, tokenizer=tokenizer, data_collator=data_collator, compute_metrics=None)` (a sketch follows this list).
- `trainer.train()` then raises the error above.
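A minimal sketch of this setup, reconstructed from the launch command in the log above. The dataset and collator objects are placeholders (the repo builds them in `dataset.data_loader`), so treat this as an approximation of `dp_finetune.py`, not its actual code:

```python
# Hedged reconstruction of the failing setup; only flags visible in the
# launch command above are used, everything else is a placeholder.
from transformers import (
    AutoModel,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("./chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("./chatglm-6b", trust_remote_code=True)

training_args = Seq2SeqTrainingArguments(
    output_dir="./output/chatglm-6b",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=1,
    learning_rate=2e-2,
    logging_steps=10,
    max_steps=3000,
    save_steps=1000,
    num_train_epochs=1,
    predict_with_generate=True,
    deepspeed="./config/deepspeed/ds_glm.json",  # config shown under "Anything else?"
)

# Placeholders: the repo builds these in dataset.data_loader (see the log);
# substitute real tokenized datasets before running.
train_dataset = None
valid_dataset = None
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=None,
)
trainer.train()  # fails with the RuntimeError from the title when launched via deepspeed
```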
### Environment
PyTorch version: 1.13.1
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A
OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 6.1.0
Clang version: Could not collect
CMake version: version 3.26.1
Libc version: glibc-2.17
Python version: 3.10.10 (main, Mar 21 2023, 18:45:11) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-862.6.3.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.6.55
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce GTX 1080 Ti
GPU 1: NVIDIA GeForce GTX 1080 Ti
GPU 2: NVIDIA GeForce GTX 1080 Ti
GPU 3: NVIDIA GeForce GTX 1080 Ti
GPU 4: NVIDIA GeForce GTX 1080 Ti
GPU 5: NVIDIA GeForce GTX 1080 Ti
GPU 6: NVIDIA GeForce GTX 1080 Ti
GPU 7: NVIDIA GeForce GTX 1080 Ti
Nvidia driver version: 510.85.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] pytorch-accelerated==0.1.45
[pip3] pytorch-pretrained-bert==0.6.2
[pip3] torch==1.13.1
[pip3] torchaudio==0.13.1
[pip3] torchvision==0.14.1
[conda] blas 1.0 mkl
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py310h7f8727e_0
[conda] mkl_fft 1.3.1 py310hd6ae3a3_0
[conda] mkl_random 1.2.2 py310h00e6091_0
[conda] numpy 1.23.5 py310hd5efca6_0
[conda] numpy-base 1.23.5 py310h8e6c178_0
[conda] pytorch 1.13.1 py3.10_cuda11.6_cudnn8.3.2_0 pytorch
[conda] pytorch-accelerated 0.1.45 pypi_0 pypi
[conda] pytorch-cuda 11.6 h867d48c_1 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] pytorch-pretrained-bert 0.6.2 pypi_0 pypi
[conda] torchaudio 0.13.1 py310_cu116 pytorch
[conda] torchvision 0.14.1 py310_cu116 pytorch
### Anything else?
{ "train_micro_batch_size_per_gpu": "auto", "zero_allow_untested_optimizer": true, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "fp16": { "enabled": "auto", "loss_scale": 0, "initial_scale_power": 16, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": "auto" }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "offload_param": { "device": "cpu", "pin_memory": true }, "allgather_partitions": true, "allgather_bucket_size": 5e8, "reduce_scatter": true, "contiguous_gradients" : true, "overlap_comm": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "stage3_gather_16bit_weights_on_model_save": true } }
I have the same problem: https://github.com/microsoft/DeepSpeed/issues/3654