LLaMA-Factory
After fine-tuning Qwen1.5-72B, loading the model occasionally fails with safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer
Reminder
- [X] I have read the README and searched the existing issues.
Reproduction
Loading checkpoint shards:  30%|███       | 9/30 [00:31<01:14, 3.53s/it]
2024-03-22T02:25:50.767326683Z Traceback (most recent call last):
2024-03-22T02:25:50.767862627Z     model = AutoModelForCausalLM.from_pretrained(
2024-03-22T02:25:50.767865840Z   File "/usr/local/lib/python3.10/dist-packages/modelscope/utils/hf_util.py", line 113, in from_pretrained
2024-03-22T02:25:50.767916380Z     module_obj = module_class.from_pretrained(model_dir, *model_args,
2024-03-22T02:25:50.767919696Z   File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
2024-03-22T02:25:50.768007439Z     return model_class.from_pretrained(
2024-03-22T02:25:50.768010504Z   File "/usr/local/lib/python3.10/dist-packages/modelscope/utils/hf_util.py", line 76, in from_pretrained
2024-03-22T02:25:50.768030485Z     return ori_from_pretrained(cls, model_dir, *model_args, **kwargs)
2024-03-22T02:25:50.768033991Z   File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3502, in from_pretrained
2024-03-22T02:25:50.768397494Z     ) = cls._load_pretrained_model(
2024-03-22T02:25:50.768400673Z   File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3903, in _load_pretrained_model
2024-03-22T02:25:50.768776297Z     state_dict = load_state_dict(shard_file)
2024-03-22T02:25:50.768779551Z   File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 505, in load_state_dict
2024-03-22T02:25:50.768832747Z     with safe_open(checkpoint_file, framework="pt") as f:
2024-03-22T02:25:50.768868897Z safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer
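To narrow down which shard is bad, a minimal diagnostic sketch that opens every saved shard with safetensors and reports the one that fails to deserialize. The checkpoint directory path below is an assumption; substitute your actual $OUTPUT_PATH.

import glob
import os

from safetensors import safe_open

CHECKPOINT_DIR = "/path/to/output_dir"  # assumed path, replace with the actual $OUTPUT_PATH

for shard in sorted(glob.glob(os.path.join(CHECKPOINT_DIR, "*.safetensors"))):
    try:
        # Reading the header is enough to trigger MetadataIncompleteBuffer on a damaged shard
        with safe_open(shard, framework="pt") as f:
            num_tensors = len(f.keys())
        print(f"OK    {os.path.basename(shard)} ({num_tensors} tensors)")
    except Exception as exc:
        print(f"FAIL  {os.path.basename(shard)}: {exc}")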
Expected behavior
Training finished without any errors; the MetadataIncompleteBuffer error only appears occasionally when loading the fine-tuned model afterwards.
System Info
No response
Others
No response
accelerate launch --config_file $ACCELERATE_CONFIG_FILE --num_processes $NUM_PROCESSES --num_machines $WORLD_SIZE --machine_rank $RANK --main_process_ip $MASTER_ADDR --main_process_port $MASTER_PORT \
    src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path $MODEL_PATH \
    --dataset alpaca_zh \
    --template qwen \
    --finetuning_type full \
    --output_dir $OUTPUT_PATH \
    --overwrite_cache \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_ratio 0.03 \
    --save_steps 1000 \
    --max_grad_norm 1.0 \
    --learning_rate 5e-6 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --bf16 \
    --flash_attn \
    --overwrite_output_dir \
    --cutoff_len 16384 \
    --ddp_timeout 1800000 \
    --gradient_checkpointing True
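Since the error means the recorded header is longer than what could be read from the file, one way to check whether a shard was truncated while being written (e.g. a node ran out of disk space or a rank was killed mid-save) is to compare each file's size with the header length stored in its first 8 bytes. A rough sketch, again assuming a placeholder output directory path:

import glob
import os
import struct

CHECKPOINT_DIR = "/path/to/output_dir"  # assumed path, replace with the actual $OUTPUT_PATH

for shard in sorted(glob.glob(os.path.join(CHECKPOINT_DIR, "*.safetensors"))):
    file_size = os.path.getsize(shard)
    with open(shard, "rb") as f:
        # A safetensors file starts with an 8-byte little-endian length of its JSON header
        (header_len,) = struct.unpack("<Q", f.read(8))
    # A file shorter than 8 + header_len cannot even hold its own header,
    # which matches the MetadataIncompleteBuffer error above
    status = "ok" if file_size >= 8 + header_len else "TRUNCATED"
    print(f"{status:9s} {os.path.basename(shard)}  size={file_size}  header={header_len}")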
May I ask how much hardware you used for full fine-tuning of the 72B model here? Was it multi-node multi-GPU, e.g. 2 nodes with 8×80GB GPUs each?