Model: llama 13B
Error message:
[INFO|trainer.py:1769] 2023-05-16 17:10:32,378 >> ***** Running training *****
[INFO|trainer.py:1770] 2023-05-16 17:10:32,378 >> Num examples = 222,193
[INFO|trainer.py:1771] 2023-05-16 17:10:32,378 >> Num Epochs = 6
[INFO|trainer.py:1772] 2023-05-16 17:10:32,378 >> Instantaneous batch size per device = 2
[INFO|trainer.py:1773] 2023-05-16 17:10:32,378 >> Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:1774] 2023-05-16 17:10:32,378 >> Gradient Accumulation steps = 8
[INFO|trainer.py:1775] 2023-05-16 17:10:32,378 >> Total optimization steps = 9,000
[INFO|trainer.py:1776] 2023-05-16 17:10:32,380 >> Number of trainable parameters = 262,312,960
0%| | 0/9000 [00:00<?, ?it/s]Traceback (most recent call last):
File "/workspace/fumengen/works/Chinese-LLaMA-Alpaca/scripts/run_clm_pt_with_peft.py", line 620, in
main()
File "/workspace/fumengen/works/Chinese-LLaMA-Alpaca/scripts/run_clm_pt_with_peft.py", line 585, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
return inner_training_loop(
File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/transformers/trainer.py", line 2699, in training_step
loss = self.compute_loss(model, inputs)
File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/transformers/trainer.py", line 2731, in compute_loss
outputs = model(**inputs)
File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
output.reraise()
File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/_utils.py", line 543, in reraise
raise exception
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
output = module(*input, **kwargs)
File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/peft/peft_model.py", line 529, in forward
return self.base_model(
File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 687, in forward
outputs = self.model(
File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 530, in forward
inputs_embeds = self.embed_tokens(input_ids)
File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 160, in forward
return F.embedding(
File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/nn/functional.py", line 2210, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument index in method wrapper__index_select)
0%| | 0/9000 [00:17<?, ?it/s]
How can I fix this? For reference, here is my launch script:
data_cache=/workspace/fumengen/works/Chinese-LLaMA-Alpaca/data/data_tmp
dataset_dir=/workspace/fumengen/works/Chinese-LLaMA-Alpaca/data/cn_pretrain_data
output_dir=/workspace/fumengen/works/Chinese-LLaMA-Alpaca/outputs
pretrained_model=/workspace/fumengen/works/Chinese-LLaMA-Alpaca/models/chinese_llama_13b
chinese_tokenizer_path=/workspace/fumengen/works/Chinese-LLaMA-Alpaca/models/chinese_llama_13b
python3 scripts/run_clm_pt_with_peft.py \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${chinese_tokenizer_path} \
    --dataset_dir ${dataset_dir} \
    --data_cache_dir ${data_cache} \
    --output_dir ${output_dir} \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --max_steps 9000 \
    --learning_rate 2e-4 \
    --lora_rank 8 \
    --trainable "q_proj,v_proj" \
    --modules_to_save "embed_tokens" \
    --lora_dropout 0.05 \
    --validation_split_percentage 0.001 \
    --do_train \
    --seed 3407 \
    --fp16 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --save_steps 500 \
    --preprocessing_num_workers 8 \
    --block_size 512 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --torch_dtype float16
We recommend using DeepSpeed; you can refer to our latest pre-training launch script.
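The traceback goes through torch/nn/parallel/data_parallel.py: because the script is launched with plain python3 on a multi-GPU machine, the HF Trainer falls back to nn.DataParallel, and the replicated PEFT model ends up doing an embedding lookup across devices (cuda:0 vs cuda:1). Launching one process per GPU (torchrun with DeepSpeed) avoids this. Below is a minimal sketch of such a launch, not the repo's exact script; ds_zero2.json is a hypothetical file name, and the "auto" values are filled in by the HF Trainer's DeepSpeed integration from the corresponding command-line arguments:

# Hypothetical minimal ZeRO-2 config; see the repo's official
# pre-training script for the exact config it ships with.
cat > ds_zero2.json <<'EOF'
{
  "zero_optimization": { "stage": 2 },
  "fp16": { "enabled": "auto" },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto"
}
EOF

# Launch one process per GPU (the logs imply 8 GPUs: 128 = 2 x 8 x 8),
# so the Trainer uses DDP + DeepSpeed instead of nn.DataParallel.
torchrun --nnodes 1 --nproc_per_node 8 scripts/run_clm_pt_with_peft.py \
    --deepspeed ds_zero2.json \
    ...  # remaining flags unchanged from the command above

The --deepspeed flag is a standard TrainingArguments option, so it should be accepted by run_clm_pt_with_peft.py without code changes.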