
Full fine-tuning of llama-3-70B with the latest code raises an error

Open · heshuguo opened this issue 10 months ago · 1 comment

Reminder

  • [X] I have read the README and searched the existing issues.

Reproduction

Command: deepspeed --num_gpus=8 src/train_bash.py --stage sft --model_name_or_path /train/Llama-3-70B --do_train --dataset thp --finetuning_type full --output_dir llama3_0419 --overwrite_cache --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --gradient_accumulation_steps 8 --preprocessing_num_workers 16 --lr_scheduler_type cosine --logging_steps 10 --save_steps 10 --eval_steps 10 --val_size 1000 --learning_rate 5e-6 --max_grad_norm 0.5 --num_train_epochs 3.0 --evaluation_strategy steps --load_best_model_at_end --plot_loss --bf16 --template default --deepspeed deepspeed_3.json
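(The contents of deepspeed_3.json are not included in the report. As a rough sketch only, a standard ZeRO stage-3 config with "auto" values resolved by the HF Trainer typically looks like the following; the actual file used here may differ.)

{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": { "enabled": "auto" },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto"
}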

[INFO|trainer.py:2057] 2024-04-20 09:12:27,424 >> Number of trainable parameters = 70,553,706,496
  0%|          | 0/123 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/train_new/github/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/train_new/github/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/train_new/github/LLaMA-Factory/src/llmtuner/train/tuner.py", line 33, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/train_new/github/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 71, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/root/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/root/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 2203, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/root/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 3147, in training_step
    self.accelerator.backward(loss)
  File "/root/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/accelerator.py", line 2007, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/root/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 175, in backward
    self.engine.step()
  File "/root/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2169, in step
    self._take_model_step(lr_kwargs)
  File "/root/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2075, in _take_model_step
    self.optimizer.step()
  File "/root/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2047, in step
    self.unscale_and_clip_grads(sub_group_id, scaled_global_grad_norm)
  File "/root/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2117, in unscale_and_clip_grads
    self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Expected behavior

No response

System Info

No response

Others

No response

DeepSpeed version: 0.14.0

heshuguo · Apr 20 '24 01:04

Try a different DeepSpeed version.
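(For example, pinning an older release; the thread does not name a specific version, so 0.13.5 below is only an illustration:)

pip install deepspeed==0.13.5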

hiyouga · Apr 20 '24 02:04

https://github.com/hiyouga/LLaMA-Factory/issues/2493#issuecomment-1950971296

hiyouga · Apr 21 '24 16:04