[deepspeed+galore] Error in DeepSpeed with GaLore
Reminder
- [X] I have read the README and searched the existing issues.
Reproduction
When DeepSpeed ZeRO-3 is configured via `accelerate config`, the following error is raised, which looks like a compatibility issue with DeepSpeed. Also, is there a PP/TP scheme that would allow multi-GPU GaLore training? Many thanks. DeepSpeed version: deepspeed 0.12.5+2ce6bf8c. LLaMA-Factory has been updated via git pull to the latest version. The training shell script and log are shown below.
Expected behavior
```sh
accelerate launch src/train_bash.py \
--stage pt \
--model_name_or_path /DATA4T/text-generation-webui/models/Yi-34B \
--do_train \
--dataset wiki_demo \
--template default \
--finetuning_type full \
--use_galore \
--galore_layerwise \
--galore_target mlp,self_attn \
--galore_rank 128 \
--output_dir /DATA4T/text-generation-webui/loras/deepmoney-2-34b-base-galore \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 1024 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--save_steps 100 \
--eval_steps 100 \
--evaluation_strategy steps \
--load_best_model_at_end \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--max_samples 3000 \
--val_size 0.1 \
--plot_loss \
--pure_bf16
```
System Info
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {'gradient_accumulation_steps': 1, 'gradient_clipping': 1.0, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero3_save_16bit_model': True, 'zero_stage': 3}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
- PyTorch version (GPU?): 2.1.2+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
```
03/11/2024 07:49:10 - INFO - llmtuner.train.utils - Using GaLore optimizer, may cause hanging at the start of training, wait patiently.
Traceback (most recent call last):
File "/DATA4T/LLaMA-Factory/src/train_bash.py", line 14, in <module>
main()
File "/DATA4T/LLaMA-Factory/src/train_bash.py", line 5, in main
run_exp()
File "/DATA4T/LLaMA-Factory/src/llmtuner/train/tuner.py", line 30, in run_exp
run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
File "/DATA4T/LLaMA-Factory/src/llmtuner/train/pt/workflow.py", line 34, in run_pt
trainer = Trainer(
File "/root/anaconda3/envs/llmfactory/lib/python3.10/site-packages/transformers/trainer.py", line 527, in __init__
raise RuntimeError(
RuntimeError: Passing `optimizers` is not allowed if Deepspeed or PyTorch FSDP is enabled. You should subclass `Trainer` and override the `create_optimizer_and_scheduler` method.
[... the same GaLore info message and traceback are repeated verbatim by two more ranks ...]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 7/7 [00:45<00:00, 6.56s/it]
[INFO|modeling_utils.py:3992] 2024-03-11 07:49:10,685 >> All model checkpoint weights were used when initializing LlamaForCausalLM.
[INFO|modeling_utils.py:4000] 2024-03-11 07:49:10,685 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /DATA4T/text-generation-webui/models/Yi-34B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:798] 2024-03-11 07:49:10,689 >> loading configuration file /DATA4T/text-generation-webui/models/Yi-34B/generation_config.json
[INFO|configuration_utils.py:845] 2024-03-11 07:49:10,690 >> Generate config GenerationConfig {
"bos_token_id": 1,
"eos_token_id": 2,
"pad_token_id": 0
}
03/11/2024 07:49:10 - INFO - llmtuner.model.patcher - Gradient checkpointing enabled.
03/11/2024 07:49:10 - INFO - llmtuner.model.adapter - Fine-tuning method: Full
03/11/2024 07:49:10 - INFO - llmtuner.model.loader - trainable params: 34388917248 || all params: 34388917248 || trainable%: 100.0000
/root/anaconda3/envs/llmfactory/lib/python3.10/site-packages/galore_torch/adamw.py:48: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
03/11/2024 07:49:11 - INFO - llmtuner.train.utils - Using GaLore optimizer, may cause hanging at the start of training, wait patiently.
Traceback (most recent call last):
File "/DATA4T/LLaMA-Factory/src/train_bash.py", line 14, in <module>
main()
File "/DATA4T/LLaMA-Factory/src/train_bash.py", line 5, in main
run_exp()
File "/DATA4T/LLaMA-Factory/src/llmtuner/train/tuner.py", line 30, in run_exp
run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
File "/DATA4T/LLaMA-Factory/src/llmtuner/train/pt/workflow.py", line 34, in run_pt
trainer = Trainer(
File "/root/anaconda3/envs/llmfactory/lib/python3.10/site-packages/transformers/trainer.py", line 527, in __init__
raise RuntimeError(
RuntimeError: Passing `optimizers` is not allowed if Deepspeed or PyTorch FSDP is enabled. You should subclass `Trainer` and override the `create_optimizer_and_scheduler` method.
[2024-03-11 07:49:12,906] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2328447) of binary: /root/anaconda3/envs/llmfactory/bin/python3.10
Traceback (most recent call last):
File "/root/anaconda3/envs/llmfactory/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/root/anaconda3/envs/llmfactory/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/root/anaconda3/envs/llmfactory/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
deepspeed_launcher(args)
File "/root/anaconda3/envs/llmfactory/lib/python3.10/site-packages/accelerate/commands/launch.py", line 724, in deepspeed_launcher
distrib_run.run(args)
File "/root/anaconda3/envs/llmfactory/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/root/anaconda3/envs/llmfactory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/llmfactory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
src/train_bash.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-03-11_07:49:12
host : zzlgreat
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2328448)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-03-11_07:49:12
host : zzlgreat
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 2328450)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-03-11_07:49:12
host : zzlgreat
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2328447)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
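The RuntimeError is transformers refusing a pre-built optimizer when DeepSpeed (or FSDP) is enabled: LLaMA-Factory constructs the GaLore optimizer in llmtuner.train.utils and hands it to `Trainer(optimizers=...)`, which trainer.py rejects at init. The message itself points at the workaround: build the optimizer inside a `Trainer` subclass so DeepSpeed can wrap it itself. Below is a minimal sketch of that pattern, not LLaMA-Factory's actual code; it assumes the `GaLoreAdamW` import documented in the GaLore README and mirrors the `--galore_target mlp,self_attn` / `--galore_rank 128` flags above, while `update_proj_gap`, `scale`, and `proj_type` are illustrative README defaults, not values from this run. Overriding `create_optimizer` is usually enough, since `create_optimizer_and_scheduler` calls it.

```python
# Minimal sketch of the workaround the error message suggests: instead of
# passing a pre-built GaLore optimizer via `Trainer(optimizers=...)` (which
# transformers rejects under DeepSpeed/FSDP), subclass Trainer and build the
# optimizer inside `create_optimizer`.
from transformers import Trainer
from galore_torch import GaLoreAdamW  # import path per the GaLore README


class GaLoreTrainer(Trainer):
    def create_optimizer(self):
        if self.optimizer is None:
            # Split parameters: 2-D weights inside the GaLore target modules
            # get the low-rank projection; everything else is optimized normally.
            galore_params, regular_params = [], []
            for name, param in self.model.named_parameters():
                if not param.requires_grad:
                    continue
                if param.ndim == 2 and any(t in name for t in ("mlp", "self_attn")):
                    galore_params.append(param)
                else:
                    regular_params.append(param)
            param_groups = [
                {"params": regular_params},
                # rank mirrors --galore_rank 128; update_proj_gap / scale /
                # proj_type are illustrative values from the GaLore README.
                {"params": galore_params, "rank": 128,
                 "update_proj_gap": 200, "scale": 0.25, "proj_type": "std"},
            ]
            self.optimizer = GaLoreAdamW(param_groups, lr=self.args.learning_rate)
        return self.optimizer
```

Note that this only addresses the constructor error. `--galore_layerwise` installs per-parameter optimizers through gradient hooks (per the GaLore README), and whether that interacts correctly with ZeRO-3 parameter partitioning is a separate question.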
Others
No response