
[deepspeed+galore] Error in DeepSpeed with GaLore

Open · zzlgreat opened this issue 3 months ago · 2 comments

Reminder

  • [X] I have read the README and searched the existing issues.

Reproduction

When setting up DeepSpeed ZeRO-3 via `accelerate config`, the error below is raised; it looks like some compatibility issue with DeepSpeed. Also, is there a PP/TP (pipeline/tensor parallel) scheme that would allow multi-GPU GaLore training? Many thanks. DeepSpeed version: deepspeed 0.12.5+2ce6bf8c. LLaMA-Factory has been `git pull`ed to the latest version. The training shell script and log are shown below.

Expected behavior

accelerate launch src/train_bash.py \
    --stage pt \
    --model_name_or_path /DATA4T/text-generation-webui/models/Yi-34B \
    --do_train \
    --dataset wiki_demo \
    --template default \
    --finetuning_type full \
    --use_galore \
    --galore_layerwise \
    --galore_target mlp,self_attn \
    --galore_rank 128 \
    --output_dir /DATA4T/text-generation-webui/loras/deepmoney-2-34b-base-galore \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --plot_loss \
    --pure_bf16
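
For context on what `--use_galore --galore_layerwise` implies: layerwise GaLore keeps a separate small optimizer per parameter and steps it from a gradient hook instead of through one global `optimizer.step()`. The sketch below is a rough illustration of that pattern under stated assumptions, not LLaMA-Factory's actual code: it assumes `galore_torch`'s `GaLoreAdamW` and PyTorch >= 2.1 for `register_post_accumulate_grad_hook`, and the `attach_layerwise_galore` helper and its default hyperparameters are made up for the example. This per-parameter-optimizer design is presumably what sits uneasily next to DeepSpeed, which expects to own a single optimizer.

```python
# Rough sketch of layerwise GaLore: one tiny optimizer per parameter, stepped
# from a gradient hook. Assumes galore_torch's GaLoreAdamW and PyTorch >= 2.1
# (Tensor.register_post_accumulate_grad_hook). The helper name and defaults
# are illustrative only, not LLaMA-Factory's real implementation.
import torch
from galore_torch import GaLoreAdamW


def attach_layerwise_galore(model, targets=("mlp", "self_attn"),
                            rank=128, update_proj_gap=200, scale=0.25, lr=5e-5):
    optimizer_dict = {}
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        group = {"params": [param]}
        # GaLore projects 2-D weight matrices in the targeted modules;
        # everything else falls back to a plain AdamW parameter group.
        if param.dim() == 2 and any(t in name for t in targets):
            group.update(rank=rank, update_proj_gap=update_proj_gap,
                         scale=scale, proj_type="std")
        optimizer_dict[param] = GaLoreAdamW([group], lr=lr)

    def optimizer_hook(param):
        # Fires right after the gradient for `param` has been accumulated,
        # so weights are updated and gradients freed layer by layer.
        optimizer_dict[param].step()
        optimizer_dict[param].zero_grad()

    for param in optimizer_dict:
        param.register_post_accumulate_grad_hook(optimizer_hook)

    return optimizer_dict
```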

System Info

  • Accelerate config:
    - num_machines: 1
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - deepspeed_config: {'gradient_accumulation_steps': 1, 'gradient_clipping': 1.0, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero3_save_16bit_model': True, 'zero_stage': 3}
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:
03/11/2024 07:49:10 - INFO - llmtuner.train.utils - Using GaLore optimizer, may cause hanging at the start of training, wait patiently.
Traceback (most recent call last):
  File "/DATA4T/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/DATA4T/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/DATA4T/LLaMA-Factory/src/llmtuner/train/tuner.py", line 30, in run_exp
    run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/DATA4T/LLaMA-Factory/src/llmtuner/train/pt/workflow.py", line 34, in run_pt
    trainer = Trainer(
  File "/root/anaconda3/envs/llmfactory/lib/python3.10/site-packages/transformers/trainer.py", line 527, in __init__
    raise RuntimeError(
RuntimeError: Passing `optimizers` is not allowed if Deepspeed or PyTorch FSDP is enabled. You should subclass `Trainer` and override the `create_optimizer_and_scheduler` method.
[... identical GaLore notice and traceback repeated by two more ranks ...]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 7/7 [00:45<00:00,  6.56s/it]
[INFO|modeling_utils.py:3992] 2024-03-11 07:49:10,685 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:4000] 2024-03-11 07:49:10,685 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /DATA4T/text-generation-webui/models/Yi-34B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:798] 2024-03-11 07:49:10,689 >> loading configuration file /DATA4T/text-generation-webui/models/Yi-34B/generation_config.json
[INFO|configuration_utils.py:845] 2024-03-11 07:49:10,690 >> Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 0
}

03/11/2024 07:49:10 - INFO - llmtuner.model.patcher - Gradient checkpointing enabled.
03/11/2024 07:49:10 - INFO - llmtuner.model.adapter - Fine-tuning method: Full
03/11/2024 07:49:10 - INFO - llmtuner.model.loader - trainable params: 34388917248 || all params: 34388917248 || trainable%: 100.0000
/root/anaconda3/envs/llmfactory/lib/python3.10/site-packages/galore_torch/adamw.py:48: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
03/11/2024 07:49:11 - INFO - llmtuner.train.utils - Using GaLore optimizer, may cause hanging at the start of training, wait patiently.
Traceback (most recent call last):
  File "/DATA4T/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/DATA4T/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/DATA4T/LLaMA-Factory/src/llmtuner/train/tuner.py", line 30, in run_exp
    run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/DATA4T/LLaMA-Factory/src/llmtuner/train/pt/workflow.py", line 34, in run_pt
    trainer = Trainer(
  File "/root/anaconda3/envs/llmfactory/lib/python3.10/site-packages/transformers/trainer.py", line 527, in __init__
    raise RuntimeError(
RuntimeError: Passing `optimizers` is not allowed if Deepspeed or PyTorch FSDP is enabled. You should subclass `Trainer` and override the `create_optimizer_and_scheduler` method.
[2024-03-11 07:49:12,906] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2328447) of binary: /root/anaconda3/envs/llmfactory/bin/python3.10
Traceback (most recent call last):
  File "/root/anaconda3/envs/llmfactory/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/llmfactory/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/root/anaconda3/envs/llmfactory/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    deepspeed_launcher(args)
  File "/root/anaconda3/envs/llmfactory/lib/python3.10/site-packages/accelerate/commands/launch.py", line 724, in deepspeed_launcher
    distrib_run.run(args)
  File "/root/anaconda3/envs/llmfactory/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/root/anaconda3/envs/llmfactory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/llmfactory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
src/train_bash.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-11_07:49:12
  host      : zzlgreat
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2328448)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-03-11_07:49:12
  host      : zzlgreat
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 2328450)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-11_07:49:12
  host      : zzlgreat
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2328447)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
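
For anyone hitting the same wall: the RuntimeError comes straight from `transformers.Trainer.__init__`, which refuses a pre-built `optimizers=` tuple when DeepSpeed or FSDP is active and suggests subclassing `Trainer` instead. Below is a minimal sketch of that suggestion, assuming non-layerwise GaLore (a single `GaLoreAdamW` over explicit parameter groups); `GaLoreTrainer` and the hyperparameter defaults are hypothetical names for illustration, and whether layerwise GaLore can work under ZeRO-3 at all is a separate question.

```python
# Hypothetical sketch of the workaround the error message suggests: subclass
# Trainer and build the GaLore optimizer inside create_optimizer(), so nothing
# is passed through the forbidden `optimizers=` argument. Assumes galore_torch's
# GaLoreAdamW; class and parameter names here are illustrative only.
from galore_torch import GaLoreAdamW
from transformers import Trainer


class GaLoreTrainer(Trainer):
    def __init__(self, *args, galore_targets=("mlp", "self_attn"),
                 galore_rank=128, **kwargs):
        super().__init__(*args, **kwargs)
        self.galore_targets = galore_targets
        self.galore_rank = galore_rank

    def create_optimizer(self):
        if self.optimizer is None:
            galore_params, regular_params = [], []
            for name, param in self.model.named_parameters():
                if not param.requires_grad:
                    continue
                # Only 2-D weights in the targeted modules get the GaLore projection.
                if param.dim() == 2 and any(t in name for t in self.galore_targets):
                    galore_params.append(param)
                else:
                    regular_params.append(param)
            param_groups = [
                {"params": regular_params},
                {"params": galore_params, "rank": self.galore_rank,
                 "update_proj_gap": 200, "scale": 0.25, "proj_type": "std"},
            ]
            self.optimizer = GaLoreAdamW(param_groups, lr=self.args.learning_rate)
        return self.optimizer
```

Even with this pattern, `--galore_layerwise` may still need to be dropped, since DeepSpeed partitions the states of a single optimizer and has no notion of per-parameter optimizers driven by gradient hooks.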

Others

No response

zzlgreat · Mar 10 '24, 23:03