
FSDP QLoRA

Open DinhLuan14 opened this issue 1 year ago • 1 comments

Reminder

  • [X] I have read the README and searched the existing issues.

Reproduction

```shell
accelerate launch \
    --config_file examples/accelerate/fsdp_config.yaml src/train.py \
    --stage sft \
    --lora_target q_proj,v_proj \
    --finetuning_type lora \
    --quantization_bit 4 \
    --do_train \
    --model_name_or_path /mnt/nfs-data/kilm-storage/public-llm/Meta-Llama-3-70B-Instruct \
    --dataset instruct_stage_1_max16_line_600_maxlen_306311 \
    --dataset_dir /mnt/nfs-user-data/users/luannd/kiki_creative_project/data \
    --template llama3 \
    --output_dir /mnt/nfs-user-data/users/luannd/checkpoints/v02_stage_1_max16_line_600_maxlen_306311_llama3_70B \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 600 \
    --preprocessing_num_workers 64 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 8 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --warmup_ratio 0.05 \
    --learning_rate 3e-5 \
    --num_train_epochs 3 \
    --plot_loss \
    --bf16 \
    --save_strategy epoch \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --val_size 0.02 \
    --report_to "wandb" \
    --save_total_limit 3
```

Expected behavior

I tried to run Llama-3 on a 2-GPU DGX H100 but hit this error: `AttributeError: type object 'FullyShardedDataParallelPlugin' has no attribute 'get_module_class_from_name'`.

System Info

```
05/04/2024 04:09:29 - INFO - llmtuner.model.utils.checkpointing - Gradient checkpointing enabled.
05/04/2024 04:09:29 - INFO - llmtuner.model.utils.attention - Using torch SDPA for faster training and inference.
05/04/2024 04:09:29 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
05/04/2024 04:09:30 - INFO - llmtuner.model.loader - trainable params: 16384000 || all params: 70570090496 || trainable%: 0.0232
[INFO|trainer.py:626] 2024-05-04 04:09:30,039 >> Using auto half precision backend
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/nfs-user-data/users/luannd/LLaMA-Factory/src/train.py", line 17, in <module>
[rank0]:     main()
[rank0]:   File "/mnt/nfs-user-data/users/luannd/LLaMA-Factory/src/train.py", line 8, in main
[rank0]:     run_exp()
[rank0]:   File "/mnt/nfs-user-data/users/luannd/LLaMA-Factory/src/llmtuner/train/tuner.py", line 33, in run_exp
[rank0]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]:   File "/mnt/nfs-user-data/users/luannd/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 73, in run_sft
[rank0]:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]:   File "/mnt/nfs-data/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/mnt/nfs-data/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 2001, in _inner_training_loop
[rank0]:     self._fsdp_qlora_plugin_updates()
[rank0]:   File "/mnt/nfs-data/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 4425, in _fsdp_qlora_plugin_updates
[rank0]:     fsdp_plugin.auto_wrap_policy = fsdp_auto_wrap_policy(self.model)
[rank0]:   File "/mnt/nfs-data/miniconda3/envs/llama_factory/lib/python3.10/site-packages/peft/utils/other.py", line 396, in fsdp_auto_wrap_policy
[rank0]:     transformer_cls = FullyShardedDataParallelPlugin.get_module_class_from_name(model, layer_class)
[rank0]: AttributeError: type object 'FullyShardedDataParallelPlugin' has no attribute 'get_module_class_from_name'
Loading checkpoint shards:  87%|████████▋ | 26/30 [01:11<00:11,  2.89s/it]
W0504 04:09:38.682000 140737350551360 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 4231 closing signal SIGTERM
```
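Failures like this one usually come down to a version mismatch between transformers, accelerate, and peft, so it helps to capture the exact installed versions when reporting. A minimal sketch (the package list is the one discussed in this thread):

```python
# Report installed versions of the packages involved, so mismatches
# are visible at a glance before debugging further.
import importlib.metadata as md

def report_versions(packages):
    """Return {package: version string or 'not installed'} for each name."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = md.version(pkg)
        except md.PackageNotFoundError:
            versions[pkg] = "not installed"
    return versions

for pkg, ver in report_versions(
    ["transformers", "accelerate", "peft", "bitsandbytes"]
).items():
    print(f"{pkg}=={ver}")
```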

Others

No response

DinhLuan14 avatar May 04 '24 04:05 DinhLuan14

update transformers to 4.40.1

hiyouga avatar May 04 '24 08:05 hiyouga

I'm hitting the same problem.

```
transformers==4.40.1
accelerate==0.30.0
bitsandbytes==0.43.1
```

```
  File "/app/src/llmtuner/train/tuner.py", line 39, in run_exp
    run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/app/src/llmtuner/train/dpo/workflow.py", line 61, in run_dpo
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2001, in _inner_training_loop
    self._fsdp_qlora_plugin_updates()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 4425, in _fsdp_qlora_plugin_updates
    fsdp_plugin.auto_wrap_policy = fsdp_auto_wrap_policy(self.model)
  File "/usr/local/lib/python3.10/dist-packages/peft/utils/other.py", line 396, in fsdp_auto_wrap_policy
    transformer_cls = FullyShardedDataParallelPlugin.get_module_class_from_name(model, layer_class)
AttributeError: type object 'FullyShardedDataParallelPlugin' has no attribute 'get_module_class_from_name'
```

FangLi1 avatar May 08 '24 06:05 FangLi1

> update transformers to 4.40.1

It does not work!

FangLi1 avatar May 08 '24 06:05 FangLi1

```shell
pip uninstall peft
pip install git+https://github.com/huggingface/peft.git
```

https://github.com/huggingface/peft/issues/1699#issuecomment-2085179491
https://github.com/huggingface/peft/pull/1694
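For context: the traceback shows an older peft calling `get_module_class_from_name` as a class method on accelerate's `FullyShardedDataParallelPlugin`, while newer accelerate releases expose that functionality as a standalone helper, hence the `AttributeError` until peft is updated. The helper itself just walks the module tree and returns the class whose name matches. A torch-free sketch of that behavior, with a toy `Node` class standing in for `torch.nn.Module` and a hypothetical `LlamaDecoderLayer` as the target:

```python
# Toy stand-in for torch.nn.Module: only needs a .children() iterator.
class Node:
    def __init__(self, *children):
        self._children = list(children)

    def children(self):
        return self._children

class LlamaDecoderLayer(Node):  # hypothetical transformer-layer class
    pass

def get_module_class_from_name(module, name):
    """Recursively search a module tree for a class with the given name."""
    if module.__class__.__name__ == name:
        return module.__class__
    for child in module.children():
        found = get_module_class_from_name(child, name)
        if found is not None:
            return found
    return None

model = Node(Node(LlamaDecoderLayer(), Node()))
print(get_module_class_from_name(model, "LlamaDecoderLayer") is LlamaDecoderLayer)  # True
```

FSDP's auto-wrap policy needs this lookup to know which layer class (e.g. the decoder block) to shard at, which is why the training loop calls it before wrapping the model.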

hiyouga avatar May 08 '24 07:05 hiyouga