# FSDP QLoRA
### Reminder

- [x] I have read the README and searched the existing issues.
### Reproduction

```shell
accelerate launch \
    --config_file examples/accelerate/fsdp_config.yaml \
    src/train.py \
    --stage sft \
    --lora_target q_proj,v_proj \
    --finetuning_type lora \
    --quantization_bit 4 \
    --do_train \
    --model_name_or_path /mnt/nfs-data/kilm-storage/public-llm/Meta-Llama-3-70B-Instruct \
    --dataset instruct_stage_1_max16_line_600_maxlen_306311 \
    --dataset_dir /mnt/nfs-user-data/users/luannd/kiki_creative_project/data \
    --template llama3 \
    --output_dir /mnt/nfs-user-data/users/luannd/checkpoints/v02_stage_1_max16_line_600_maxlen_306311_llama3_70B \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 600 \
    --preprocessing_num_workers 64 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 8 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --warmup_ratio 0.05 \
    --learning_rate 3e-5 \
    --num_train_epochs 3 \
    --plot_loss \
    --bf16 \
    --save_strategy epoch \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --val_size 0.02 \
    --report_to "wandb" \
    --save_total_limit 3
```
### Expected behavior

I tried to run Llama-3 70B on a 2-GPU DGX H100, but training fails with: `AttributeError: type object 'FullyShardedDataParallelPlugin' has no attribute 'get_module_class_from_name'`.
System Info
05/04/2024 04:09:29 - INFO - llmtuner.model.utils.checkpointing - Gradient checkpointing enabled.
05/04/2024 04:09:29 - INFO - llmtuner.model.utils.attention - Using torch SDPA for faster training and inference.
05/04/2024 04:09:29 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
05/04/2024 04:09:30 - INFO - llmtuner.model.loader - trainable params: 16384000 || all params: 70570090496 || trainable%: 0.0232
[INFO|trainer.py:626] 2024-05-04 04:09:30,039 >> Using auto half precision backend
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/nfs-user-data/users/luannd/LLaMA-Factory/src/train.py", line 17, in
### Others

No response
---

Update transformers to 4.40.1.
---

I am hitting the same problem with `transformers==4.40.1`, `accelerate==0.30.0`, `bitsandbytes==0.43.1`:

```text
  File "/app/src/llmtuner/train/tuner.py", line 39, in run_exp
    run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/app/src/llmtuner/train/dpo/workflow.py", line 61, in run_dpo
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2001, in _inner_training_loop
    self._fsdp_qlora_plugin_updates()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 4425, in _fsdp_qlora_plugin_updates
    fsdp_plugin.auto_wrap_policy = fsdp_auto_wrap_policy(self.model)
  File "/usr/local/lib/python3.10/dist-packages/peft/utils/other.py", line 396, in fsdp_auto_wrap_policy
    transformer_cls = FullyShardedDataParallelPlugin.get_module_class_from_name(model, layer_class)
AttributeError: type object 'FullyShardedDataParallelPlugin' has no attribute 'get_module_class_from_name'
```
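For context, the attribute peft fails to find here is a helper that walks the model tree and returns the transformer layer class matching a given name, which FSDP then uses to build its auto-wrap policy; in recent accelerate releases it is no longer exposed as a method of `FullyShardedDataParallelPlugin`, which is what older peft code still assumes. A minimal sketch of that lookup, using toy stand-ins for `torch.nn.Module` (the stand-in classes are illustrative, not accelerate's actual code):

```python
class _Node:
    """Toy stand-in for torch.nn.Module: only .children() is needed here."""

    def __init__(self, *children):
        self._children = list(children)

    def children(self):
        return iter(self._children)


class DecoderLayer(_Node):
    pass


class Model(_Node):
    pass


def get_module_class_from_name(module, name):
    """Recursively search a module tree for the submodule class whose
    __name__ matches `name`; return None if no submodule matches."""
    if module.__class__.__name__ == name:
        return module.__class__
    for child in module.children():
        found = get_module_class_from_name(child, name)
        if found is not None:
            return found
    return None


model = Model(DecoderLayer(), DecoderLayer())
print(get_module_class_from_name(model, "DecoderLayer"))  # the DecoderLayer class
```

The crash above is purely a lookup-location mismatch between peft and accelerate, not a problem with the lookup logic itself, which is why a version fix rather than a config change resolves it.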
---

> update transformers to 4.40.1

Updating transformers to 4.40.1 does not work!
---

```shell
pip uninstall peft
pip install git+https://github.com/huggingface/peft.git
```

See https://github.com/huggingface/peft/issues/1699#issuecomment-2085179491 and https://github.com/huggingface/peft/pull/1694.
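After reinstalling, it can help to confirm which versions pip actually resolved before rerunning training. A small sketch using only the standard library (the package list mirrors this thread; `resolved_versions` is a hypothetical helper name, not part of any of these libraries):

```python
from importlib.metadata import PackageNotFoundError, version


def resolved_versions(packages):
    """Return a mapping of distribution name -> installed version string,
    with None for any distribution that is not installed."""
    found = {}
    for pkg in packages:
        try:
            found[pkg] = version(pkg)
        except PackageNotFoundError:
            found[pkg] = None
    return found


print(resolved_versions(("peft", "accelerate", "transformers", "bitsandbytes")))
```

If `peft` still reports a release older than the linked fix, the git install above did not take effect in the active environment.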