                        FSDP Dora/QDora Broken
System Info
Package Version
accelerate 0.30.1
aiohttp 3.9.5
aiosignal 1.3.1
annotated-types 0.6.0
async-timeout 4.0.3
attrs 23.2.0
bitsandbytes 0.43.1
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
datasets 2.19.1
deepspeed 0.14.2+5f631abc
dill 0.3.8
docker-pycreds 0.4.0
docstring_parser 0.16
einops 0.8.0
eval_type_backport 0.2.0
exceptiongroup 1.2.1
filelock 3.14.0
flash-attn 2.5.8
frozenlist 1.4.1
fsspec 2024.3.1
gitdb 4.0.11
GitPython 3.1.43
hf_transfer 0.1.6
hjson 3.1.0
huggingface-hub 0.23.0
idna 3.7
iniconfig 2.0.0
Jinja2 3.1.4
markdown-it-py 3.0.0
MarkupSafe 2.1.5
mdurl 0.1.2
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
networkx 3.1
ninja 1.11.1.1
numpy 1.24.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.1.105
packaging 24.0
pandas 2.0.3
peft 0.11.1.dev0
pillow 10.3.0
pip 24.0
platformdirs 4.2.2
pluggy 1.5.0
protobuf 3.20.1
psutil 5.9.8
py-cpuinfo 9.0.0
pyarrow 16.1.0
pyarrow-hotfix 0.6
pydantic 2.7.1
pydantic_core 2.18.2
Pygments 2.18.0
pynvml 11.5.0
pytest 8.2.0
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
regex 2024.5.15
requests 2.31.0
rich 13.7.1
safetensors 0.4.3
scipy 1.10.1
sentencepiece 0.2.0
sentry-sdk 2.2.0
setproctitle 1.3.3
setuptools 69.5.1
shtab 1.7.1
six 1.16.0
smmap 5.0.1
sympy 1.12
text-generation 0.7.0
tokenizers 0.19.1
tomli 2.0.1
torch 2.3.0
torchaudio 2.3.0
torchvision 0.18.0
tqdm 4.66.4
transformers 4.40.2
triton 2.3.0
trl 0.8.6
typing_extensions 4.11.0
tyro 0.8.4
tzdata 2024.1
urllib3 2.2.1
wandb 0.17.0
wheel 0.43.0
xxhash 3.4.1
yarl 1.9.4
I am using two RTX 3090s on Ubuntu 12.2.2 inside a Docker container.
Regular Lora/QLora with FSDP works.
I'm not sure where this should go; either PEFT or Accelerate, I would guess.
I feel like this issue might be related to the following:
https://github.com/huggingface/peft/issues/1674
https://github.com/huggingface/accelerate/issues/2761
https://github.com/huggingface/peft/issues/1593#issuecomment-2116202685
Who can help?
@pacman100 @younesbelkada @BenjaminBossan
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder
- [ ] My own task or dataset (give details below)
Reproduction
- Install the requirements that I have. They are all the latest releases except for PEFT, which is installed from the main branch because of another recent PR that fixed QLoRA.
- Try using DoRA or QDoRA. You will hit one of the two kinds of errors I have found.
- One error is that the DoRA model never appears and the run times out (or you kill it after waiting 10+ minutes); a small timeout sketch is included below.
- The other error is much longer; logs are below.
Both DDP and FSDP work with regular Lora/QLora
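To make that first failure mode easier to catch, here is a small helper (my addition, not part of the original repro); call_with_timeout is a hypothetical name, and it relies on POSIX SIGALRM, so it may not interrupt a hang that happens inside a native collective:
import signal

def call_with_timeout(fn, seconds=300):
    # My addition: turn an indefinite hang into an explicit error after `seconds`.
    def _handler(signum, frame):
        raise TimeoutError(f"call did not finish within {seconds} seconds")
    previous = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)
    try:
        return fn()
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, previous)

# Example (hypothetical): model = call_with_timeout(lambda: get_peft_model(model, peft_config))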
Scripts
Working Dora DDP config
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  zero3_init_flag: false
  zero_stage: 0
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Broken Dora FSDP config
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Simple Program To Test
You can use this program to reproduce the breakage. Running regular DoRA on CPU will be much slower, but it should still work.
import torch
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, AutoConfig
import argparse
from accelerate import Accelerator
parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model_name", type=str, help="model name",default="mistralai/Mistral-7B-v0.1")
parser.add_argument("-cpu", "--cpu", action="store_true", help="use cpu",default=False)
parser.add_argument("-flash", "--flash", action="store_true", help="use flash",default=False)
parser.add_argument("-dora", "--dora", action="store_true", help="use dora",default=False)
parser.add_argument("-int4", "--int4", action="store_true", help="use int4",default=False)
parser.add_argument("-accelerate", "--accelerate", action="store_true", help="use accelerate",default=False)
args = parser.parse_args()
model_name = args.model_name
config_kwargs = {
    "trust_remote_code": True,
}
config = AutoConfig.from_pretrained(model_name, **config_kwargs)
config.use_cache = False
config.gradient_checkpointing = True
if args.cpu:
    kwargs = {"device_map":None}
elif args.accelerate:
    kwargs = {}
    device_index = Accelerator().process_index
    device_map = {"": device_index}
    kwargs["device_map"] = device_map
else:
    kwargs = {"device_map":"auto"}
if not args.int4:
    bnb_config = None
else:
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_storage=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float32,
    )
target_modules = ['up_proj', 'lm_head', 'q_proj', 'gate_proj', 'o_proj', 'k_proj', 'v_proj', 'down_proj']
torch_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
    torch_dtype=torch_dtype,
    config=config,
    attn_implementation="flash_attention_2" if args.flash else None,
    **kwargs,
)
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=target_modules,
    modules_to_save=None,
    use_dora=args.dora,
)
model = get_peft_model(model, peft_config)
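Not part of the original script, but appending a quick check after get_peft_model makes it obvious whether the call ever returned (the broken FSDP DoRA runs below apparently never get this far):
model.print_trainable_parameters()  # my addition: only reached if get_peft_model returns
print("PEFT model constructed OK")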
Example Uses And Current Results
FSDP Lora
time accelerate launch --config_file accelerate_config_fsdp.yaml test_dora.py  -accelerate -flash
11.85 seconds
FSDP QLora
time accelerate launch --config_file accelerate_config_fsdp.yaml test_dora.py  -accelerate -flash -int4
16.86 seconds
DDP Lora
time accelerate launch --config_file accelerate_config_ddp.yaml test_dora.py  -accelerate -flash
12.84 seconds
DDP QLora
time accelerate launch --config_file accelerate_config_ddp.yaml test_dora.py  -accelerate -flash -int4
12.85 seconds
FSDP Dora
time accelerate launch --config_file accelerate_config_fsdp.yaml test_dora.py -dora  -accelerate -flash
killed after waiting 5+ minutes
FSDP QDora
time accelerate launch --config_file accelerate_config_fsdp.yaml test_dora.py -dora  -accelerate -flash -int4
killed after waiting 5+ minutes
DDP Dora
time accelerate launch --config_file accelerate_config_ddp.yaml test_dora.py -dora  -accelerate -flash
12.83 seconds
DDP QDora
time accelerate launch --config_file accelerate_config_ddp.yaml test_dora.py -dora  -accelerate -flash -int4
12.85 seconds
Regular Lora
time python test_dora.py -flash
6.92 seconds
Regular Dora
time python test_dora.py -flash -dora
6.99 seconds
Regular QLora
time python test_dora.py -flash  -int4
7.45 seconds
Regular QDora
time python test_dora.py -flash -dora -int4
7.52 seconds
Regular Lora CPU
time python test_dora.py -flash -cpu
6.886 seconds
Regular QLora CPU
time python test_dora.py -flash -cpu -int4
7.16 seconds
Regular Dora CPU
time python test_dora.py -flash -cpu -dora
killed after waiting 10+ minutes, though I am fairly sure I have gotten this working before
Regular QDora CPU
time python test_dora.py -flash -cpu -dora --int4
7.10 seconds
Expected behavior
I would expect the same behavior as regular LoRA/QLoRA: training runs successfully and the sample script completes.
Error log (truncated when copying)
[rank0]: Traceback (most recent call last):
[rank0]:   File "trl_finetune.py", line 401, in 
[rank1]: Traceback (most recent call last):
[rank1]:   File "trl_finetune.py", line 401, in 
(interleaved warning) ... padding_side not equal to right to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding tokenizer.padding_side = 'right' to your code.
  warnings.warn(
The accelerate issue you mentioned sounds very similar. Do you see the same error when using Q-LoRA (i.e. without DoRA)? Could you try downgrading accelerate and check if this resolves the error?
This info would be really useful to have. If it still breaks, but only with DoRA, it could be a DoRA+FSDP issue, possibly related to the use of nn.ParameterDict.
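To help check that hypothesis, here is a small diagnostic sketch (my addition, not something from the thread); the lora_magnitude_vector name and the container types are assumptions about PEFT's current internals:
import torch.nn as nn

def inspect_dora_containers(model):
    # Hypothetical helper (my addition): list the containers holding DoRA-specific
    # parameters and their types, to see whether an nn.ParameterDict is involved.
    for name, module in model.named_modules():
        if "lora_magnitude_vector" in name and isinstance(module, (nn.ParameterDict, nn.ModuleDict)):
            print(name, type(module).__name__, list(module.keys()))
Calling this right after get_peft_model in the test script above would show where the DoRA parameters end up.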
I have no issues using LoRA or QLoRA with FSDP when I install certain versions of the software stack. Naively installing the latest release of everything will not work at this time. With the software versions listed above, both the sample script I provided and a more complex training program work for these combinations.
I can try downgrading accelerate to 0.29.3 later (when my training with QLoRA FSDP is finished).
I have tried PEFT from the main branch with the latest release of everything else. This allowed me to train FSDP with Lora/QLora.
Another combination that worked is the latest released version of PEFT with accelerate 0.29.3. Using the main branch install of PEFT did not fix that, as you can see in the other issue.
So the options to get FSDP QLoRA working are:
- PEFT main with the latest release of everything else
- accelerate<=0.29.3 with the latest release of everything else
What I will try next: accelerate<=0.29.3 together with PEFT main and the latest release of everything else.
I will share what I find when my system is idle to test these things.
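When reporting the results of those combinations, a small snippet like this (my addition) makes it unambiguous which versions a given run actually used:
from importlib.metadata import PackageNotFoundError, version

# My addition: print the exact versions of the packages most relevant to this issue.
for pkg in ("peft", "accelerate", "transformers", "trl", "bitsandbytes", "torch"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")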
Update: DoRA and QDoRA training with FSDP should be fixed in #1806. If you install from the latest PEFT main, it should thus work. Please also check the PR description on how this was tested.