
FSDP Dora/QDora Broken

Open mallorbc opened this issue 1 year ago • 3 comments

System Info

Package Version


accelerate 0.30.1
aiohttp 3.9.5
aiosignal 1.3.1
annotated-types 0.6.0
async-timeout 4.0.3
attrs 23.2.0
bitsandbytes 0.43.1
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
datasets 2.19.1
deepspeed 0.14.2+5f631abc
dill 0.3.8
docker-pycreds 0.4.0
docstring_parser 0.16
einops 0.8.0
eval_type_backport 0.2.0
exceptiongroup 1.2.1
filelock 3.14.0
flash-attn 2.5.8
frozenlist 1.4.1
fsspec 2024.3.1
gitdb 4.0.11
GitPython 3.1.43
hf_transfer 0.1.6
hjson 3.1.0
huggingface-hub 0.23.0
idna 3.7
iniconfig 2.0.0
Jinja2 3.1.4
markdown-it-py 3.0.0
MarkupSafe 2.1.5
mdurl 0.1.2
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
networkx 3.1
ninja 1.11.1.1
numpy 1.24.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.1.105
packaging 24.0
pandas 2.0.3
peft 0.11.1.dev0
pillow 10.3.0
pip 24.0
platformdirs 4.2.2
pluggy 1.5.0
protobuf 3.20.1
psutil 5.9.8
py-cpuinfo 9.0.0
pyarrow 16.1.0
pyarrow-hotfix 0.6
pydantic 2.7.1
pydantic_core 2.18.2
Pygments 2.18.0
pynvml 11.5.0
pytest 8.2.0
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
regex 2024.5.15
requests 2.31.0
rich 13.7.1
safetensors 0.4.3
scipy 1.10.1
sentencepiece 0.2.0
sentry-sdk 2.2.0
setproctitle 1.3.3
setuptools 69.5.1
shtab 1.7.1
six 1.16.0
smmap 5.0.1
sympy 1.12
text-generation 0.7.0
tokenizers 0.19.1
tomli 2.0.1
torch 2.3.0
torchaudio 2.3.0
torchvision 0.18.0
tqdm 4.66.4
transformers 4.40.2
triton 2.3.0
trl 0.8.6
typing_extensions 4.11.0
tyro 0.8.4
tzdata 2024.1
urllib3 2.2.1
wandb 0.17.0
wheel 0.43.0
xxhash 3.4.1
yarl 1.9.4

I am using two RTX 3090s with Ubuntu 12.2.2 inside a Docker container.

Regular Lora/QLora with FSDP works.

I am not sure where this should go; either PEFT or Accelerate, I would guess.

I feel like this issue might be related to the following:

  • https://github.com/huggingface/peft/issues/1674
  • https://github.com/huggingface/accelerate/issues/2761
  • https://github.com/huggingface/peft/issues/1593#issuecomment-2116202685

Who can help?

@pacman100 @younesbelkada @BenjaminBossan

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder
  • [ ] My own task or dataset (give details below)

Reproduction

  1. Install the requirements that I have listed above. They are all the latest releases, except for PEFT, which is installed from the main branch because of another recent PR that fixed QLora.
  2. Try using Dora or QDora. You will hit one of the two kinds of errors I have found.
  3. One error is that the Dora model never appears and the run times out (or you kill it after waiting 10+ minutes).
  4. The other error is much longer; its logs are below.

Both DDP and FSDP work with regular Lora/QLora.
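
For reference, here is a minimal sketch of the PEFT-side difference between the working and failing runs (illustrative only; the actual configs are built in the test script further down). The only change is the use_dora flag on LoraConfig.

from peft import LoraConfig, TaskType

# Works under FSDP (plain Lora/QLora).
lora_config = LoraConfig(task_type=TaskType.CAUSAL_LM, r=64, lora_alpha=16, lora_dropout=0.1)

# Hangs or crashes under FSDP (Dora/QDora); the only difference is use_dora=True.
dora_config = LoraConfig(task_type=TaskType.CAUSAL_LM, r=64, lora_alpha=16, lora_dropout=0.1, use_dora=True)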

Scripts

Working Dora DDP config

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  zero3_init_flag: false
  zero_stage: 0
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Broken Dora FSDP config

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Simple Program To Test

You can use this program to see how it is broken. Running on CPU with regular Dora will be much slower, but it will still work.

import argparse

import torch
from accelerate import Accelerator
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)

parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model_name", type=str, help="model name", default="mistralai/Mistral-7B-v0.1")
parser.add_argument("-cpu", "--cpu", action="store_true", help="use cpu", default=False)
parser.add_argument("-flash", "--flash", action="store_true", help="use flash attention", default=False)
parser.add_argument("-dora", "--dora", action="store_true", help="use dora", default=False)
parser.add_argument("-int4", "--int4", action="store_true", help="use int4", default=False)
parser.add_argument("-accelerate", "--accelerate", action="store_true", help="use accelerate", default=False)
args = parser.parse_args()
model_name = args.model_name

config_kwargs = {
    "trust_remote_code": True,
}
config = AutoConfig.from_pretrained(model_name, **config_kwargs)
config.use_cache = False
config.gradient_checkpointing = True

# Device placement: no device map on CPU, one device per process under
# accelerate, otherwise let transformers spread the model with "auto".
if args.cpu:
    kwargs = {"device_map": None}
elif args.accelerate:
    kwargs = {}
    device_index = Accelerator().process_index
    device_map = {"": device_index}
    kwargs["device_map"] = device_map
else:
    kwargs = {"device_map": "auto"}

# Optional 4-bit quantization config for the QLora/QDora runs.
if not args.int4:
    bnb_config = None
else:
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_storage=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float32,
    )

target_modules = ['up_proj', 'lm_head', 'q_proj', 'gate_proj', 'o_proj', 'k_proj', 'v_proj', 'down_proj']
torch_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
    torch_dtype=torch_dtype,
    config=config,
    attn_implementation="flash_attention_2" if args.flash else None,
    **kwargs,
)

# The only switch between Lora and Dora is use_dora.
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=target_modules,
    modules_to_save=None,
    use_dora=args.dora,
)
model = get_peft_model(model, peft_config)
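
As a quick diagnostic (not part of the original script), the lines below can be appended after get_peft_model to list the parameter dtypes. The assumption here is that with use_dora=True the DoRA magnitude parameters (named along the lines of lora_magnitude_vector) end up in float32 next to the bfloat16 base weights, which is the dtype mix that FSDP later refuses to flatten.

from collections import Counter

# How many parameters exist per dtype; with fsdp_use_orig_params: false,
# FSDP needs a single dtype within each wrapped block.
print(Counter(p.dtype for p in model.parameters()))

# Which parameter names carry float32 (expected to be the DoRA magnitudes, if any).
for name, param in model.named_parameters():
    if param.dtype == torch.float32:
        print(name)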

Example Uses And Current Results

  • FSDP Lora: time accelerate launch --config_file accelerate_config_fsdp.yaml test_dora.py -accelerate -flash (11.85 seconds)
  • FSDP QLora: time accelerate launch --config_file accelerate_config_fsdp.yaml test_dora.py -accelerate -flash -int4 (16.86 seconds)
  • DDP Lora: time accelerate launch --config_file accelerate_config_ddp.yaml test_dora.py -accelerate -flash (12.84 seconds)
  • DDP QLora: time accelerate launch --config_file accelerate_config_ddp.yaml test_dora.py -accelerate -flash -int4 (12.85 seconds)

  • FSDP Dora: time accelerate launch --config_file accelerate_config_fsdp.yaml test_dora.py -dora -accelerate -flash (killed after waiting 5+ minutes)
  • FSDP QDora: time accelerate launch --config_file accelerate_config_fsdp.yaml test_dora.py -dora -accelerate -flash -int4 (killed after waiting 5+ minutes)
  • DDP Dora: time accelerate launch --config_file accelerate_config_ddp.yaml test_dora.py -dora -accelerate -flash (12.83 seconds)
  • DDP QDora: time accelerate launch --config_file accelerate_config_ddp.yaml test_dora.py -dora -accelerate -flash -int4 (12.85 seconds)

  • Regular Lora: time python test_dora.py -flash (6.92 seconds)
  • Regular Dora: time python test_dora.py -flash -dora (6.99 seconds)
  • Regular QLora: time python test_dora.py -flash -int4 (7.45 seconds)
  • Regular QDora: time python test_dora.py -flash -dora -int4 (7.52 seconds)
  • Regular Lora CPU: time python test_dora.py -flash -cpu (6.886 seconds)
  • Regular QLora CPU: time python test_dora.py -flash -cpu -int4 (7.16 seconds)
  • Regular Dora CPU: time python test_dora.py -flash -cpu -dora (killed after 10+ minutes, but I have gotten this working before, or at least I am pretty sure)
  • Regular QDora CPU: time python test_dora.py -flash -cpu -dora --int4 (7.10 seconds)

Expected behavior

I would expect the same behavior as with regular Lora/QLora, meaning that training occurs successfully and the sample script runs.

mallorbc avatar May 17 '24 00:05 mallorbc

[rank0]: Traceback (most recent call last):
[rank0]:   File "trl_finetune.py", line 401, in <module>
[rank0]:     trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/trl/trainer/sft_trainer.py", line 361, in train
[rank0]:     output = super().train(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1859, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2002, in _inner_training_loop
[rank0]:     self.model = self.accelerator.prepare(self.model)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1292, in prepare
[rank0]:     result = tuple(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1293, in <genexpr>
[rank0]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1169, in _prepare_one
[rank0]:     return self.prepare_model(obj, device_placement=device_placement)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1459, in prepare_model
[rank0]:     model = FSDP(model, **kwargs)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 485, in __init__
[rank0]:     _auto_wrap(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
[rank0]:     _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs)  # type: ignore[arg-type]
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
[rank0]:     wrapped_child, num_wrapped_params = _recursive_wrap(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
[rank0]:     wrapped_child, num_wrapped_params = _recursive_wrap(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
[rank0]:     wrapped_child, num_wrapped_params = _recursive_wrap(
[rank0]:   [Previous line repeated 2 more times]
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
[rank0]:     return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
[rank0]:     return wrapper_cls(module, **kwargs)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 511, in __init__
[rank0]:     _init_param_handle_from_module(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_init_utils.py", line 598, in _init_param_handle_from_module
[rank0]:     _init_param_handle_from_params(state, managed_params, fully_sharded_module)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_init_utils.py", line 610, in _init_param_handle_from_params
[rank0]:     handle = FlatParamHandle(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 582, in __init__
[rank0]:     self._init_flat_param_and_metadata(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 632, in _init_flat_param_and_metadata
[rank0]:     ) = self._validate_tensors_to_flatten(params)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_flat_param.py", line 770, in _validate_tensors_to_flatten
[rank0]:     raise ValueError(
[rank0]: ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32

Map: 100%|██████████| 20201/20201 [00:01<00:00, 14172.58 examples/s]
Map: 100%|██████████| 3541/3541 [00:00<00:00, 14188.14 examples/s]
/usr/local/lib/python3.8/dist-packages/trl/trainer/sft_trainer.py:318: UserWarning: You passed a tokenizer with padding_side not equal to right to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding tokenizer.padding_side = 'right' to your code.
  warnings.warn(

Rank 1 fails at the same point with an identical traceback, ending in the same error:

[rank1]: ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32
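
The failing check can be mimicked without FSDP or a process group. The sketch below is illustrative only; it assumes the mixed dtypes come from a float32 DoRA magnitude parameter sitting next to bfloat16 base weights inside one FSDP-wrapped block (with fsdp_use_orig_params: false, all parameters of a wrapped block must share a single dtype to be flattened).

import torch

def validate_uniform_dtype(params):
    # Rough stand-in for FSDP's uniform-dtype validation before it flattens
    # a wrapped module's parameters into one buffer.
    dtypes = {p.dtype for p in params}
    if len(dtypes) != 1:
        raise ValueError(f"Must flatten tensors with uniform dtype but got {dtypes}")

base_weight = torch.nn.Parameter(torch.zeros(8, 8, dtype=torch.bfloat16))  # stand-in for a bf16 base weight
magnitude = torch.nn.Parameter(torch.ones(8, dtype=torch.float32))         # stand-in for a float32 DoRA magnitude vector
validate_uniform_dtype([base_weight, magnitude])  # raises the same kind of ValueError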

mallorbc avatar May 17 '24 01:05 mallorbc

The accelerate issue you mentioned sounds very similar. Do you see the same error when using Q-LoRA (i.e. without DoRA)? Could you try downgrading accelerate and check if this resolves the error?

This info would be really useful to have. If it still breaks, but only with DoRA, it could be a DoRA+FSDP issue, possibly related to the use of nn.ParameterDict.

BenjaminBossan avatar May 17 '24 11:05 BenjaminBossan

I have no issues using Lora or QLora with FSDP when I install certain versions of the software stack. Naively installing the latest release of everything does not work at this time. With the software versions I listed above, both the sample script I provided and a more complex training program work.

I can try downgrading accelerate to 0.29.3 later (when my QLora FSDP training run is finished).

I have tried PEFT from the main branch with the latest release of everything else. This allowed me to train FSDP with Lora/QLora.

Another combination that worked is using the latest released version of PEFT with accelerate 0.29.3. Using the main branch install of PEFT did not fix that, as you can see in the other issue.

So the options to get FSDP QLora working are:

  • PEFT main with the latest release of everything else
  • accelerate<=0.29.3 with the latest release of everything else

What I will try: accelerate<=0.29.3 with PEFT main installed and the latest for everything else.

I will share what I find when my system is idle to test these things.
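
Since the working and broken setups differ only in package versions, here is a small helper (illustrative only) to record exactly which combination a given run used:

import importlib.metadata as md

# Print the versions of the packages that differ between the working and broken setups.
for pkg in ("peft", "accelerate", "transformers", "trl", "bitsandbytes", "torch"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")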

mallorbc avatar May 17 '24 19:05 mallorbc

Update: DoRA and QDoRA training with FSDP should be fixed in #1806. If you install from the latest PEFT main, it should thus work. Please also check the PR description for how this was tested.

BenjaminBossan avatar May 31 '24 14:05 BenjaminBossan

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

github-actions[bot] avatar Jun 24 '24 15:06 github-actions[bot]