Different versions seem to have an impact on the results

Open passby111 opened this issue 1 year ago • 23 comments

Hi! I tried LoRA fine-tuning with a Llama 2 model. However, when I use peft==0.9.0, the loss is always NaN; when I use peft==0.3.0, the loss is normal. I'm curious whether there are significant differences in LoRA between the different versions?

passby111 avatar Mar 03 '24 12:03 passby111

Hi @passby111, can you share more about your experiment? What is your training setup? Also, do you face the same issue with other PEFT versions, e.g. 0.8.2?

younesbelkada avatar Mar 04 '24 01:03 younesbelkada

Thank you for the response! I am trying to train Llama with multimodal data.

torch == 2.1.0+cu118
torchvision == 0.16.0+cu118
transformers == 4.30.2

self.llama_model = LlamaForCausalLM.from_pretrained(
    args.llama_model,
    torch_dtype=torch.float16,
)
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM, inference_mode=False, r=args.llm_r, lora_alpha=args.llm_alpha, lora_dropout=args.lora_dropout
)
self.llama_model = get_peft_model(self.llama_model, peft_config)
...
outputs = self.llama_model(
    inputs_embeds=inputs_embeds,
    attention_mask=attention_mask,
    return_dict=True,
    labels=targets,
)
loss = outputs.loss

When I use peft==0.9.0, the loss is always NaN. When I use peft==0.3.0, the loss is normal at the beginning of training, but after about 1000 iterations NaN values start to appear, alternating with normal values. I also tried peft==0.8.2, where the loss is always NaN. On another machine with torch==1.12.1+cu113, torchvision==0.13.1+cu113, transformers==4.30.2, and peft==0.3.0, the loss is always normal.
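For anyone reproducing this, a minimal diagnostic sketch (illustrative, not part of the original setup) that flags the first iteration at which the loss turns non-finite; the outputs/loss names follow the snippet above:

import torch

def check_loss(loss: torch.Tensor, step: int) -> None:
    # Print a warning the moment the loss becomes non-finite (NaN or inf).
    if not torch.isfinite(loss):
        print(f"step {step}: non-finite loss ({loss.item()})")

# inside the training loop shown above (illustrative only):
# loss = outputs.loss
# check_loss(loss, step)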

passby111 avatar Mar 04 '24 02:03 passby111

Thanks for the details. Unfortunately, this is not enough for us to pinpoint what the exact issue could be. I have two suggestions:

  • Take the usual steps to mitigate NaN losses, e.g. lowering the learning rate (see the sketch below).
  • Check the PEFT versions in more detail, i.e. all the other versions between 0.3.0 and 0.8.2. Which is the first version where training starts breaking? Knowing this might help us identify the root cause.
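A short sketch of the first point (the tiny model and data below are placeholders standing in for the real setup; gradient clipping is included as another common NaN mitigation, not something from the original comment):

import torch
from torch import nn

model = nn.Linear(8, 2)                                      # stand-in for the PEFT-wrapped model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)   # try a lower learning rate

x, y = torch.randn(4, 8), torch.randint(0, 2, (4,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip exploding gradients
optimizer.step()
optimizer.zero_grad()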

BenjaminBossan avatar Mar 04 '24 12:03 BenjaminBossan

Same problem here: with peft==0.9 the loss is NaN, while with peft<=0.7 the loss is normal.

skylooop avatar Mar 19 '24 14:03 skylooop

@skylooop Without the full code and data, we cannot start debugging this issue. If you cannot share, can you identify the exact version of PEFT at which training starts to break? Is it independent of hyper-parameters?

BenjaminBossan avatar Mar 19 '24 16:03 BenjaminBossan

@skylooop Without the full code and data, we cannot start debugging this issue. If you cannot share, can you identify the exact version of PEFT at which training starts to break? Is it independent of hyper-parameters?

I tried to run fine-tuning on llama-2-7b from the SliceGPT repo https://github.com/microsoft/TransformerCompression/tree/main without any changes to the code or the authors' default parameters. I'm using torch==2.2.1 with the latest version of peft (0.9) and observed that the loss instantly goes to 0 with grad_norm = nan after the first iteration. Then I tried other versions of peft and found that it works on peft < 0.7. I checked this behaviour on two different machines with V100 and RTX 3090 cards in bf16/fp32 regimes. Lowering the learning rate and changing the optimizer or scheduler did not solve the problem.

So I am not sure whether it is a problem with the SliceGPT implementation or with PEFT.

skylooop avatar Mar 19 '24 22:03 skylooop

Thanks for the pointer. I tried replicating your issue using the fine-tuning script but ran out of memory. I tried lowering some hyper-parameters but still no luck, and when I try to switch to a smaller model, I get an error each time that it's not supported. However, I saw that the repo pins PEFT at v0.6.0, so I would use that version to be sure.

If you or someone else can provide a full reproducer for us to investigate, it would be great.

BenjaminBossan avatar Mar 20 '24 10:03 BenjaminBossan

@passby111 Hi, have you solved this problem yet?

I found the same problem when trying to PEFT fine-tune CodeLlama-7B (using LlamaForSequenceClassification): the loss is always 0 during fine-tuning.

Thanks!

sssszh avatar Apr 03 '24 14:04 sssszh

I found the same problem when trying to PEFT fine-tune CodeLlama-7B (using LlamaForSequenceClassification): the loss is always 0 during fine-tuning.

Is that also with the MS compression library? Do you also see that training works with v0.6.0 but not the current version? If your situation is not exactly the same, please open a new issue instead and give us as many details as possible so that we can try to reproduce the error.

BenjaminBossan avatar Apr 03 '24 14:04 BenjaminBossan

I found the same problem when trying to PEFT fine-tune CodeLlama-7B (using LlamaForSequenceClassification): the loss is always 0 during fine-tuning.

Is that also with the MS compression library? Do you also see that training works with v0.6.0 but not the current version? If your situation is not exactly the same, please open a new issue instead and give us as many details as possible so that we can try to reproduce the error.

I seem to have found the problem. I was originally trying to fine-tune CodeLlama on a classification task, where the LoRA parameters were set as follows:

peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1, target_modules = ["q_proj", "v_proj"]
)

In this case, no matter which version of peft I use, I have the problem that the loss stays at 0 after performing a gradient update.

But when I changed the TaskType in the LoRA config to:

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1, target_modules = ["q_proj", "v_proj"]
)

Using peft==0.6.0 no longer causes this problem, while using peft>=0.7.0 still results in a loss of 0.

sssszh avatar Apr 04 '24 07:04 sssszh

Thanks for providing more details. I did a quick check of the diff between v0.6.0 and v0.7.0, but at first glance nothing came up that could explain the difference you see. Please, if you could provide more information, ideally the whole code to replicate the issue, we can probably help; with only a tiny snippet, it's very hard.

BenjaminBossan avatar Apr 04 '24 09:04 BenjaminBossan

@BenjaminBossan Thanks for your reply. This is the full code and the data can be downloaded here: https://drive.google.com/drive/folders/1gaZ-pRb07XMMwSnbpBAyjUsm0_08VNrt?usp=drive_link

transformers == 4.39.0
peft == 0.6.0
import os
os.environ["CUDA_VISIBLE_DEVICES"] = '1'
os.environ["TOKENIZERS_PARALLELISM"] = "false"
import pprint
import json
from tqdm import tqdm
import transformers
from transformers import EvalPrediction, WEIGHTS_NAME
from transformers import Trainer
import torch
from peft import get_peft_model, LoraConfig, TaskType
from transformers import LlamaForSequenceClassification
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')

peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1, target_modules = ["q_proj", "v_proj"]
)

class CodeLLamaBaseDataset(torch.utils.data.Dataset):
    def __init__(self, dataroot, problems, model, max_tokens) -> None:
        super().__init__()
        self.dataroot = dataroot
        self.problems = problems 
        self.model = model
        self.max_tokens = max_tokens
        self.samples = []           
        self.initialize()
        print("===================================================================================")
        print("load tokenizer:", model)

        self.tokenizer = transformers.CodeLlamaTokenizer.from_pretrained(model)
        self.tokenizer.pad_token = self.tokenizer.eos_token

    def initialize(self):

        all_samples = []

        print(f"Loading {len(self.problems)} problems from {self.dataroot}.")

        for idx, line in tqdm(enumerate(self.problems), ncols=0, total=len(self.problems)):
            json_line = json.loads(line)
            code = ' '.join(json_line['func'].split())
            target = json_line["target"]
            sample = (code, target)
            all_samples.append(sample)

        print(f"Loaded {len(all_samples)} samples from {self.dataroot}.")
        self.samples = all_samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        
        inputs = self.pack_samples(idx)
        return inputs
    
    def pack_samples(self, idx):

        sample_pool = self.samples
        code, target = sample_pool[idx]
        source_ids = self.tokenizer.encode(code, max_length=self.max_tokens, padding='max_length', truncation=True)
        attention_ids = torch.tensor(source_ids)
        attention_mask = attention_ids.ne(self.tokenizer.pad_token_id)

        out_sample = {
            "input_ids": torch.tensor(source_ids),
            "attention_mask": attention_mask,
            "labels": torch.tensor(target)
        }

        return out_sample

def run_training(args, train_data, val_data):

    model_path = args.model_path if args.model_path is not None else '{}'.format(args.model)
    print("Loading model from {}...".format(model_path))
    tokenizer = transformers.CodeLlamaTokenizer.from_pretrained(model_path)
    tokenizer.pad_token = tokenizer.eos_token
    model = LlamaForSequenceClassification.from_pretrained(model_path, device_map="auto", torch_dtype=torch.float16, pad_token_id=tokenizer.eos_token_id)
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()
    
    print('Finished loading model {}'.format(args.model))

    start_iteration = 0
    train_data.start_iteration = start_iteration
    print(f"Starting main loop")

    training_args = transformers.TrainingArguments(
        output_dir=args.save_dir,
        overwrite_output_dir=True, 
        
        do_train=True,
        do_eval=True,
        do_predict=False,
        save_strategy='steps',
        evaluation_strategy='steps',
        eval_steps=682, 

        num_train_epochs=args.epochs,
        per_device_train_batch_size=args.batch_size_per_replica,
        per_device_eval_batch_size=4,
        gradient_accumulation_steps=args.grad_acc_steps,

        learning_rate=args.lr,
        weight_decay=0.05,
        warmup_steps=682,
        lr_scheduler_type='linear',

        logging_dir=args.save_dir, 
        logging_first_step=True,
        logging_steps=args.log_freq,
        save_steps=args.save_freq,
        save_total_limit=args.save_total_limit,
        seed=args.seed,
        dataloader_drop_last=True,
        dataloader_num_workers=8,

        local_rank=args.local_rank,

        deepspeed=args.deepspeed,
        fp16=args.fp16,
        
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_data,
        eval_dataset=val_data,
    )
    trainer.train()

    if args.local_rank == 0:
        model.save_pretrained(os.path.join(args.save_dir, "final_checkpoint"))

def get_dataset(args, mode="train"): 
    
    if mode == "train": 
        dataroot = args.train_path
        with open(args.train_path, 'r') as f:
            problems_1 = f.readlines()
    elif mode == "val":
        dataroot = args.val_path
        with open(args.val_path, 'r') as f:
            problems_1 = f.readlines()
    
    if args.db and mode == "train":
        problems_1 = problems_1[:640]
    elif args.db and mode == "val":
        problems_1 = problems_1[:50]
    
    train_data = CodeLLamaBaseDataset(
        dataroot=dataroot,
        problems=problems_1,
        model=args.model,
        max_tokens=1536,
    )

    return train_data

def main(args):

    argsdict = vars(args)
    print(pprint.pformat(argsdict))

    os.makedirs(args.save_dir, exist_ok=True)
    
    # Load dataset 
    train_data = get_dataset(args, "train")
    val_data = get_dataset(args, "val")

    # Save args to file
    json.dump(argsdict, open(os.path.join(args.save_dir, "args.json"), 'w'))

    # Load and train model; save model checkpoints 
    run_training(args, train_data, val_data)


if __name__ == "__main__":
    
    import argparse

    parser = argparse.ArgumentParser(description="Training a model.")
    parser.add_argument('--model', default="codellama/CodeLlama-7b-hf", type=str, help='type of transformers model as model backbone')
    parser.add_argument('--model_path', default="codellama/CodeLlama-7b-hf", type=str, help='path to model backbone pretrained weights') 
    parser.add_argument('--save_dir', default='./outputs', type=str, help='path to save trained model checkpoints') 

    # Dataloading
    parser.add_argument('--train_path', default="./data/train.jsonl", type=str, help='path to training data')
    parser.add_argument('--val_path', default="./data/valid.jsonl", type=str, help='path to validation data')
    parser.add_argument('--test_path', default="./data/test.jsonl", type=str, help='path to test data')

    # Model
    parser.add_argument('--clone_head', default=False, action='store_true', help='Optional: clone a separate linear layer for RL samples and initialize it from finetuned LM head')
    parser.add_argument('--num_labels', default=2, type=int, help="")
    # Training
    parser.add_argument('--epochs', default=30, type=int, help='total number of training epochs')
    parser.add_argument('--lr', default=2e-5, type=float, help='training learning rate')
    parser.add_argument('--batch-size-per-replica', default=2, type=int, help='batch size per GPU')
    parser.add_argument('--grad-acc-steps', default=16, type=int, help='number of training steps before each gradient update')
    parser.add_argument('--deepspeed', default = None, type=str, help='path to deepspeed configuration file; set None if not using deepspeed')
    parser.add_argument('--fp16', default=False, action='store_true', help='set 16-bit training to reduce memory usage')
    parser.add_argument('--local_rank', default=-1, type=int)
    parser.add_argument('--db', default=False, action='store_true', help='set to turn on debug mode i.e. using dummy small data split and only 1 data worker')
    parser.add_argument('--seed', default=123456, type=int)
    # Logging
    parser.add_argument('--log-freq', default=5, type=int, help='save training log after this number of training steps')
    parser.add_argument('--save-freq', default=682, type=int, help='save model checkpoints after this number of training steps')
    parser.add_argument('--save_total_limit', default=30, type=int, help='total of number checkpoints to keep; only keep the latest ones') 

    args = parser.parse_args()

    main(args)

When I select task_type=TaskType.SEQ_CLS, the loss is always NaN after a gradient update. This does not happen when selecting task_type=TaskType.CAUSAL_LM, but with task_type=TaskType.CAUSAL_LM the weights of the "score" head are not saved.

I would appreciate it if you could provide some help!

sssszh avatar Apr 08 '24 04:04 sssszh

@sssszh Thanks for providing the script. I didn't have access to your data, so I requested it. Would this also work with another dataset, like one of the datasets on HF Hub? Ideally, I would prefer to use those instead.

BenjaminBossan avatar Apr 08 '24 12:04 BenjaminBossan

@sssszh Thanks for providing the script. I didn't have access to your data, so I requested it. Would this also work with another dataset, like one of the datasets on HF Hub? Ideally, I would prefer to use those instead.

This dataset is also available on the Hugging Face Hub as claudios/code_x_glue_devign, but you may need to modify the code in the Dataset section because the data format is not the same, e.g. a target of True maps to label 0 and False maps to label 1.

Thanks!

sssszh avatar Apr 08 '24 12:04 sssszh

This dataset is also available on the Hugging Face Hub as claudios/code_x_glue_devign, but you may need to modify the code in the Dataset section because the data format is not the same, e.g. a target of True maps to label 0 and False maps to label 1.

I tried this dataset but ran into a couple of issues (different from your suggestion). So instead of losing more time trying to figure this out, could you grant me access to the dataset that you linked?

BenjaminBossan avatar Apr 08 '24 14:04 BenjaminBossan

This dataset is also available on the Hugging Face Hub as claudios/code_x_glue_devign, but you may need to modify the code in the Dataset section because the data format is not the same, e.g. a target of True maps to label 0 and False maps to label 1.

I tried this dataset but ran into a couple of issues (different from your suggestion). So instead of losing more time trying to figure this out, could you grant me access to the dataset that you linked?

Sorry, it should be accessible now: https://drive.google.com/drive/folders/1gaZ-pRb07XMMwSnbpBAyjUsm0_08VNrt?usp=sharing

sssszh avatar Apr 09 '24 00:04 sssszh

Thanks for providing access; I could finally run the example. First of all, I had to make a few small modifications to the script because of OOM errors, but I don't think those changes make a difference for the issue at hand:

2c2
< os.environ["CUDA_VISIBLE_DEVICES"] = '1'
---
> os.environ["CUDA_VISIBLE_DEVICES"] = '0'
81c81
<     model = LlamaForSequenceClassification.from_pretrained(model_path, device_map="auto", torch_dtype=torch.float16, pad_token_id=tokenizer.eos_token_id)
---
>     model = LlamaForSequenceClassification.from_pretrained(model_path, torch_dtype=torch.float16, pad_token_id=tokenizer.eos_token_id).to(0)
83a84,85
>     # this resolves the issue:
>     # model.base_model.model.score = model.base_model.model.score.original_module
158c160
<         max_tokens=1536,
---
>         max_tokens=128,
164d165
< 
201c202
<     parser.add_argument('--batch-size-per-replica', default=2, type=int, help='batch size per GPU')
---
>     parser.add_argument('--batch-size-per-replica', default=1, type=int, help='batch size per GPU')

This allowed me to replicate the issue of nan loss. Let's break down the problem further:

Using TaskType.SEQ_CLS

Some further research showed that there appears to be a problem with modules_to_save. When we use TaskType.SEQ_CLS, the score layer is automatically added to modules_to_save so that it is fully fine-tuned. When I manually remove it again (see my comment in the diff), the losses don't show NaN. Therefore, this specific issue is somehow related to modules_to_save. As you mentioned, it is independent of the PEFT version; I tested 0.5.0, 0.6.0, 0.6.1, 0.6.2, and 0.7.0. The PyTorch version is 2.2.2 and transformers is 4.39.3.

Using TaskType.CAUSAL_LM

This is a separate issue, because CAUSAL_LM does not involve modules_to_save, and it only appears in 0.7.0. I ran a git bisect and could narrow it down to this commit: #1106. Unfortunately, it is a pretty big commit (a large refactor), so narrowing down further what caused the issue is going to be very hard.

I tried a few other things, just leaving my observations here, even if they are inconclusive:

  • IA³ works even after the refactor
  • LoRA, LoKr, and LoHa all don't work
  • bfloat16 instead of float16 works

I thought maybe it's the specific Llama layer type, but the same issue occurs when setting target_modules=["score"], so it's most likely not Llama-specific. Still, I think that some Llama-specific code could be problematic for LoRA, e.g. these lines where the layer weight is used directly:

https://github.com/huggingface/transformers/blob/08a194fcd615dcf9406a7e319d637cc303097f46/src/transformers/models/llama/modeling_llama.py#L335-L339

These lines completely side-step the LoRA weights and probably won't lead to the correct result (ping @younesbelkada in case he knows more about this). But again, this does not seem to be the source of the problems here.

This issue also seems similar to #1568. However, I cannot imagine that LoRA training with float16 has been completely broken since v0.7.0; I'm sure we would have received many more issues in that case. This leaves me scratching my head as to what the cause could be; I'll investigate further when I have time. In the meantime, if you have the option to use bfloat16 instead of float16, please try it out.

BenjaminBossan avatar Apr 09 '24 15:04 BenjaminBossan

@BenjaminBossan Thank you very much for your reply!

I found that the source of the problem may be the use of torch_dtype=torch.float16. When using torch_dtype=torch.float16, no matter what the LoRA configuration is:

peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1, target_modules = ["q_proj", "v_proj"]
)

or

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1, target_modules = ["q_proj", "v_proj"], modules_to_save=["score"] 
)

Both have the problem that the loss goes to 0 after performing a gradient update.

So the problem may be in the score layer. By inspecting the model during training, I compared the parameter weights of the score layer before and after calling backward() (screenshots omitted here): after the backward pass, they no longer hold sensible values.

I think that's what's causing the loss to be 0, but I'm not sure why using torch_dtype=torch.float16 causes this problem for models with a score layer (I've tried other models, and they show the same problem). However, when the score layer is not involved in training, i.e. using TaskType.CAUSAL_LM without specifying modules_to_save=["score"], torch_dtype=torch.float16 doesn't cause this problem.
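A minimal sketch of this kind of check (illustrative, assuming model is the PEFT-wrapped model returned by get_peft_model): it scans the trainable parameters and their gradients for non-finite values right after backward()/optimizer.step():

import torch

def report_non_finite(model: torch.nn.Module) -> None:
    # Report any trainable parameter or gradient containing NaN/inf values.
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if not torch.isfinite(param).all():
            print(f"non-finite weights in {name}")
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(f"non-finite gradients in {name}")

# e.g. call report_non_finite(model) right after loss.backward() and optimizer.step()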

One solution:

This problem can be solved when I don't use torch_dtype=torch.float16. I changed the model loading and LoRA configuration to:

import torch
from peft import get_peft_model, prepare_model_for_kbit_training, LoraConfig, TaskType
from transformers import LlamaForSequenceClassification, BitsAndBytesConfig

peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1, target_modules = ['v_proj', 'down_proj', 'up_proj', 'q_proj', 'gate_proj', 'k_proj', 'o_proj']
)

q_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)

model = LlamaForSequenceClassification.from_pretrained(model_path, quantization_config=q_config, device_map="auto", pad_token_id=tokenizer.eos_token_id)
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
model = get_peft_model(model, peft_config)

This will allow me to successfully train the model.

Thanks again for your reply!

sssszh avatar Apr 10 '24 01:04 sssszh

I think that's what's causing the loss to be 0, but I'm not sure why using torch_dtype=torch.float16 causes this problem for models with a score layer (I've tried other models, and they show the same problem). However, when the score layer is not involved in training, i.e. using TaskType.CAUSAL_LM without specifying modules_to_save=["score"], torch_dtype=torch.float16 doesn't cause this problem.

I can replicate the problem even without that, i.e. no modules_to_save at all. So unfortunately, this is not the (complete) solution.

This problem can be solved when I don't use torch_dtype=torch.float16.

Great that this works for you. @passby111, could you also check whether bfloat16 instead of float16 solves the issue for you?
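A minimal sketch of that change, based on the loading code from the first comment in this thread (the model path and LoRA hyper-parameters below are placeholders):

import torch
from peft import LoraConfig, TaskType, get_peft_model
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",      # placeholder model path
    torch_dtype=torch.bfloat16,      # bfloat16 instead of float16
)
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
)
model = get_peft_model(model, peft_config)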

BenjaminBossan avatar Apr 10 '24 13:04 BenjaminBossan

One more thing (for those users who cannot use bfloat16): This should also fix the issue, even if using float16:

...
model = get_peft_model(...)
# convert all peft parameters to float32
for param in model.parameters():
    if param.requires_grad:
        param.data = param.data.float()

BenjaminBossan avatar Apr 10 '24 15:04 BenjaminBossan

I also have this issue with the latest TRL and transformers versions when fine-tuning Llama in pure float16. Downgrading to 0.6.2 solves the issue.

matthieu-zimmer avatar Apr 17 '24 10:04 matthieu-zimmer

@matthieu-zimmer Did you try one of the suggested solutions above?

BenjaminBossan avatar Apr 17 '24 12:04 BenjaminBossan

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

github-actions[bot] avatar May 11 '24 15:05 github-actions[bot]

@matthieu-zimmer Did you try one of the suggested solutions above?

param.data = param.data.float() works for mixed precision, yes, but not for training in pure float16.

matthieu-zimmer avatar May 22 '24 12:05 matthieu-zimmer