AssertionError: No inf checks were recorded for this optimizer.

Open erjieyong opened this issue 1 year ago • 3 comments

First of all, a great thank you for sharing this model with the world!

Anyway, I've been trying to train my own model based on this repo.

My objective for this training was to make use of an unsupervised training dataset to get the model to understand how words are written in my domain (basically masked language modelling). The reason I don't use conventional instruction fine-tuning is that no such dataset of sufficient quantity is available to me.

The two main changes I've made are as follows:

  1. Instead of fine-tuning from Llama's weights, I fine-tune from an existing alpaca-lora adapter's weights. As such, I've edited the code as follows:
```python
from peft import (
    # LoraConfig,
    PeftModel,
    get_peft_model,
    get_peft_model_state_dict,
    prepare_model_for_int8_training,
    set_peft_model_state_dict,
)
```

as well as

```python
# config = LoraConfig(
#     r=lora_r,
#     lora_alpha=lora_alpha,
#     target_modules=lora_target_modules,
#     lora_dropout=lora_dropout,
#     bias="none",
#     task_type="CAUSAL_LM",
# )
# model = get_peft_model(model, config)

# replace with this to load directly from alpaca
LORA_WEIGHTS = "tloen/alpaca-lora-7b"
model = PeftModel.from_pretrained(
    model,
    LORA_WEIGHTS,
    torch_dtype=torch.float16,
)
```
  2. Edited the dataset to my own (I am not using the prompt template). My code for generating the dataset is as follows, with a usage sketch after it:
```python
def chunk_text(data):
    # Assumes `tokenizer`, `chunk_size` and `overlap_size` are defined globally.
    concatenated_text = ''
    all_result = []
    for i in range(data['train'].num_rows):
        concatenated_text += data['train']['combined'][i]
    # [1:] drops the BOS token that encode() prepends
    tokenized_concatenated_text = tokenizer.encode(concatenated_text)[1:]
    tokenized_prompt = tokenizer.encode("### Text: ")[1:]
    full_length = len(tokenized_concatenated_text)
    for i in range(0, full_length, chunk_size):
        # overlapping windows of chunk_size + overlap_size tokens
        text = tokenized_concatenated_text[i: i + chunk_size + overlap_size]
        text = tokenized_prompt + text
        text = tokenizer.decode(text)

        result = tokenizer(text, padding=False)
        if result["input_ids"][-1] != tokenizer.eos_token_id:
            result["input_ids"].append(tokenizer.eos_token_id)
            result["attention_mask"].append(1)

        result["labels"] = result["input_ids"].copy()

        all_result.append(result)
    return all_result
```
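The function returns a list of plain tokenized dicts (`input_ids` / `attention_mask` / `labels`). Roughly, I wrap that list into a `datasets.Dataset` before handing it to the Trainer; a sketch of that step (the file name and the `Dataset.from_list` wrapper are just illustrative of my setup, not code from the repo):

```python
from datasets import Dataset, load_dataset

# Hypothetical file name; the real corpus just needs a "combined" text column,
# matching what chunk_text() reads from data["train"]["combined"].
data = load_dataset("json", data_files="my_domain_corpus.json")

# chunk_text() returns plain dicts, so they can be wrapped directly into a
# datasets.Dataset and passed to the Trainer in place of the prompt-template data.
train_data = Dataset.from_list(chunk_text(data)).shuffle(seed=42)
```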

However, I keep facing the following error no matter how I tweak the code. I'd really appreciate any help!

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <cell line: 2>:2                                                                              │
│                                                                                                  │
│ /usr/local/lib/python3.9/dist-packages/transformers/trainer.py:1662 in train                     │
│                                                                                                  │
│   1659 │   │   inner_training_loop = find_executable_batch_size(                                 │
│   1660 │   │   │   self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size  │
│   1661 │   │   )                                                                                 │
│ ❱ 1662 │   │   return inner_training_loop(                                                       │
│   1663 │   │   │   args=args,                                                                    │
│   1664 │   │   │   resume_from_checkpoint=resume_from_checkpoint,                                │
│   1665 │   │   │   trial=trial,                                                                  │
│                                                                                                  │
│ /usr/local/lib/python3.9/dist-packages/transformers/trainer.py:1991 in _inner_training_loop      │
│                                                                                                  │
│   1988 │   │   │   │   │   │   │   xm.optimizer_step(self.optimizer)                             │
│   1989 │   │   │   │   │   elif self.do_grad_scaling:                                            │
│   1990 │   │   │   │   │   │   scale_before = self.scaler.get_scale()                            │
│ ❱ 1991 │   │   │   │   │   │   self.scaler.step(self.optimizer)                                  │
│   1992 │   │   │   │   │   │   self.scaler.update()                                              │
│   1993 │   │   │   │   │   │   scale_after = self.scaler.get_scale()                             │
│   1994 │   │   │   │   │   │   optimizer_was_run = scale_before <= scale_after                   │
│                                                                                                  │
│ /usr/local/lib/python3.9/dist-packages/torch/cuda/amp/grad_scaler.py:368 in step                 │
│                                                                                                  │
│   365 │   │   if optimizer_state["stage"] is OptState.READY:                                     │
│   366 │   │   │   self.unscale_(optimizer)                                                       │
│   367 │   │                                                                                      │
│ ❱ 368 │   │   assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were rec   │
│   369 │   │                                                                                      │
│   370 │   │   retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)         │
│   371                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AssertionError: No inf checks were recorded for this optimizer.

Environment: Python 3.9, CUDA 11.8
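For reference, the assertion is raised inside `torch.cuda.amp.GradScaler.step()` whenever none of the optimizer's parameters received a gradient during the backward pass. A minimal toy sketch (my own repro, not taken from the training script) that triggers the same message:

```python
import torch

# Toy repro: an optimizer that only holds frozen parameters never records
# any inf checks, so GradScaler.step() hits this exact assertion.
frozen = torch.nn.Linear(4, 4).cuda()
for p in frozen.parameters():
    p.requires_grad_(False)                      # nothing trainable, like a fully frozen adapter

dummy = torch.nn.Parameter(torch.zeros((), device="cuda"))   # keeps the loss differentiable
optimizer = torch.optim.AdamW(frozen.parameters(), lr=1e-3)  # only sees the frozen params
scaler = torch.cuda.amp.GradScaler()

loss = frozen(torch.randn(2, 4, device="cuda")).sum() + dummy
scaler.scale(loss).backward()   # the frozen params keep grad=None
scaler.step(optimizer)          # AssertionError: No inf checks were recorded for this optimizer.
```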

erjieyong · Apr 09 '23

@erjieyong did you find a solution? Facing the same error.

nishantb06 · Apr 19 '23

@erjieyong tried different versions of CUDA, same issue.

codebymike · Apr 19 '23

Hey all, I've managed to find an alternative in the end.

First of all, I suspect the error is due to the loaded adapter weights not being trainable: if no parameter requires gradients, the optimizer never records any inf checks, which is exactly what the assertion complains about.
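If you still want to load the published adapter with `PeftModel.from_pretrained`, my understanding (an assumption on my side, and it depends on your peft version) is that you have to explicitly ask peft to keep the LoRA weights trainable, e.g.:

```python
import torch
from peft import PeftModel

# Sketch of the suspected fix (assumption: your peft version exposes the
# `is_trainable` flag). Without it the adapter loads in inference mode and
# every parameter stays frozen, which matches the empty-inf-check assertion.
model = PeftModel.from_pretrained(
    model,
    "tloen/alpaca-lora-7b",
    torch_dtype=torch.float16,
    is_trainable=True,   # keep the loaded LoRA weights trainable
)
model.print_trainable_parameters()  # sanity check: should report a non-zero trainable count
```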

I was able to overcome this by using the resume_from_checkpoint argument built directly into alpaca-lora's [finetune.py](https://github.com/tloen/alpaca-lora/blob/8bb8579e403dc78e37fe81ffbb253c413007323f/finetune.py#L191).

To be more specific, pass the path of the existing adapter that you want to further fine-tune via the resume_from_checkpoint argument when calling finetune.py:

```bash
python finetune.py \
    --base_model='decapoda-research/llama-7b-hf' \
    --num_epochs=10 \
    --cutoff_len=512 \
    --group_by_length \
    --output_dir='./lora-alpaca' \
    --lora_target_modules='[q_proj,k_proj,v_proj,o_proj]' \
    --lora_r=16 \
    --micro_batch_size=8 \
    --resume_from_checkpoint='./alpaca-lora'
```
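Roughly, that code path keeps the LoRA layers created by `LoraConfig` trainable and just copies the saved adapter weights onto them, something like this paraphrased sketch (see the linked finetune.py for the exact code; `load_adapter_weights` is only my illustrative name):

```python
import os
import torch
from peft import set_peft_model_state_dict

def load_adapter_weights(model, checkpoint_dir):
    # Paraphrased sketch of the resume_from_checkpoint branch in finetune.py:
    # the LoRA layers already exist (created via LoraConfig), so we only copy
    # the saved adapter weights onto them and they stay trainable.
    checkpoint_name = os.path.join(checkpoint_dir, "adapter_model.bin")
    if os.path.exists(checkpoint_name):
        adapters_weights = torch.load(checkpoint_name, map_location="cpu")
        set_peft_model_state_dict(model, adapters_weights)
    return model
```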

erjieyong · Apr 20 '23