
Are there any guidelines for loading a CPT (continued pre-training) model and retraining it on a different data set?

Open daegonYu opened this issue 1 year ago • 5 comments

Can I load a model trained with Unsloth's CPT (continued pre-training) method, make only the saved LoRA parameters trainable again, and then continue CPT on a different dataset? In other words, I want to keep training the LoRA parameters of the CPT model on a new dataset. Are there any reference documents or guidelines? If I run the code below to continue CPT on a different dataset, won't a second set of LoRA layers be created on top of the existing ones? I want to reuse the LoRA layers created in the previous CPT step as they are.



model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "my_cpt_model",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 128,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head",],  # embed_tokens and lm_head are included for continued pretraining
    lora_alpha = 32,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = True,
    loftq_config = None,
)

daegonYu avatar Oct 02 '24 08:10 daegonYu

@daegonYu Yes, that should work (I think) - the continued pretraining notebook does train on the same LoRA adapters twice - https://colab.research.google.com/drive/1tEd1FrOXWMnCU9UIvdYhs61tkxdMuKZu?usp=sharing - so it should function (hopefully)

danielhanchen avatar Oct 03 '24 09:10 danielhanchen

If I load the LoRA model directly and train it with UnslothTrainer without calling get_peft_model(), it trains with the previously created LoRA parameters. Thank you for your answer.
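For reference, roughly the pattern I mean (a minimal sketch; the dataset, lengths, and hyperparameters are placeholders, not from the notebook):

# Load the adapter saved from the previous CPT run directly.
# Unsloth re-attaches the existing LoRA layers, so get_peft_model() is NOT called again.
from unsloth import FastLanguageModel, UnslothTrainer, UnslothTrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "my_cpt_model",          # folder containing the saved LoRA adapter
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = new_dataset,          # the different dataset for the second CPT stage
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        max_steps = 60,
        learning_rate = 5e-5,
        embedding_learning_rate = 5e-6,   # smaller LR for embed_tokens / lm_head, as in the CPT notebook
        output_dir = "outputs",
    ),
)
trainer.train()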

daegonYu avatar Oct 04 '24 01:10 daegonYu

Additionally, I have a question. When training a decoder model, my understanding is that the instruction part is fed to the model but the loss is computed only on the response part. In the Colab you suggested, however, the data is used for training without that distinction. Can the model still learn effectively this way? Also, could you point me to a blog post or paper that explains this?

daegonYu avatar Oct 04 '24 01:10 daegonYu

@daegonYu You might be interested in our conversational notebook which masks out the instruction - https://colab.research.google.com/drive/1T5-zKWM_5OD21QHwXHiV9ixTRR7k3iB9?usp=sharing

Also see https://github.com/unslothai/unsloth/wiki#train-on-completions--responses-only-do-not-train-on-inputs
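Under the hood, "train on responses only" just sets the label of every prompt token to -100, which the cross-entropy loss ignores, so only the response tokens contribute to the gradient. A toy sketch of the idea (assumes a tokenizer is already loaded; the marker string is just for illustration):

# Toy illustration: mask everything up to and including the response marker with -100.
input_ids = tokenizer(
    "### Question: What is 2 + 2?\n ### Answer: 4",
    return_tensors = "pt",
).input_ids[0]

labels = input_ids.clone()
response_ids = tokenizer(" ### Answer:", add_special_tokens = False).input_ids
for start in range(len(input_ids) - len(response_ids) + 1):
    if input_ids[start : start + len(response_ids)].tolist() == response_ids:
        labels[: start + len(response_ids)] = -100  # prompt + marker ignored by the loss
        break
# The model is then trained on (input_ids, labels); loss flows only from the answer tokens.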

danielhanchen avatar Oct 05 '24 08:10 danielhanchen

Oh, this is what I was looking for. Thank you!

daegonYu avatar Oct 05 '24 09:10 daegonYu

One thing I'm wondering about while researching this: is it correct to assume that using DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer) has the same effect as using DataCollatorForSeq2Seq(tokenizer=tokenizer) together with train_on_responses_only(trainer, response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n")?

Here is a more detailed example:


# Approach 1: TRL's DataCollatorForCompletionOnlyLM
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM

dataset = load_dataset("lucasmccabe-lmi/CodeAlpaca-20k", split="train")

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['instruction'])):
        text = f"### Question: {example['instruction'][i]}\n ### Answer: {example['output'][i]}"
        output_texts.append(text)
    return output_texts

# Everything up to and including the response template is masked out of the loss.
response_template = " ### Answer:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    args=SFTConfig(output_dir="/tmp"),
    formatting_func=formatting_prompts_func,
    data_collator=collator,
)


# Approach 2: DataCollatorForSeq2Seq + Unsloth's train_on_responses_only
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    # instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

daegonYu avatar Oct 14 '24 13:10 daegonYu

@daegonYu Sorry for the delay! Yes, they're equivalent EXCEPT if you're doing more than 1 conversation per example (multi-turn). HF's collator does not support that, whilst Unsloth's does.
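A quick way to check what actually gets trained on is to decode the collated labels; a small sketch (assumes the trainer and tokenizer from the snippets above):

# Inspect one collated batch: tokens whose label is -100 are ignored by the loss.
batch = next(iter(trainer.get_train_dataloader()))
labels = batch["labels"][0]
trained_ids = [t for t in labels.tolist() if t != -100]
print(tokenizer.decode(trained_ids))  # should print only the assistant/response text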

danielhanchen avatar Oct 18 '24 08:10 danielhanchen

May I ask about max_seq_length during DPO training? When initializing the model from the SFT model, I do the following:

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = f"{args.ckpt_name}",
    max_seq_length = max_seq_length,
)

with max_seq_length = 4096, but in the SFT trainer this argument was 2048. What is the relation between max_seq_length and the arguments used when initializing the DPO trainer, e.g. max_length and max_prompt_length = prompt_length?

Candice1995 avatar Oct 24 '24 09:10 Candice1995

@Candice1995 Apologies for the delay - DPO has a prompt plus 2 other fields: the accepted (chosen) and rejected answers to that prompt. These fields have varying lengths, so we have to truncate or specify a length for each. Unsloth's max_seq_length is the maximum total length of all the fields combined.
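For concreteness, a rough sketch of how the pieces relate (the checkpoint path, dataset, and values are placeholders, and the keyword arguments follow the older TRL DPOTrainer API that Unsloth's DPO notebooks used at the time):

from unsloth import FastLanguageModel, PatchDPOTrainer
PatchDPOTrainer()                       # apply Unsloth's DPO patches before building the trainer
from trl import DPOTrainer
from transformers import TrainingArguments

max_seq_length = 4096                   # must cover prompt + the longer of chosen/rejected

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "my_sft_model",        # hypothetical SFT checkpoint
    max_seq_length = max_seq_length,
    load_in_4bit = True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    tokenizer = tokenizer,
    train_dataset = dpo_dataset,        # needs "prompt", "chosen", "rejected" columns
    beta = 0.1,
    max_length = max_seq_length,        # prompt + completion after truncation
    max_prompt_length = max_seq_length // 2,  # portion reserved for the prompt
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        max_steps = 60,
        learning_rate = 5e-6,
        output_dir = "outputs",
    ),
)
dpo_trainer.train()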

danielhanchen avatar Oct 27 '24 09:10 danielhanchen