
Fine-tune OPT with my own dataset

Open xiaomaiaa opened this issue 2 years ago • 6 comments

❓ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

I wonder how I can fine-tune OPT on my own dataset. Thanks a lot!

xiaomaiaa avatar May 18 '22 10:05 xiaomaiaa

Because of the Hugging Face integration here: https://huggingface.co/docs/transformers/model_doc/opt

you should be able to train most of the OPT models (not the biggest ones) the same way you would train any other Hugging Face model.
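
For illustration, a minimal sketch of that approach could look like the following (the dataset file, hyperparameters, and the opt-350m checkpoint are placeholder choices, not a tested recipe):

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# One of the smaller checkpoints; the larger ones need model parallelism.
model_name = "facebook/opt-350m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Plain-text dataset with one example per line; replace with your own file.
dataset = load_dataset("text", data_files={"train": "train.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    train_dataset=tokenized,
    args=TrainingArguments(
        output_dir="opt-350m-finetuned",
        per_device_train_batch_size=2,
        num_train_epochs=3,
        learning_rate=5e-5,
        fp16=torch.cuda.is_available(),
    ),
    # mlm=False gives the standard causal-LM objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()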

Skyy93 avatar Jun 02 '22 13:06 Skyy93

What if I wanted to fine-tune the 6B or 13B models? Hugging Face is not optimized for that, and without model parallelism I'm not sure they would fit on a single GPU, even a 40GB one (I've had to use model parallelism for 6B GPT-Neo models). Do you have a starting point for fine-tuning in your more optimized code base? I see the train endpoint, which looks like it hooks into Megatron?

lorr1 avatar Jun 08 '22 02:06 lorr1

same question

Dod-o avatar Jul 08 '22 14:07 Dod-o

Can anyone link me to a Google Colab or web page showing how to do this? I am trying to use the Trainer to train opt-350m, but I'm not having any luck.

DeepTitan avatar Dec 23 '22 17:12 DeepTitan

If anyone else is wondering how to get fine-tuning working, I have put together some code based on various internet sources I came across. Unfortunately, I cannot link to references, as I had to hack this together from multiple things and only have the resulting scripts now ...

I have had success with both the 125m and the 350m models, using the same settings. Adjust the dataset, output directory, train steps, etc. as needed.

I suggest installing the following dependencies before going further:

pip install -q datasets accelerate loralib
pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git

For fine-tuning, I used the following code:

import os
# Hack for my machine (Windows): restrict training to a single GPU
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

# Plain-text files with one training example per line
dialogs_dataset = load_dataset("text", data_files={"train": "train.txt", "test": "test.txt"})

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

# Freeze the base model (only the LoRA adapters added below are trained) and
# cast 1-D parameters such as layer norms to fp32 for numerical stability
for param in model.parameters():
    param.requires_grad = False
    if param.ndim == 1:
        param.data = param.data.to(torch.float32)
model.gradient_checkpointing_enable()
model.enable_input_require_grads()

# Compute the LM head output (and hence the loss) in fp32
class CastOutputToFloat(nn.Sequential):
    def forward(self, x):
        return super().forward(x).to(torch.float32)

model.lm_head = CastOutputToFloat(model.lm_head)

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)

train_dataset = dialogs_dataset["train"].map(lambda x: tokenizer(x["text"]), batched=True)

trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
    args=TrainingArguments(
        output_dir="opt-125m-fine-tuned",
        per_device_train_batch_size=4,
        warmup_steps=100,
        max_steps=20000,
        save_steps=400,
        learning_rate=1e-4,
        logging_steps=100,
    ),
    # mlm=False gives the causal-LM objective (labels are the shifted inputs)
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
# Caching is incompatible with gradient checkpointing; disable it during training
model.config.use_cache = False
trainer.train()

model.save_pretrained("opt-125m-fine-tuned/peft-model")

NOTE that this is LoRA, not "proper" full fine-tuning. When I tried to train the model without LoRA, my loss just would not converge after several thousand steps, so I gave up on that. I have no idea how the performance differs (if it differs at all) compared to full fine-tuning.
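
As a quick sanity check of how little LoRA actually trains, PEFT can report the trainable parameter count right after the adapter is attached; this is just a two-line addition to the training script above:

# right after model = get_peft_model(model, config)
model.print_trainable_parameters()
# prints trainable vs. total parameter counts; with r=16 on q_proj/v_proj
# this should be well under 1% of opt-125m's ~125M parameters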

For inference, I put together this code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig

# Sanity check that a GPU is visible
print(torch.cuda.is_available())

def load_fine_tuned_model():
    model_id = "./opt-125m-fine-tuned/peft-model"
    # The PEFT config records which base model the adapter was trained on
    config = PeftConfig.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        config.base_model_name_or_path, return_dict=True, device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
    # Attach the trained LoRA adapter weights on top of the base model
    peft_model = PeftModel.from_pretrained(model, model_id)

    return peft_model, tokenizer

model, tokenizer = load_fine_tuned_model()

# Use your model as you would normally do ...
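
For example, a simple generation call with the loaded model could look like this (the prompt and decoding parameters are just placeholders):

prompt = "Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        top_p=0.9,
        temperature=0.7,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))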

If anyone's wondering, the 20k steps took ~1 hour on an RTX 3090, and my dataset for this specific experiment consisted of ~1500 items. Still, the results were promising.

bokovhu avatar Jun 21 '23 13:06 bokovhu

Could you please tell me what your data looks like?

Tchagoue avatar Mar 28 '24 15:03 Tchagoue