metaseq
Fine-tune OPT with my own dataset
❓ Questions and Help
Before asking:
- search the issues.
- search the docs.
What is your question?
I'm wondering how to fit my own data into OPT training. Thanks a lot!
Because of the Hugging Face integration here: https://huggingface.co/docs/transformers/model_doc/opt
you should be able to train most of the OPT models (not the big ones) the same way you would train any other Hugging Face model.
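For illustration, a minimal sketch of loading an OPT checkpoint through transformers (using the smallest model as an example):

from transformers import AutoModelForCausalLM, AutoTokenizer

# OPT loads through the standard causal-LM classes, so the usual
# Trainer / fine-tuning workflow applies unchanged
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")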
What if I wanted to fine-tune the 6B or 13B models? Hugging Face is not optimized for this, and without model parallelism I'm not sure they would fit on a single GPU (even a 40GB one); I've had to use model parallelism for the 6B GPT-Neo models. Do you have a starting point for fine-tuning in your more optimized code base? I see the train endpoint, which looks like it hooks into Megatron?
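As a partial workaround on the Hugging Face side, a larger checkpoint can at least be sharded across whatever devices are available via accelerate. A rough sketch, assuming accelerate is installed and there is enough combined GPU/CPU memory:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# device_map="auto" lets accelerate place layers across GPUs (and CPU, if needed);
# loading in float16 halves the memory footprint of the weights
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-6.7b")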
same question
Can anyone link me to a Google Colab or webpage showing how to do this? I am trying to use the Trainer to train opt-350m, but I am not having any luck doing so.
If anyone else is wondering how to solve the fine-tuning, I have put together some code based on various internet sources I came across. Unfortunately, I cannot link to references, as I had to hack this together from multiple things, and I only have these scripts now...
I have had success with both the 125m and the 350m models, with the same settings. Adjust the dataset, output directory, train steps, etc. as needed.
I suggest installing the following dependencies before going further:
pip install -q datasets accelerate loralib
pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git
For fine-tuning, I used the following code:
import os

# Hack for my machine, and Windows: restrict training to the first GPU
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

# One training example per line in each text file
dialogs_dataset = load_dataset("text", data_files={"train": "train.txt", "test": "test.txt"})

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

# Freeze the base model; cast the 1-dimensional parameters (layer norms,
# biases) to float32 for training stability
for param in model.parameters():
    param.requires_grad = False
    if param.ndim == 1:
        param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()
model.enable_input_require_grads()

# Cast the LM head output to float32 for a stable loss
class CastOutputToFloat(nn.Sequential):
    def forward(self, x):
        return super().forward(x).to(torch.float32)

model.lm_head = CastOutputToFloat(model.lm_head)

# LoRA adapters on the attention query and value projections
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)

train_dataset = dialogs_dataset["train"].map(lambda x: tokenizer(x["text"]), batched=True)

trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
    args=TrainingArguments(
        output_dir="opt-125m-fine-tuned",
        per_device_train_batch_size=4,
        warmup_steps=100,
        max_steps=20000,
        save_steps=400,
        learning_rate=1e-4,
        logging_steps=100,
    ),
    # mlm=False gives standard causal language modeling labels
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)

# Caching is incompatible with gradient checkpointing during training
model.config.use_cache = False
trainer.train()

model.save_pretrained("opt-125m-fine-tuned/peft-model")
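One sanity check worth adding right after the get_peft_model call (not in my original script) is peft's print_trainable_parameters, which confirms that only the LoRA adapters will be updated:

# Prints a summary like "trainable params: ... || all params: ... || trainable%: ..."
model.print_trainable_parameters()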
NOTE that this is LoRA, and not "proper fine-tuning". When I tried to train the model further without LoRA, my loss just would not converge after several thousand steps, so I gave up on that. I have no idea how the performance of the model differs (or if it differs) compared to "proper fine-tuning".
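If you prefer a single standalone checkpoint over base model plus adapters, peft can also fold the trained LoRA weights back into the base model. A small sketch, assuming the trained model from above (the output path is just an example); this only merges the adapters and is still not equivalent to "proper fine-tuning":

# Fold the LoRA weights into the base model and save a plain
# transformers checkpoint (example output path)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("opt-125m-fine-tuned/merged-model")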
For inference, I put together this code:
import torch

print(torch.cuda.is_available())

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig

def load_fine_tuned_model():
    model_id = "./opt-125m-fine-tuned/peft-model"
    config = PeftConfig.from_pretrained(model_id)
    # Load the frozen base model the adapters were trained on
    model = AutoModelForCausalLM.from_pretrained(
        config.base_model_name_or_path, return_dict=True, device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
    # Attach the trained LoRA adapters
    peft_model = PeftModel.from_pretrained(model, model_id)
    return peft_model, tokenizer

model, tokenizer = load_fine_tuned_model()
# Use your model as you would normally do ...
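For completeness, a minimal generation sketch with the loaded model; the prompt and generation length are just placeholders:

# Tokenize a placeholder prompt and move it to the model's device
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))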
If anyone's wondering, the 20k steps took ~1 hour on an RTX 3090, and my dataset for this specific experiment consisted of ~1500 items. Still, the results were promising.
Could you please tell me what your data looks like?