
Can we use SFTTrainer for pre-training?

Open wennycooper opened this issue 1 year ago • 7 comments

Hi,

I can't find any document talking about how to use TRL for pre-training.

Can we use SFTTrainer to do pre-training? I mean, I can collect a corpus, split it into chunks, and save those chunks as rows of a training dataset (in a text field), roughly as in the sketch below, and then just follow the quickstart guide of SFTTrainer.
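A minimal sketch of what I mean by the chunking step (documents, split_into_chunks, and chunk_size below are just placeholder names):

from datasets import Dataset

## placeholder helper: split each raw document into fixed-size character chunks
def split_into_chunks(documents, chunk_size=2000):
    for doc in documents:
        for i in range(0, len(doc), chunk_size):
            yield {"text": doc[i : i + chunk_size]}

documents = ["...raw corpus text...", "...more raw corpus text..."]
pretrain_dataset = Dataset.from_list(list(split_into_chunks(documents)))

The quickstart example itself, using the imdb dataset from the docs: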

from datasets import load_dataset

dataset = load_dataset("imdb", split="train")

## check the dataset
print(dataset)
Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

print(dataset[0])
{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, even then it\'s not shot like some cheaply made porno. While my countrymen mind find it shocking, in reality sex and nudity are a major staple in Swedish cinema. Even Ingmar Bergman, arguably their answer to good old boy John Ford, had sex scenes in his films.<br /><br />I do commend the filmmakers for the fact that any sex shown in the film is shown for artistic purposes rather than just to shock people and make money to be shown in pornographic theaters in America. I AM CURIOUS-YELLOW is a good film for anyone wanting to study the meat and potatoes (no pun intended) of Swedish cinema. But really, this film doesn\'t have much of a plot.',
 'label': 0}

## training
from trl import SFTTrainer

trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
)

trainer.train()

Will this work??

wennycooper avatar May 23 '24 07:05 wennycooper

Up! Similar question here 🙌🏻 cc: @younesbelkada if you have some time, please 🤗

alielfilali01 avatar May 26 '24 15:05 alielfilali01

Hi there! Yes, you could definitely use SFTTrainer for pre-training, but you would need enough GPU RAM. Let's assume you want to pre-train a Llama model using SFTTrainer; you would first need to:

1- Create a LlamaConfig:

from transformers import LlamaConfig

llama_config = LlamaConfig(num_hidden_layers=4, hidden_size=1024) 

2- Initialize your random model:

from transformers import LlamaForCausalLM

model = LlamaForCausalLM(llama_config)

3- Pass that model to SFTTrainer

I would use a high learning rate. In case you run into CUDA OOM issues, you could use the GaLore optimizer: https://huggingface.co/docs/transformers/main/en/trainer#galore

A complete example script is below:

import datasets
import trl

from transformers import TrainingArguments, AutoConfig, AutoTokenizer, AutoModelForCausalLM

## plain-text corpus; each row's "text" field becomes a training sample
train_dataset = datasets.load_dataset('imdb', split='train')

## GaLore optimizer to keep the memory footprint low
args = TrainingArguments(
    output_dir="./test-galore",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="galore_adamw",
    optim_target_modules=["attn", "mlp"],
)

model_id = "google/gemma-2b"

## load only the config, then build a randomly initialized model from it
config = AutoConfig.from_pretrained(model_id)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_config(config).to(0)

trainer = trl.SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    train_dataset=train_dataset,
    dataset_text_field='text',
    max_seq_length=512,
)

trainer.train()

younesbelkada avatar May 28 '24 18:05 younesbelkada

And how would this differ from using the Trainer instead? Like, is there any fundamental difference in the implementation itself? Thanks for answering our questions, this has been really insightful for me at least 🤗

alielfilali01 avatar May 29 '24 00:05 alielfilali01

Hi @alielfilali01 Thanks a lot! SFTTrainer is just a user-friendly wrapper around Trainer; if you are more familiar with Trainer, you could do the same with it as well.
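For illustration, here is a rough sketch of the same from-scratch setup with the plain Trainer; the tokenization and the causal-LM data collator that SFTTrainer handles internally have to be set up by hand (the 512-token max length and the tokenize helper below are just assumptions for this sketch):

import datasets
from transformers import (
    AutoConfig, AutoModelForCausalLM, AutoTokenizer,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

model_id = "google/gemma-2b"
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_config(config)

raw_dataset = datasets.load_dataset("imdb", split="train")

## tokenize the "text" column ourselves (SFTTrainer does this internally)
def tokenize(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

train_dataset = raw_dataset.map(tokenize, batched=True, remove_columns=raw_dataset.column_names)

## causal-LM collator: copies input_ids into labels so the model computes the LM loss
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="./test-trainer",
    max_steps=100,
    per_device_train_batch_size=2,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=collator,
)

trainer.train()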

younesbelkada avatar May 30 '24 14:05 younesbelkada

Thank you so much dear @younesbelkada, that's been immensely helpful 🤗

alielfilali01 avatar May 30 '24 19:05 alielfilali01

I have a question: will the model be able to answer questions like "What is the capital of the USA?" if I do pre-training rather than fine-tuning?

KlausikPL avatar Jun 06 '24 12:06 KlausikPL

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

github-actions[bot] avatar Jun 30 '24 15:06 github-actions[bot]