Num examples of SFTTrainer decreased to 4862 from 109955 (original data)
This is my attempt at corpus training with unsloth. The model loading is the same as in the unsloth example code.
I then changed r and alpha from the default 16 to 64 and added dropout (0.1):
model = FastLanguageModel.get_peft_model(
    model,
    r = 64,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0.1,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)
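For context, LoRA adds two small matrices per targeted weight, so the number of trainable parameters grows roughly linearly with r. A back-of-the-envelope sketch only, assuming Llama-3-8B-sized projection shapes since the post does not state the model (the dimensions and layer count below are assumptions):

# Rough sketch: each LoRA adapter adds A (d_out x r) + B (r x d_in) parameters.
# Shapes assume a Llama-3-8B-style decoder with 32 layers; adjust for the real model.
r = 64
shapes = {
    "q_proj":    (4096, 4096),
    "k_proj":    (1024, 4096),
    "v_proj":    (1024, 4096),
    "o_proj":    (4096, 4096),
    "gate_proj": (14336, 4096),
    "up_proj":   (14336, 4096),
    "down_proj": (4096, 14336),
}
per_layer = sum(r * (d_out + d_in) for d_out, d_in in shapes.values())
print(per_layer * 32)  # ~167M trainable params at r=64, versus ~42M at the default r=16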
The dataset (name = combined_dataset) consists of a bunch of sentences, as you can see:
print("Dataset structure:", combined_dataset)
and I used the same code as the unsloth example accordingly (train_dataset, dataset_text_field):
EOS_TOKEN = tokenizer.eos_token

def formatting_func(example):
    return example["sentence"] + EOS_TOKEN
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    train_dataset = combined_dataset,
    dataset_text_field = "sentence",
    tokenizer = tokenizer,
    max_seq_length = max_seq_length,
    packing = True,
    formatting_func = formatting_func,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_ratio = 0.03,
        max_grad_norm = 1.0,
        num_train_epochs = 1,
        learning_rate = 2e-5,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.1,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "outputs",
    ),
)
Then, when I train with trainer_stats = trainer.train(), it shows that Num examples decreased.
I did not notice this at first and waited for the result.
8552.5999 seconds used for training.
142.54 minutes used for training.
Peak reserved memory = 11.16 GB.
Peak reserved memory for training = 4.801 GB.
Peak reserved memory % of max memory = 23.477 %.
Peak reserved memory for training % of max memory = 10.1 %.
This is the wandb result you might need.
I cannot clearly say the model is well-trained when I try to run inference as intended. As soon as I noticed the decreased num_examples, I re-ran all the code just in case, but it shows the same decreased number (4862). Now I am not sure whether I did something wrong or whether it is a bug.
@skmanzg Yes, packing = True essentially combines short and long sequences into one example, hence the count decreases.
Would it be OK to say it trained on all 109955 examples then? One more question: can you link the source or explain how packing works in detail?
@skmanzg https://huggingface.co/docs/trl/en/sft_trainer#packing-dataset--constantlengthdataset-
I would turn it off to see if the results are better
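For intuition, packing roughly works like the sketch below. This is a simplified illustration, not the actual trl ConstantLengthDataset code, and the helper name pack_examples is made up:

# Simplified idea of packing: tokenize every text, concatenate the token
# streams, then cut them into fixed-length blocks of max_seq_length.
def pack_examples(texts, tokenizer, max_seq_length):
    all_ids = []
    for text in texts:
        all_ids.extend(tokenizer(text)["input_ids"])
    # Each block becomes one "example" for the trainer.
    return [all_ids[i : i + max_seq_length]
            for i in range(0, len(all_ids), max_seq_length)]

# So the reported Num examples ends up roughly total_tokens / max_seq_length
# (here 4862), even though all 109955 sentences are still consumed.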
@danielhanchen This is the result without packing.
I had to reduce the size of the LoRA and change parameters to keep the loss from just oscillating. Although it may look less stable than the packed run, at least it used all of the data... What do you think of this?
Yes looks fine to me!
probs increase grad accumulation steps to smooth out the loss
Increasing grad accumulation might smooth out the loss? OK, thank you.
Hmm probs not - i would just inc grad accum
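For reference, gradient accumulation raises the effective batch size without extra GPU memory, so each optimizer step averages gradients over more examples and the logged loss usually looks smoother. A sketch of the change, with the numbers chosen only as an example, not a recommendation:

from transformers import TrainingArguments

# Effective batch size = per_device_train_batch_size * gradient_accumulation_steps.
# Going from 4 to 16 accumulation steps averages over 4x more examples per
# optimizer step, which usually smooths the logged loss curve.
args = TrainingArguments(
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 16,  # was 4; effective batch size 2 * 16 = 32
    learning_rate = 2e-5,
    num_train_epochs = 1,
    output_dir = "outputs",
)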
I am using packing = False and am still getting a much smaller Num examples:
Map (num_proc=15): 100%|██████████| 198460/198460 [05:22<00:00, 615.72 examples/s]
Unsloth - 2x faster free finetuning | Num GPUs = 1
Num examples = 210 | Num Epochs = 3
Batch size per device = 2 | Gradient Accumulation steps = 4
Total batch size = 8 | Total steps = 78
Number of trainable parameters = 20,766,720
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    formatting_func = format_instruction,
    max_seq_length = max_seq_length,
    dataset_num_proc = 15,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        num_train_epochs = 3, # Set this for 1 full training run.
        # num_train_epochs = 5
        save_strategy = "steps",
        save_steps = 0.05,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        # bf16 = is_bfloat16_supported(),
        bf16 = True,
        warmup_steps = 10,
        logging_steps = 20,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "/clearml_agent_cache/storage_manager/bhupendra_workdir/gemma-2-2b-fintune-dir/checkpoints_gemma2b-2-050824/",
    ),
)
Hey sorry, I fixed it. The problem was with my formatting function; it used to work with batch_size = 1 with SFTTrainer directly from trl.
New formatting function:
def formatting_prompts_func(examples):
    texts = []
    prompts = examples["prompt"]
    outputs = examples["selected_response"]
    for prompt, output in zip(prompts, outputs):
        text = f"""{prompt}\n\n{output}""" + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
Problematic old function:

def format_instruction(sample):
    return [f"""{sample['prompt']}\n\n{sample['selected_response']}"""]
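Note for anyone hitting the same thing: as far as I can tell, the formatting function is applied to batches (a dict of lists), so it has to return one string per row. The old version stringified the whole lists into a single text per batch, which would explain why 198460 rows collapsed to only 210 examples. A minimal illustration with a made-up two-row batch:

# Made-up two-row batch, shaped like what a batched map passes in.
batch = {
    "prompt": ["Q1", "Q2"],
    "selected_response": ["A1", "A2"],
}

EOS_TOKEN = "</s>"  # placeholder; in practice use tokenizer.eos_token

# Old function: the f-string renders the whole lists, so every batch
# collapses into a single text regardless of how many rows it holds.
def format_instruction(sample):
    return [f"""{sample['prompt']}\n\n{sample['selected_response']}"""]

print(len(format_instruction(batch)))                      # 1

# New function: iterate row by row and return one text per row.
def formatting_prompts_func(examples):
    texts = []
    for prompt, output in zip(examples["prompt"], examples["selected_response"]):
        texts.append(f"{prompt}\n\n{output}" + EOS_TOKEN)
    return { "text": texts }

print(len(formatting_prompts_func(batch)["text"]))          # 2, one per row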