After LoRa training (or loading the checkpoint) consecutive inference gives different results even if do_sample is False
Hi there,
I noticed another critical bug (at least from my point of view): after LoRA training, and even with do_sample=False, consecutive inference runs give different results.
Loading the base model:
from unsloth import FastLanguageModel
import torch

model_name = "unsloth/mistral-7b-instruct-v0.2-bnb-4bit"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

tokenizer.padding_side = 'left'  # right padding for training, left for inference
tokenizer.pad_token = tokenizer.eos_token
Setting up LoRA:
from unsloth import FastLanguageModel

model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 32,
    lora_dropout = 0.05,
    bias = "none",
    use_gradient_checkpointing = True,
    use_rslora = False,
    loftq_config = None,
)

model.print_trainable_parameters()
Training:
from transformers import TrainingArguments
from trl import SFTTrainer
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    dataset_num_proc = 6,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 1,
        warmup_steps = 5,
        max_steps = 10000,
        learning_rate = lr,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        save_steps = save_steps,
        logging_steps = logging_steps,
        optim = "adamw_8bit",
        logging_dir = f'logs/{output_dir}_{get_date_time()}',
        weight_decay = 0.005,
        lr_scheduler_type = "linear",
        output_dir = output_dir,
        report_to = "tensorboard",
    ),
)
trainer_stats = trainer.train()
I trained for 10K steps and then ran inference:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer([txt], return_tensors = "pt").to("cuda")
response = model.generate(**inputs, max_new_tokens = max_new_tokens, do_sample=False).cpu().numpy()
token_ids_list = response.squeeze().tolist()
text = tokenizer.decode(token_ids_list, skip_special_tokens=True)
Even though do_sample=False, the responses differ from run to run (even if I reload the checkpoint).
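To make the non-determinism concrete, here is a minimal repeatability check, just a sketch that assumes the model, tokenizer and txt from above are already in scope (the generate_ids helper is mine, not part of unsloth): greedy decoding with do_sample=False should give identical token IDs on back-to-back runs, but right after training it does not.

import torch

def generate_ids(model, tokenizer, txt, max_new_tokens=128):
    # Greedy-decode a single prompt and return the raw token IDs.
    inputs = tokenizer([txt], return_tensors="pt").to("cuda")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return out.squeeze().tolist()

first = generate_ids(model, tokenizer, txt)
second = generate_ids(model, tokenizer, txt)
print("identical:", first == second)  # expected True for greedy decoding, but comes out False here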
But if I save the model:
model.save_pretrained_merged("lora", tokenizer, save_method = "lora",)
and then load it:
from unsloth import FastLanguageModel
import torch
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
then all outputs are consistently the same:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer([txt], return_tensors = "pt").to("cuda")
response = model.generate(**inputs, max_new_tokens = max_new_tokens, do_sample=False).cpu().numpy()
token_ids_list = response.squeeze().tolist()
text = tokenizer.decode(token_ids_list, skip_special_tokens=True)
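Running the same check against the reloaded adapter (reusing the hypothetical generate_ids helper from the sketch above) shows the behaviour I describe: every greedy decode now matches.

# Repeat the same prompt several times against the reloaded model; all decodes match.
outputs = [generate_ids(model, tokenizer, txt) for _ in range(5)]
print("all identical:", all(o == outputs[0] for o in outputs))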
Is this single-batch inference?
If you mean a single element in the batch, then yes it is, since the txt variable is just a string.
@ziemowit-s I may have solved it with yesterday's patch, but I'm not sure.