qlora
Loss drops to zero
Hi, I'm training "TheBloke/guanaco-7B-HF" with qlora. I have almost the same script as the Colab tutorial:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import wandb
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
model_id = "TheBloke/guanaco-7B-HF"
# model_id = "openlm-research/open_llama_7b_700bt_preview"
# model_id = "openlm-research/open_llama_3b_600bt_preview"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = (
    0  # unk. we want this to be different from the eos token
)
tokenizer.padding_side = "left"
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})
from peft import prepare_model_for_kbit_training
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )
from peft import LoraConfig, get_peft_model
config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
print_trainable_parameters(model)
import transformers
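# NOTE (assumption): `data` is expected to be an already tokenized dataset with
# "train" and "test" splits (e.g. a datasets.DatasetDict); its creation is not shown in this post.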
trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    eval_dataset=data["test"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=data["train"].num_rows,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",
        # load_best_model_at_end=True,
        # evaluation_strategy="steps",
        # save_strategy="steps",
        save_steps=1000,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
trainer.train()
Is there any error here? After a few steps my training loss drops to zero and never goes up again. My data is the evolved instructions from WizardLM (Evol-Instruct) and my prompt format is Alpaca.
same issue.
I also have the same problem. The loss first decreases, then it slowly grows till it drops to zero. The image below shows the training loss in the first epoch.
I used llama 7B. Could it be related to the learning rate? I used 3e-4.
[image: training loss during the first epoch]
I also tried this training routine (guanaco 7B qlora):
python qlora.py \
--model_name_or_path huggyllama/llama-7b \
--output_dir ./output/guanaco-7b \
--logging_steps 10 \
--save_strategy steps \
--data_seed 42 \
--save_steps 500 \
--save_total_limit 40 \
--evaluation_strategy steps \
--eval_dataset_size 1024 \
--max_eval_samples 1000 \
--per_device_eval_batch_size 1 \
--max_new_tokens 32 \
--dataloader_num_workers 3 \
--group_by_length \
--logging_strategy steps \
--remove_unused_columns False \
--do_train \
--do_eval \
--do_mmlu_eval \
--lora_r 64 \
--lora_alpha 16 \
--lora_modules all \
--double_quant \
--quant_type nf4 \
--bf16 \
--bits 4 \
--warmup_ratio 0.03 \
--lr_scheduler_type constant \
--gradient_checkpointing \
--dataset oasst1 \
--source_max_len 16 \
--target_max_len 512 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 16 \
--max_steps 1875 \
--eval_steps 187 \
--learning_rate 0.0002 \
--adam_beta2 0.999 \
--max_grad_norm 0.3 \
--lora_dropout 0.1 \
--weight_decay 0.0 \
--seed 0
I suspect the "lr_scheduler_type" and the learning rate too. But the funny thing here is the number of steps. Why are there 1875 steps for the oasst data? Isn't that only about 20% of an epoch at most?
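As a sanity check on that percentage, you can work out how much data 1875 optimizer steps actually cover from the effective batch size. A minimal sketch; `num_train_examples` is a placeholder for the real size of the oasst1 training split, which I haven't checked:

```python
# Rough arithmetic: how many examples do 1875 optimizer steps consume?
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
max_steps = 1875

num_train_examples = 10_000  # placeholder: replace with len(train_dataset) for oasst1

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps  # 16
examples_seen = max_steps * effective_batch_size                                  # 30,000
epochs_covered = examples_seen / num_train_examples

print(f"examples seen: {examples_seen}, epochs covered: {epochs_covered:.2f}")
```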
# optim="paged_adamw_8bit",
try commenting it out.
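In the script above that just means disabling the optim argument, something like this (a sketch of only the relevant lines; without optim, Trainer falls back to its default AdamW implementation):

```python
args = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    output_dir="outputs",
    # optim="paged_adamw_8bit",  # commented out -> Trainer uses its default AdamW
)
```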
@JeongChangsu thank you for the suggestion!
I tried commenting out that line, but it gave me some strange warnings. However, your suggestion helped me find this related issue: #78
I replaced optim="paged_adamw_8bit" with optim="paged_adamw_32bit".
Now it is running and I will tell you if it works.
Do you happen to know the difference between the two options?
What is the difference between optim="paged_adamw_8bit" and optim="paged_adamw_32bit"?
does "paged_adamw_32bit" work?
Yes, it worked for me.
Many thanks
It doesn't work for me...
If my understanding is correct, the 'paged_adamw_8bit' option reduces the memory footprint of the optimizer by storing its internal states in lower precision (8-bit). Basically, it aims to save memory in low-spec training environments.
With the 8-bit quantization, the optimizer step may lose precision, and that could eventually cause the training loss to drop to zero.
On the other hand, 'paged_adamw_32bit' is more or less the default setting that keeps the optimizer states in full precision and uses normal memory. But I don't know exactly; it needs more checking.
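To make that concrete: as far as I know, both values map to the paged AdamW optimizers from bitsandbytes, where the 8-bit variant quantizes the optimizer states and the 32-bit variant keeps them in full precision. A minimal sketch of building them directly (assuming a bitsandbytes version that ships the paged optimizers, roughly 0.39 or newer, and the `model` from the script above):

```python
import bitsandbytes as bnb

# 8-bit paged AdamW: optimizer states stored in 8-bit, less memory but quantization error
optimizer_8bit = bnb.optim.PagedAdamW8bit(model.parameters(), lr=2e-4)

# 32-bit paged AdamW: full-precision optimizer states, paged to CPU RAM under memory pressure
optimizer_32bit = bnb.optim.PagedAdamW32bit(model.parameters(), lr=2e-4)
```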
It also didn't work for me. Have you figured it out?
Do you have any source for this explanation of the 8-bit vs 32-bit optimizers? Any clue where I can read more? I've tried googling the names but no luck.
I think it occurs because of fp16=True. I removed fp16=True and it worked. To speed things up, I used the tf32 option instead of fp16. But I don't know why it works. Can someone explain it to me?
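For reference, this is roughly what that change looks like in the TrainingArguments (a sketch of the relevant lines only; bf16=True is an extra option I would try since the script's bnb_4bit_compute_dtype is bfloat16, but that is my assumption, not what this comment tested):

```python
args = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    output_dir="outputs",
    # fp16=True,  # removed: fp16 mixed precision seemed to trigger the loss collapse here
    tf32=True,    # TF32 matmuls for speed on Ampere+ GPUs, as in this comment
    # bf16=True,  # untested assumption: bf16 would match bnb_4bit_compute_dtype=torch.bfloat16
)
```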
Please check this one:
"A major bug in 8-bit optimizers that could cause some instabilities later in training has been fixed. Please update bitsandbytes to 0.41.1 via pip install -U bitsandbytes. Now the 8-bit optimizer should again reproduce 32-bit optimizer performance."
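A quick way to confirm which version is actually installed in the training environment (nothing qlora-specific here):

```python
from importlib.metadata import version

print(version("bitsandbytes"))  # should be 0.41.1 or newer after the upgrade
```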
Best regards,
Shuyue Nov. 25th, 2023