
Loss drops to zero

jocastrocUnal opened this issue · 15 comments

Hi, I'm training "TheBloke/guanaco-7B-HF" with QLoRA. I have almost the same script as the Colab tutorial:

import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"

import torch
import wandb
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "TheBloke/guanaco-7B-HF"
# model_id = "openlm-research/open_llama_7b_700bt_preview"
# model_id = "openlm-research/open_llama_3b_600bt_preview"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = (
    0  # unk. we want this to be different from the eos token
)
tokenizer.padding_side = "left"
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})

from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8, 
    lora_alpha=32, 
    target_modules=["q_proj","k_proj","v_proj"], 
    lora_dropout=0.05, 
    bias="none", 
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

import transformers

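# Note: `data` used below is assumed to be a tokenized DatasetDict with "train"
# and "test" splits, prepared earlier in the notebook (not shown in this snippet).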
trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    eval_dataset=data['test'],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=data["train"].num_rows,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",
        # load_best_model_at_end=True,
        # evaluation_strategy="steps",
        # save_strategy="steps",
        save_steps=1000,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

Is there an error somewhere? After a few steps my training loss drops to zero and never goes up again. My data is the evolved instructions from WizardLM (Evol-Instruct) and my prompt format is Alpaca. [screenshot: training loss curve dropping to zero]
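Not part of the original script, but a minimal sketch of a guard that aborts the run as soon as the logged loss hits zero or goes NaN, using a standard transformers TrainerCallback (the class name ZeroLossGuard is made up for illustration):

import math
from transformers import TrainerCallback

class ZeroLossGuard(TrainerCallback):
    """Stop training as soon as the logged loss collapses to zero or becomes NaN."""
    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is not None and (loss == 0.0 or math.isnan(loss)):
            print(f"Stopping at step {state.global_step}: loss={loss}")
            control.should_training_stop = True
        return control

# usage: transformers.Trainer(..., callbacks=[ZeroLossGuard()])

This way the run stops at the first bad step instead of burning GPU hours on a collapsed loss.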

jocastrocUnal avatar Jun 02 '23 15:06 jocastrocUnal

same issue.

jwj7140 avatar Jun 04 '23 08:06 jwj7140

I also have the same problem. The loss first decreases, then it slowly grows till it drops to zero. The image below shows the training loss in the first epoch.

I used llama 7B. Could it be related to the learning rate? I used 3e-4.

[screenshot: training loss over the first epoch]

crux82 avatar Jun 04 '23 08:06 crux82


I also tried this training routine (guanaco 7B QLoRA):

python qlora.py \
    --model_name_or_path huggyllama/llama-7b \
    --output_dir ./output/guanaco-7b \
    --logging_steps 10 \
    --save_strategy steps \
    --data_seed 42 \
    --save_steps 500 \
    --save_total_limit 40 \
    --evaluation_strategy steps \
    --eval_dataset_size 1024 \
    --max_eval_samples 1000 \
    --per_device_eval_batch_size 1 \
    --max_new_tokens 32 \
    --dataloader_num_workers 3 \
    --group_by_length \
    --logging_strategy steps \
    --remove_unused_columns False \
    --do_train \
    --do_eval \
    --do_mmlu_eval \
    --lora_r 64 \
    --lora_alpha 16 \
    --lora_modules all \
    --double_quant \
    --quant_type nf4 \
    --bf16 \
    --bits 4 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type constant \
    --gradient_checkpointing \
    --dataset oasst1 \
    --source_max_len 16 \
    --target_max_len 512 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --max_steps 1875 \
    --eval_steps 187 \
    --learning_rate 0.0002 \
    --adam_beta2 0.999 \
    --max_grad_norm 0.3 \
    --lora_dropout 0.1 \
    --weight_decay 0.0 \
    --seed 0

I suspect "lr_scheduler_type" and the learning rate too. But the funny thing here is the steps: why is max_steps 1875 for the oasst data? That's only 20% of an epoch at most, right? (rough arithmetic below)
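As a rough sanity check on that number, note that each optimizer step consumes per_device_train_batch_size * gradient_accumulation_steps examples. Assuming the oasst1 training split used by qlora.py has on the order of 9-10k examples (an assumption; substitute your actual count):

# Back-of-the-envelope check of what 1875 steps really covers with the flags above.
dataset_size = 9800        # assumed approximate size of the oasst1 training split
per_device_batch = 1
grad_accum = 16
max_steps = 1875

examples_seen = max_steps * per_device_batch * grad_accum   # 30,000
print(f"examples seen: {examples_seen} (~{examples_seen / dataset_size:.1f} epochs)")

Counting raw optimizer steps against the dataset (1875 / 9800 ~ 19%) is where the "20% of an epoch" impression comes from, but with gradient accumulation those 1875 steps actually see roughly three epochs of data.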

jocastrocUnal avatar Jun 04 '23 13:06 jocastrocUnal

Try commenting out the optim="paged_adamw_8bit" line.

JeongChangsu avatar Jun 09 '23 08:06 JeongChangsu

@JeongChangsu thank you for the suggestion!

I tried commenting out that line, but it gave me some strange warnings. Your suggestion did, however, help me find this related issue: #78

I replaced optim="paged_adamw_8bit" with optim="paged_adamw_32bit".

Now it is running and I will tell you if it works.

Do you happen to know the difference between the two options?

crux82 avatar Jun 09 '23 21:06 crux82

What is the difference between optim="paged_adamw_8bit" and optim="paged_adamw_32bit"?

ChuanMeng avatar Jun 26 '23 15:06 ChuanMeng


does "paged_adamw_32bit" work?

ngavcc avatar Jul 18 '23 01:07 ngavcc

Yes, it worked for me.

crux82 avatar Jul 18 '23 08:07 crux82


Many thanks

ngavcc avatar Jul 18 '23 08:07 ngavcc

It doesn't work for me...

michelle-chou25 avatar Aug 02 '23 03:08 michelle-chou25

Do you happen to know the difference between the two options?

If my understanding is correct, the option 'paged_adamw_8bit' reduces the memory footprint of the optimizer by keeping its internal states in lower precision (8-bit). Basically, it aims to save memory in low-spec training environments.

The 8-bit quantization of those states can lose precision in the update computation, and that might be what eventually makes the training loss drop to zero.

On the other hand, 'paged_adamw_32bit' keeps the optimizer states in full 32-bit precision, which is essentially the normal memory-usage setting. But I don't know exactly; it needs more checking.
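For reference, a minimal sketch of the two settings side by side as they would be passed to TrainingArguments (output_dir is just a placeholder here):

from transformers import TrainingArguments

# 8-bit paged AdamW: optimizer states are quantized to 8 bits and can be paged
# out to CPU RAM under GPU memory pressure -> lowest memory, but quantization
# error in the Adam state.
args_8bit = TrainingArguments(output_dir="outputs", optim="paged_adamw_8bit")

# 32-bit paged AdamW: same paging behaviour, but the states stay in full fp32
# -> more optimizer memory, standard AdamW arithmetic.
args_32bit = TrainingArguments(output_dir="outputs", optim="paged_adamw_32bit")

The "paged" part refers to the optimizer states living in paged (unified) memory so they can spill to CPU RAM when the GPU runs out; the 8bit/32bit part only concerns the precision of those states.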

zayunsna avatar Aug 18 '23 00:08 zayunsna

It doesn't work for me...

It also didn't work for me. Have you figured it out?

yujiaw98 avatar Oct 18 '23 04:10 yujiaw98


Do you have any source on this 8-bit vs. 32-bit optimizer-state explanation? Any clue where I can read more? I've tried googling the names but had no luck.

NPap0 avatar Oct 18 '23 08:10 NPap0

I think it occurs because of fp16=True. I removed fp16=True and it worked. To speed things up, I used the tf32 option instead of fp16. But I don't know why it works. Can someone explain it to me?
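One possible explanation (a guess, not confirmed in this thread): the script sets bnb_4bit_compute_dtype=torch.bfloat16 but trains with fp16=True, and fp16's much narrower dynamic range can overflow or underflow during mixed-precision loss scaling until the loss collapses. Matching the Trainer's mixed precision to the compute dtype avoids that mismatch; a sketch, assuming an Ampere-or-newer GPU:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    optim="paged_adamw_32bit",
    bf16=True,   # bfloat16 has the same exponent range as fp32, so no loss-scaling overflow
    tf32=True,   # TensorFloat-32 matmuls only add speed; safe to combine with bf16
    logging_steps=1,
)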

jwj7140 avatar Oct 22 '23 11:10 jwj7140


Please check this one:

A major bug in the 8-bit optimizers that could cause instabilities later in training has been fixed. Please update bitsandbytes to 0.41.1 via pip install -U bitsandbytes. The 8-bit optimizer should now reproduce 32-bit optimizer performance again.
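A quick way to confirm what is actually installed after upgrading:

import bitsandbytes as bnb
print(bnb.__version__)  # expect 0.41.1 or newer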

Best regards,

Shuyue Nov. 25th, 2023

SuperBruceJia avatar Nov 26 '23 03:11 SuperBruceJia