
Loss drops to zero

jocastrocUnal opened this issue · 15 comments

Hi, I'm training "TheBloke/guanaco-7B-HF" with QLoRA. I have almost the same script as the Colab tutorial:

import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"

import torch
import wandb
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "TheBloke/guanaco-7B-HF"
# model_id = "openlm-research/open_llama_7b_700bt_preview"
# model_id = "openlm-research/open_llama_3b_600bt_preview"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = (
    0  # unk. we want this to be different from the eos token
)
tokenizer.padding_side = "left"
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})

from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8, 
    lora_alpha=32, 
    target_modules=["q_proj","k_proj","v_proj"], 
    lora_dropout=0.05, 
    bias="none", 
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

import transformers

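# Note: `data` used below is assumed to be a tokenized DatasetDict with "train"
# and "test" splits, prepared earlier in the notebook (not shown in this snippet).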
trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    eval_dataset=data['test'],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=data["train"].num_rows,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",
        # load_best_model_at_end=True,
        # evaluation_strategy="steps",
        # save_strategy="steps",
        save_steps=1000,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

Is there an error somewhere? After a few steps my training loss drops to zero and never goes up again. My data is the evolved instructions from WizardLM (Evol-Instruct) and my prompt format is Alpaca. [screenshot: training loss curve dropping to zero]
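Not part of the original script, but a minimal sketch of a guard that aborts the run as soon as the logged loss hits zero or goes NaN, using a standard transformers TrainerCallback (the class name ZeroLossGuard is made up for illustration):

import math
from transformers import TrainerCallback

class ZeroLossGuard(TrainerCallback):
    """Stop training as soon as the logged loss collapses to zero or becomes NaN."""
    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is not None and (loss == 0.0 or math.isnan(loss)):
            print(f"Stopping at step {state.global_step}: loss={loss}")
            control.should_training_stop = True
        return control

# usage: transformers.Trainer(..., callbacks=[ZeroLossGuard()])

This way the run stops at the first bad step instead of burning GPU hours on a collapsed loss.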

jocastrocUnal avatar Jun 02 '23 15:06 jocastrocUnal

same issue.

jwj7140 avatar Jun 04 '23 08:06 jwj7140

I also have the same problem. The loss first decreases, then it slowly grows till it drops to zero. The image below shows the training loss in the first epoch.

I used llama 7B. Could it be related to the learning rate? I used 3e-4.

[screenshot: training loss over the first epoch]

crux82 avatar Jun 04 '23 08:06 crux82


I also tried this training routine (guanaco 7B QLoRA):

python qlora.py \
    --model_name_or_path huggyllama/llama-7b \
    --output_dir ./output/guanaco-7b \
    --logging_steps 10 \
    --save_strategy steps \
    --data_seed 42 \
    --save_steps 500 \
    --save_total_limit 40 \
    --evaluation_strategy steps \
    --eval_dataset_size 1024 \
    --max_eval_samples 1000 \
    --per_device_eval_batch_size 1 \
    --max_new_tokens 32 \
    --dataloader_num_workers 3 \
    --group_by_length \
    --logging_strategy steps \
    --remove_unused_columns False \
    --do_train \
    --do_eval \
    --do_mmlu_eval \
    --lora_r 64 \
    --lora_alpha 16 \
    --lora_modules all \
    --double_quant \
    --quant_type nf4 \
    --bf16 \
    --bits 4 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type constant \
    --gradient_checkpointing \
    --dataset oasst1 \
    --source_max_len 16 \
    --target_max_len 512 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --max_steps 1875 \
    --eval_steps 187 \
    --learning_rate 0.0002 \
    --adam_beta2 0.999 \
    --max_grad_norm 0.3 \
    --lora_dropout 0.1 \
    --weight_decay 0.0 \
    --seed 0

I suspect "lr_scheduler_type" and the learning rate too. But the funny thing here is the steps: why is max_steps 1875 for the oasst data? That's only 20% of an epoch at most, right? (rough arithmetic below)
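As a rough sanity check on that number, note that each optimizer step consumes per_device_train_batch_size * gradient_accumulation_steps examples. Assuming the oasst1 training split used by qlora.py has on the order of 9-10k examples (an assumption; substitute your actual count):

# Back-of-the-envelope check of what 1875 steps really covers with the flags above.
dataset_size = 9800        # assumed approximate size of the oasst1 training split
per_device_batch = 1
grad_accum = 16
max_steps = 1875

examples_seen = max_steps * per_device_batch * grad_accum   # 30,000
print(f"examples seen: {examples_seen} (~{examples_seen / dataset_size:.1f} epochs)")

Counting raw optimizer steps against the dataset (1875 / 9800 ~ 19%) is where the "20% of an epoch" impression comes from, but with gradient accumulation those 1875 steps actually see roughly three epochs of data.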

jocastrocUnal avatar Jun 04 '23 13:06 jocastrocUnal

Try commenting out the optim="paged_adamw_8bit" line.

JeongChangsu avatar Jun 09 '23 08:06 JeongChangsu

@JeongChangsu thank you for the suggestion!

I tried commenting out that line, but it gave me some strange warnings. Your suggestion did, however, help me find this related issue: #78

I replaced optim="paged_adamw_8bit" with optim="paged_adamw_32bit".

Now it is running and I will tell you if it works.

Do you happen to know the difference between the two options?

crux82 avatar Jun 09 '23 21:06 crux82

What is the difference between optim="paged_adamw_8bit" and optim="paged_adamw_32bit"?

ChuanMeng avatar Jun 26 '23 15:06 ChuanMeng


does "paged_adamw_32bit" work?

ngavcc avatar Jul 18 '23 01:07 ngavcc

Yes, it worked for me.

crux82 avatar Jul 18 '23 08:07 crux82


Many thanks

ngavcc avatar Jul 18 '23 08:07 ngavcc

It doesn't work for me...

michelle-chou25 avatar Aug 02 '23 03:08 michelle-chou25

Do you happen to know the difference between the two options?

If my understanding is correct, the option 'paged_adamw_8bit' reduces the memory footprint of the optimizer by keeping its internal states in lower precision (8-bit). Basically, it aims to save memory in low-spec training environments.

The 8-bit quantization of those states can lose precision in the update computation, and that might be what eventually makes the training loss drop to zero.

On the other hand, 'paged_adamw_32bit' keeps the optimizer states in full 32-bit precision, which is essentially the normal memory-usage setting. But I don't know exactly; it needs more checking.
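For reference, a minimal sketch of the two settings side by side as they would be passed to TrainingArguments (output_dir is just a placeholder here):

from transformers import TrainingArguments

# 8-bit paged AdamW: optimizer states are quantized to 8 bits and can be paged
# out to CPU RAM under GPU memory pressure -> lowest memory, but quantization
# error in the Adam state.
args_8bit = TrainingArguments(output_dir="outputs", optim="paged_adamw_8bit")

# 32-bit paged AdamW: same paging behaviour, but the states stay in full fp32
# -> more optimizer memory, standard AdamW arithmetic.
args_32bit = TrainingArguments(output_dir="outputs", optim="paged_adamw_32bit")

The "paged" part refers to the optimizer states living in paged (unified) memory so they can spill to CPU RAM when the GPU runs out; the 8bit/32bit part only concerns the precision of those states.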

zayunsna avatar Aug 18 '23 00:08 zayunsna

It doesn't work for me...

It also didn't work for me. Have you figured it out?

yujiaw98 avatar Oct 18 '23 04:10 yujiaw98


Do you have any source on this 8-bit vs. 32-bit optimizer-state explanation? Any clue where I can read more? I've tried googling the names but had no luck.

NPap0 avatar Oct 18 '23 08:10 NPap0

I think it occurs because of fp16=True. I removed fp16=True and it worked. To speed things up, I used the tf32 option instead of fp16. But I don't know why it works. Can someone explain it to me?
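One possible explanation (a guess, not confirmed in this thread): the script sets bnb_4bit_compute_dtype=torch.bfloat16 but trains with fp16=True, and fp16's much narrower dynamic range can overflow or underflow during mixed-precision loss scaling until the loss collapses. Matching the Trainer's mixed precision to the compute dtype avoids that mismatch; a sketch, assuming an Ampere-or-newer GPU:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    optim="paged_adamw_32bit",
    bf16=True,   # bfloat16 has the same exponent range as fp32, so no loss-scaling overflow
    tf32=True,   # TensorFloat-32 matmuls only add speed; safe to combine with bf16
    logging_steps=1,
)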

jwj7140 avatar Oct 22 '23 11:10 jwj7140


Please check this one:

A major bug in the 8-bit optimizers that could cause instabilities later in training has been fixed. Please update bitsandbytes to 0.41.1 via pip install -U bitsandbytes. The 8-bit optimizer should now reproduce 32-bit optimizer performance again.
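A quick way to confirm what is actually installed after upgrading:

import bitsandbytes as bnb
print(bnb.__version__)  # expect 0.41.1 or newer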

Best regards,

Shuyue Nov. 25th, 2023

SuperBruceJia avatar Nov 26 '23 03:11 SuperBruceJia