
Lora downcasting issue

Open kiddyboots216 opened this issue 10 months ago • 18 comments

When creating a PEFT model and then trying to train it, we get an error:

  File "/scratch/gpfs/ashwinee/unsloth/unsloth/kernels/fast_lora.py", line 106, in backward                  
    d_downA = h.t() @ (dY @ downB.t())
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::BFloat16 != float

I suspect this is what the recent Lora downcasting fix PR was addressing. However, I'm still getting an error because dY is bfloat16 and downB is float32 (which it was coerced to in prepare_for_kbit_training).
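
For reference, the dtype clash can be reproduced in isolation; a minimal sketch (shapes are made up, only the dtypes matter):

import torch

# dY arrives in bfloat16 while downB (the LoRA B weight) is kept in float32,
# so the matmul in the custom backward fails. Shapes here are arbitrary.
dY    = torch.randn(4, 16, dtype = torch.bfloat16)
downB = torch.randn(8, 16, dtype = torch.float32)

try:
    tmp = dY @ downB.t()
except RuntimeError as e:
    print(e)   # "expected mat1 and mat2 to have the same dtype ..." (exact wording varies by device/op)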

kiddyboots216 avatar Apr 10 '24 00:04 kiddyboots216

@kiddyboots216 Are you using bf16 = True or fp16 = True in the Trainer?

danielhanchen avatar Apr 10 '24 02:04 danielhanchen

from unsloth import FastMistralModel
model, tokenizer = FastMistralModel.from_pretrained(
    args.model_path, 
    max_seq_length=512, 
    dtype=torch.bfloat16, 
    load_in_4bit=False, 
    attn_implementation="flash_attention_2", 
    device_map='auto', 
    use_cache=False
    )
    
model = FastMistralModel.get_peft_model(
  model,
  r = 8,
  target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj",],
  lora_alpha = 16,
  lora_dropout = 0, # Dropout = 0 is currently optimized
  bias = "none",    # Bias = "none" is currently optimized
  use_gradient_checkpointing = False,
  random_state = 3407,
  max_seq_length=512,
  use_rslora=False,
  loftq_config=None
)

kiddyboots216 avatar Apr 10 '24 02:04 kiddyboots216

@kiddyboots216 Oh wait, use FastLanguageModel. Also, you can copy-paste our Colab notebook if that works: https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing

danielhanchen avatar Apr 10 '24 02:04 danielhanchen

For a full example:

from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!
# Get LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files = {"train" : url}, split = "train")

# Load Llama model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit", # Supports Llama, Mistral - replace this!
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = max_seq_length,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 60,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_8bit",
        seed = 3407,
    ),
)
trainer.train()

danielhanchen avatar Apr 10 '24 02:04 danielhanchen

Thanks, I was using FastLanguageModel initially and was just using FastMistralModel for debugging. And the only part of SFTTrainer that seems to actually be happening (because the error is on the first backward) is calling backward on the logit loss.

If the error isn't reproducible, I can write up a minimal working example to reproduce it. I figured it's something you're aware of since I see the closed PR to add support for Lora downcasting.

kiddyboots216 avatar Apr 10 '24 02:04 kiddyboots216

Oh, that's actually upcasting!! So A and B were incorrectly left in float16, causing incorrect training runs

danielhanchen avatar Apr 10 '24 03:04 danielhanchen

Gotcha. If we look at

temp = (dY @ downB.t())

Then the error indicates that downB is a float32 (which is correct) but dY is a bfloat16. Should it be a float32?

kiddyboots216 avatar Apr 10 '24 03:04 kiddyboots216

@kiddyboots216 Ohh no, so what we're doing is correct. It seems like you're not using mixed precision for training (fp16 = True or bf16 = True)

danielhanchen avatar Apr 10 '24 03:04 danielhanchen

Sorry, typo - I meant "dY is a bfloat16" (from the original error message).

kiddyboots216 avatar Apr 10 '24 03:04 kiddyboots216

So @danielhanchen, making sure I understand this correctly:

  • the training code works with PEFT LoRA because it downcasts everything to bfloat16
  • in Unsloth PEFT we override LoraLayer.update_layer to skip the "self.to(weight.dtype)" cast, so the LoRA weights stay in float32 (a quick dtype check is sketched after this list)
  • this results in an error where downB is float32 (which is intentional) but dY is bfloat16 (I'm guessing this is not intentional?)
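
A quick way to check the second point (illustrative only, assuming model is the PEFT model from the snippets above):

# After get_peft_model, the LoRA adapters should be float32 while the frozen
# base weights stay in the load dtype (bfloat16 / 4-bit).
for name, param in model.named_parameters():
    if "lora_A" in name or "lora_B" in name:
        print(name, param.dtype)   # expected: torch.float32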

kiddyboots216 avatar Apr 10 '24 04:04 kiddyboots216

I'm currently getting this error with peft=0.10.0 and installing unsloth from source (git clone, pip install -e .)

Here's the stacktrace:

==((====))==  Unsloth: Fast Mistral patching release 2024.4                                                             
   \\   /|    GPU: NVIDIA A100 80GB PCIe. Max memory: 79.318 GB. Platform = Linux.                                      
O^O/ \_/ \    Pytorch: 2.2.2. CUDA = 8.0. CUDA Toolkit = 12.1.                                                          
\        /    Bfloat16 = TRUE. Xformers = 0.0.25.post1. FA = True.                                                      
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:13<00:00,  6.86s/it]
Unsloth 2024.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
File "/scratch/gpfs/ashwinee/alignment-durability/rlaif/filter_dataset_2.py", line 463, in save_gradient_norms       
    loss.mean().backward()
  File "/scratch/gpfs/ashwinee/envs/unsloth/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward      
    torch.autograd.backward(
  File "/scratch/gpfs/ashwinee/envs/unsloth/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass                     
  File "/scratch/gpfs/ashwinee/envs/unsloth/lib/python3.10/site-packages/torch/autograd/function.py", line 289, in apply
    return user_fn(self, *args)
  File "/scratch/gpfs/ashwinee/envs/unsloth/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 319, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)                                                         
  File "/scratch/gpfs/ashwinee/envs/unsloth/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass                     
  File "/scratch/gpfs/ashwinee/envs/unsloth/lib/python3.10/site-packages/torch/autograd/function.py", line 289, in apply
    return user_fn(self, *args)
  File "/scratch/gpfs/ashwinee/envs/unsloth/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 142, in
decorate_bwd
    return bwd(*args, **kwargs)
  File "/scratch/gpfs/ashwinee/unsloth/unsloth/kernels/fast_lora.py", line 106, in backward                            
    d_downA = h.t() @ (dY @ downB.t())
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::BFloat16 != float

and the full code:

from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    args.model_path,
    max_seq_length=512,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 8,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Dropout = 0 is currently optimized
    bias = "none",    # Bias = "none" is currently optimized
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length=512,
    use_rslora=False,
    loftq_config=None,
)
loss = -model(batch).logits.to(torch.float32)
loss.mean().backward()

kiddyboots216 avatar Apr 10 '24 19:04 kiddyboots216

@kiddyboots216

For training, dY is in bfloat16. LoRA A and B must be in float32. This is for mixed precision training.

The code you provided will not run at all, because you are upcasting the loss to torch.float32, and not doing mixed precision training. Wrap your code with

with torch.cuda.amp.autocast(dtype = torch.bfloat16):
    loss = -model(batch).logits.to(torch.float32)
    loss.mean().backward()
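
To make the dtype rule concrete, here is a standalone sketch (assuming a CUDA device with bfloat16 support; names are illustrative): under autocast, PyTorch casts the float32 weight to bfloat16 for the matmul, so the pairing that failed in the original traceback works.

import torch

x = torch.randn(4, 16, device = "cuda", dtype = torch.bfloat16)  # bf16 activation / incoming gradient
w = torch.randn(16, 8,  device = "cuda", dtype = torch.float32)  # fp32 master weight, like LoRA A/B

with torch.cuda.amp.autocast(dtype = torch.bfloat16):
    y = x @ w          # works: autocast casts w down to bf16 for the matmul
    print(y.dtype)     # torch.bfloat16

try:
    y = x @ w          # outside autocast: bf16 @ fp32 raises the RuntimeError above
except RuntimeError as e:
    print(e)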

danielhanchen avatar Apr 11 '24 09:04 danielhanchen

For my training code (I did not use the huggingface trainer):

model, tokenizer = FastLanguageModel.from_pretrained("xxxx", dtype=getattr(torch, 'bfloat16'),
                                                     max_seq_length=768, load_in_4bit=True)
model = FastLanguageModel.get_peft_model(model, r=64, lora_alpha=16, lora_dropout=0, bias="none",
                                         random_state=32,
                                         target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"], use_dora=False)


If I set lora_dropout to 0.05 without amp, the training code works well. If I set lora_dropout to 0 without amp, it errors out with:

 d_downA = h.t() @ (dY @ downB.t())
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::BFloat16 != float

If I set lora_dropout to 0 and use torch.cuda.amp.autocast, it errors out with:

 "_amp_foreach_non_finite_check_and_unscale_cuda" not implemented for 'BFloat16'

world2vec avatar May 17 '24 01:05 world2vec

I am also facing the same issues. I experimented with what @world2vec posted and can confirm: with dropout the training runs, with lora_dropout=0 it does not. Tested with float16, bfloat16, and float32. Also, this issue appears to be specific to the Volta architecture. When running on a V100 I cannot make the training work with lora_dropout=0 (even with a proper installation inside a container). However, on an RTX 3090 it runs with no issues.
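
One thing that may matter on Volta (an assumption on my part: V100s are compute capability 7.0 and lack native bfloat16, which is an Ampere-and-newer feature), since several snippets above pick bf16 automatically:

import torch

# On a V100 this is expected to report (7, 0) and no bf16 support,
# whereas an RTX 3090 reports (8, 6) and supports bf16.
print(torch.cuda.get_device_capability())
print(torch.cuda.is_bf16_supported())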

BrunoBSM avatar Aug 09 '24 17:08 BrunoBSM

@BrunoBSM Wait so does normal Unsloth work on V100s? T4s work for now.

@world2vec Apologies for the delay - this got lost! When dropout = 0, Unsloth calls the optimized fast paths - it seems like the autocasting isn't propagating correctly, weirdly. If this is a custom PyTorch trainer, presumably the autocast call wasn't used correctly somewhere, but I'm unsure, sorry
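
If it helps debug the propagation question above, one way to check whether autocast is actually active where the patched LoRA layers run (the module path below is hypothetical; pick any patched layer of your model):

import torch

def report_autocast(module, inputs):
    # Prints whether a CUDA autocast region is active when this layer's forward runs.
    print("autocast enabled:", torch.is_autocast_enabled())

# Hypothetical path into the model; adjust to your architecture.
handle = model.model.layers[0].mlp.down_proj.register_forward_pre_hook(report_autocast)
# ... run one training step and read the printout ...
handle.remove()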

danielhanchen avatar Aug 10 '24 02:08 danielhanchen

@danielhanchen My case is on an RTX 4090. torch AMP with float16 works well; it does not work for bfloat16.

world2vec avatar Aug 10 '24 04:08 world2vec

@danielhanchen I am not sure what you mean by normal Unsloth, though I have not been able to make it work on the V100 with lora_dropout = 0. If I set the dropout to anything > 0, I get the warning about a performance drop, but training does run.

BrunoBSM avatar Aug 10 '24 23:08 BrunoBSM

Ye, so dropout = 0 is optimized, but anything else is not - it still runs correctly.

@world2vec sadly I'm unsure why your RTX 4090 isn't working sorry :(

danielhanchen avatar Aug 13 '24 06:08 danielhanchen