LoRA downcasting issue
When creating a PEFT model and then trying to train it, we get an error:
File "/scratch/gpfs/ashwinee/unsloth/unsloth/kernels/fast_lora.py", line 106, in backward
d_downA = h.t() @ (dY @ downB.t())
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::BFloat16 != float
I suspect this is what the recent LoRA downcasting fix PR was addressing. However, I'm still getting an error because dY is bfloat16 while downB is float32 (which it was coerced to in prepare_for_kbit_training).
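For context, a toy snippet along these lines (an illustration only, not Unsloth's actual kernel) reproduces the same RuntimeError and shows how mixed precision autocast makes the mixed-dtype matmul legal:

```python
import torch

# Made-up shapes, purely to illustrate the dtype mismatch from the traceback.
dY    = torch.randn(4, 8,  dtype = torch.bfloat16, device = "cuda")  # upstream grad in bf16
downB = torch.randn(16, 8, dtype = torch.float32,  device = "cuda")  # LoRA B kept in fp32

try:
    temp = dY @ downB.t()  # RuntimeError: expected mat1 and mat2 to have the same dtype
except RuntimeError as e:
    print(e)

# Under bf16 autocast, the fp32 operand is cast down for the matmul, so it works:
with torch.cuda.amp.autocast(dtype = torch.bfloat16):
    temp = dY @ downB.t()
print(temp.dtype)  # torch.bfloat16
```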
@kiddyboots216 Are you using bf16 = True or fp16 = True in the Trainer?
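That is, assuming the Hugging Face Trainer, the mixed precision flags live on TrainingArguments, e.g.:

```python
import torch
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir = "outputs",
    fp16 = not torch.cuda.is_bf16_supported(),  # float16 mixed precision on older GPUs
    bf16 = torch.cuda.is_bf16_supported(),      # bfloat16 mixed precision on Ampere and newer
)
```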
import torch
from unsloth import FastMistralModel

model, tokenizer = FastMistralModel.from_pretrained(
    args.model_path,
    max_seq_length = 512,
    dtype = torch.bfloat16,
    load_in_4bit = False,
    attn_implementation = "flash_attention_2",
    device_map = 'auto',
    use_cache = False
)
model = FastMistralModel.get_peft_model(
    model,
    r = 8,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Dropout = 0 is currently optimized
    bias = "none",    # Bias = "none" is currently optimized
    use_gradient_checkpointing = False,
    random_state = 3407,
    max_seq_length = 512,
    use_rslora = False,
    loftq_config = None
)
@kiddyboots216 Oh wait, use FastLanguageModel.
Also, you can copy-paste our Colab notebook if that works: https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing
For a full example:
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!

# Get LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files = {"train" : url}, split = "train")

# Load Llama model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit", # Supports Llama, Mistral - replace this!
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = max_seq_length,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 60,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_8bit",
        seed = 3407,
    ),
)
trainer.train()
Thanks, I was using FastLanguageModel initially and only switched to FastMistralModel for debugging. The only part of SFTTrainer that actually runs before the error (which hits on the first backward) is calling backward on the logit loss.
If the error isn't reproducible, I can write down a minimal working example for reproducibility. I figured it's something you're aware of, since I see the closed PR to add support for LoRA downcasting.
Oh that's actually upcasting!! So A and B were incorrect in float16, causing incorrect training runs.
Gotcha. If we look at
temp = (dY @ downB.t())
then the error indicates that downB is float32 (which is correct) but dY is bfloat16. Should it be float32?
@kiddyboots216 Ohh no, so what we're doing is correct. It seems like you're not using mixed precision for training (fp16 = True or bf16 = True).
Sorry, typo - I meant "dY is a bfloat16" (from the original error message).
So @danielhanchen, making sure I understand this correctly:
- the training code works with PEFT LoRA because it downcasts everything to bfloat16
- in Unsloth PEFT we override LoraLayer.update_layer to skip the "self.to(weight.dtype)", so the LoRA weights stay float32 (see the sketch after this list)
- this results in an error where downB is float32 (which is intentional) but dY is bfloat16 (I'm guessing this is not intentional?)
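A quick way to check the second point (an illustrative snippet only, assuming `model` is the output of get_peft_model and the usual PEFT parameter naming):

```python
# After get_peft_model, the LoRA adapter weights should be float32 even though
# the base model is bfloat16 / 4-bit.
for name, param in model.named_parameters():
    if "lora_A" in name or "lora_B" in name:
        print(name, param.dtype, param.requires_grad)
```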
I'm currently getting this error with peft==0.10.0 and installing unsloth from source (git clone, pip install -e .).
Here's the stack trace:
==((====))== Unsloth: Fast Mistral patching release 2024.4
\\ /| GPU: NVIDIA A100 80GB PCIe. Max memory: 79.318 GB. Platform = Linux.
O^O/ \_/ \ Pytorch: 2.2.2. CUDA = 8.0. CUDA Toolkit = 12.1.
\ / Bfloat16 = TRUE. Xformers = 0.0.25.post1. FA = True.
"-____-" Free Apache license: http://github.com/unslothai/unsloth
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:13<00:00, 6.86s/it]
Unsloth 2024.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
File "/scratch/gpfs/ashwinee/alignment-durability/rlaif/filter_dataset_2.py", line 463, in save_gradient_norms
loss.mean().backward()
File "/scratch/gpfs/ashwinee/envs/unsloth/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
torch.autograd.backward(
File "/scratch/gpfs/ashwinee/envs/unsloth/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/scratch/gpfs/ashwinee/envs/unsloth/lib/python3.10/site-packages/torch/autograd/function.py", line 289, in apply
return user_fn(self, *args)
File "/scratch/gpfs/ashwinee/envs/unsloth/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 319, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/scratch/gpfs/ashwinee/envs/unsloth/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/scratch/gpfs/ashwinee/envs/unsloth/lib/python3.10/site-packages/torch/autograd/function.py", line 289, in apply
return user_fn(self, *args)
File "/scratch/gpfs/ashwinee/envs/unsloth/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 142, in
decorate_bwd
return bwd(*args, **kwargs)
File "/scratch/gpfs/ashwinee/unsloth/unsloth/kernels/fast_lora.py", line 106, in backward
d_downA = h.t() @ (dY @ downB.t())
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::BFloat16 != float
and the full code:
import torch
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    args.model_path,
    max_seq_length = 512,
    load_in_4bit = True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 8,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Dropout = 0 is currently optimized
    bias = "none",    # Bias = "none" is currently optimized
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = 512,
    use_rslora = False,
    loftq_config = None
)

loss = -model(batch).logits.to(torch.float32)
loss.mean().backward()
@kiddyboots216
For training, dY is in bfloat16. LoRA A and B must be in float32. This is for mixed precision training.
The code you provided will not run at all, because you are upcasting the loss to torch.float32 and not doing mixed precision training. Wrap your code with:
with torch.cuda.amp.autocast(dtype = torch.bfloat16):
    loss = -model(batch).logits.to(torch.float32)
    loss.mean().backward()
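Putting that together, a minimal custom bf16 training step looks roughly like this (a sketch only; `dataloader`, `batch`, and the optimizer setup are placeholders, not from this thread):

```python
import torch

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr = 2e-4
)

for batch in dataloader:  # hypothetical dataloader yielding input_ids tensors
    # Forward pass under bf16 autocast, so fp32 LoRA weights and bf16
    # activations are cast consistently inside each matmul.
    with torch.cuda.amp.autocast(dtype = torch.bfloat16):
        loss = model(batch, labels = batch).loss

    # bf16 does not need gradient scaling, so a plain backward/step is fine.
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```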
For my training code (I did not use the Hugging Face trainer):

model, tokenizer = FastLanguageModel.from_pretrained("xxxx", dtype=getattr(torch, 'bfloat16'),
                                                     max_seq_length=768, load_in_4bit=True)
model = FastLanguageModel.get_peft_model(model, r=64, lora_alpha=16, lora_dropout=0, bias="none",
                                         random_state=32,
                                         target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"], use_dora=False)
If I set lora_dropout to 0.05 without AMP, the training code works well. If I set lora_dropout to 0 without AMP, it errors out:
d_downA = h.t() @ (dY @ downB.t())
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::BFloat16 != float
If I set lora_dropout to 0 and use torch.cuda.amp.autocast, it errors out with:
"_amp_foreach_non_finite_check_and_unscale_cuda" not implemented for 'BFloat16'
I am also facing the same issue. I experimented with what @world2vec posted and can confirm: with dropout the training runs, with lora_dropout = 0 it does not. Tested with float16, bfloat16, and float32. Also, this issue appears to be specific to the Volta architecture: on a V100 I cannot make training work with lora_dropout = 0 (even with a proper installation inside a container), but on an RTX 3090 it runs with no issues.
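For reference, the `"_amp_foreach_non_finite_check_and_unscale_cuda" not implemented for 'BFloat16'` error @world2vec hit usually comes from torch.cuda.amp.GradScaler being asked to unscale bfloat16 gradients; that unscale kernel is not implemented for bfloat16, and bf16 generally does not need loss scaling in the first place. A common pattern in a custom loop is to disable the scaler when bf16 is available (a sketch; `compute_loss`, `model`, `batch`, and `optimizer` are hypothetical placeholders):

```python
import torch

use_bf16 = torch.cuda.is_bf16_supported()
amp_dtype = torch.bfloat16 if use_bf16 else torch.float16

# GradScaler is only needed for float16; enabled=False makes it a pass-through.
scaler = torch.cuda.amp.GradScaler(enabled = not use_bf16)

with torch.cuda.amp.autocast(dtype = amp_dtype):
    loss = compute_loss(model, batch)  # hypothetical helper

scaler.scale(loss).backward()  # scaling is a no-op when the scaler is disabled
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
```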
@BrunoBSM Wait so does normal Unsloth work on V100s? T4s work for now.
@world2vec Apologies on the delay - this got lost! When dropout = 0, Unsloth calls the optimized fast paths - it seems like the autocasting isn't propagating correctly, weirdly. If this is a custom PyTorch trainer, presumably the autocast call wasn't used correctly somewhere, but I'm unsure, sorry.
@danielhanchen In my case it is an RTX 4090. Torch AMP with float16 works well; it does not work with bfloat16.
@danielhanchen I am not sure what you mean by normal Unsloth, but I have not been able to make it work on the V100 with lora_dropout = 0. If I set the dropout to anything > 0, I get the warning about a performance drop, but training does run.
Ye so dropout = 0 is optimized, but anything else is not - it still runs correctly.
@world2vec Sadly I'm unsure why your RTX 4090 isn't working, sorry :(