`model.eval(); model.train()` makes backward pass undifferentiable.
If pad tokens are used and `model.eval(); model.train()` is called, the Unsloth backward pass becomes undifferentiable, resulting in NaN.
Reproduction script:
```python
import torch
from transformers import AutoTokenizer
from trl.trainer.utils import peft_module_casting_to_bf16
from unsloth import FastLanguageModel

base_model_uri = "HuggingFaceH4/mistral-7b-sft-beta"

tokenizer = AutoTokenizer.from_pretrained(
    base_model_uri,
    padding_side="left",
    trust_remote_code=True,
)
tokenizer.add_special_tokens({"pad_token": "[PAD]"})

model, _ = FastLanguageModel.from_pretrained(
    model_name=base_model_uri,
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=64,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    use_gradient_checkpointing="unsloth",
    max_seq_length=2048,
)
peft_module_casting_to_bf16(model)

query_responses = torch.tensor(
[[ 523, 28766, 1838, 28766, 28767, 13, 5238, 264, 752, 28710,
28733, 7971, 2838, 684, 264, 2071, 302, 15311, 266, 1228,
28713, 693, 1580, 4602, 8599, 477, 396, 17054, 18793, 28723,
2, 28705, 13, 28789, 28766, 489, 11143, 28766, 28767, 13,
1313, 553, 750, 28705, 28787, 3370, 1854, 272, 389, 12306,
16787, 356, 813, 9873, 304, 378, 3593, 737, 264, 21229,
10301, 28723, 415, 11296, 477, 12859, 2764, 8658, 28742, 28707,
1368, 2032, 28725, 304, 736, 8658, 28742, 28707, 1287, 302,
],
[32000, 32000, 32000, 32000, 32000, 32000, 523, 28766, 1838, 28766,
28767, 13, 3998, 264, 910, 28733, 532, 8327, 356, 3667,
264, 3588, 25009, 607, 1253, 1413, 1010, 670, 2164, 28723,
2, 28705, 13, 28789, 28766, 489, 11143, 28766, 28767, 13,
2198, 5514, 10352, 298, 8670, 1012, 1370, 28725, 624, 302,
272, 1080, 14714, 8670, 1339, 478, 460, 6252, 349, 272,
4099, 302, 25009, 607, 8300, 28723, 4023, 590, 993, 459,
347, 5894, 14573, 2783, 28725, 590, 460, 2719, 7887, 28725,
]],
device='cuda:0')
context_length = 40


def get_loss_and_backwards(model, calc_vpred=False, train_eval=False):
    # simple loss function to demonstrate the issue
    if train_eval:
        model.eval()
        model.train()
    with torch.cuda.amp.autocast(dtype=torch.bfloat16):
        attention_mask = query_responses != tokenizer.pad_token_id
        input_ids = torch.masked_fill(query_responses, ~attention_mask, 0)
        output = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            return_dict=True,
            output_hidden_states=True,
            use_cache=False,
        )
        loss = output.logits.mean()
        loss.backward()


if __name__ == "__main__":
    with torch.autograd.detect_anomaly():
        get_loss_and_backwards(model, train_eval=True)
```
Overview:

- `get_loss_and_backwards(model, train_eval=False)` works
- `get_loss_and_backwards(model, train_eval=True)` fails with:
File "/root/repro.py", line 86, in <module>
get_loss_and_backwards(model, train_eval=True)
File "/root/repro.py", line 81, in get_loss_and_backwards
loss.backward()
File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
torch.autograd.backward(
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
_engine_run_backward(
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply
return user_fn(self, *args)
File "/opt/conda/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 142, in decorate_bwd
return bwd(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/unsloth/models/_utils.py", line 399, in backward
torch.autograd.backward(output, dY)
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
_engine_run_backward(
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'LoRA_MLPBackward' returned nan values in its 0th output.
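As a stopgap while this is being investigated, here is a minimal workaround sketch. It reuses `output` and `attention_mask` from the reproduction script above and assumes the NaNs are tied to the padded positions; it only changes the toy loss, not the underlying backward:

```python
# Hedged workaround sketch: reduce the toy loss over non-pad positions only,
# so the padded tokens never enter the mean. This does not fix the backward
# itself; whether it avoids the NaN depends on where the NaN originates.
loss = output.logits[attention_mask].mean()  # selects [num_non_pad_tokens, vocab_size]
loss.backward()
```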
I use the following decorator to speed up generation when training with Hugging Face AutoModels. It'd be great if this decorator worked with Unsloth as well! (I noticed a 3x speedup in Unsloth generation with `model.eval()` set.)
```python
from contextlib import ContextDecorator


class fast_eval_mode(ContextDecorator):
    """
    Convert to model.eval(), then revert to the previous state.

    Behavior:
    - DOESN'T disable grad
    - Disables dropout layers
    - Freezes BatchNorm
    """

    def __init__(self, model):
        self.model = model

    def __enter__(self):
        self.was_training = self.model.training
        if self.was_training:
            self.model.eval()

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self.was_training:
            self.model.train()
```
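For illustration, this is how I wrap generation with it; `prompt_ids` below is a placeholder for a batch of prompt token IDs rather than anything from the script above:

```python
# Illustrative usage: dropout is off inside the block, gradients are untouched,
# and the model's previous train/eval state is restored on exit.
with fast_eval_mode(model), torch.no_grad():
    generated = model.generate(input_ids=prompt_ids, max_new_tokens=64)
```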
Interesting and thanks for the investigation!
Hi, I'm working on a project which requires fine-tuning llama-3-8b-Instruct-bnb-4bit on a custom dataset. However, when I increase `per_device_batch_size` from 1 to any larger value, I hit the same error: `RuntimeError: Function 'LoRA_MLPBackward' returned nan values in its 0th output.` My notebook is very similar to the notebooks provided by Unsloth. I read through the issues, but my knowledge was not enough to understand the error. Is there anything I can do to fix this problem?
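For context, here is a minimal sketch of what I think happens once the batch size goes above 1: examples of different lengths get padded to a common length, so pad tokens reach the model just like in the reproduction script above. The texts below are made up for illustration, and `tokenizer` is the one from that script:

```python
# Illustration only: with per_device_batch_size > 1, variable-length examples
# are padded, so pad tokens enter the forward pass.
texts = ["a short prompt", "a much longer prompt that needs many more tokens"]
batch = tokenizer(texts, padding=True, return_tensors="pt")
print(batch["input_ids"])       # the shorter row is filled with the pad token id
print(batch["attention_mask"])  # zeros mark the padded positions
```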
Oh apologies, I have not solved this issue yet - I'll try to take a look again, but I can't guarantee anything, sorry :(
Thank you for your attention and time.
I met the same problem: whenever I set `model.train()` and use left padding, the loss is not differentiable and the logits at the padded positions are all zero.
I also have no idea why the loss cannot be backpropagated even when I do not set `model.train()`. I tried this simple example on the provided Llama 3.1 Colab, and the error is this:
Could anyone tell me what I am doing wrong, and why my Llama loss cannot be backpropagated even in this simple case?
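For reference, here is the kind of minimal check I mean; it reuses `model`, `tokenizer`, and the left-padded `query_responses` batch from the reproduction script at the top of this issue, and is an illustration rather than my actual Colab code:

```python
# Illustration: look at the logits produced at the left-padded positions.
model.train()
attention_mask = query_responses != tokenizer.pad_token_id
with torch.no_grad():
    out = model(input_ids=query_responses, attention_mask=attention_mask)
pad_logits = out.logits[~attention_mask]  # logits at pad positions only
print(pad_logits.abs().max())             # these come out as all zeros here
```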
I'll investigate your new issue!
> I met the same problem: whenever I set `model.train()` and use left padding, the loss is not differentiable and the logits at the padded positions are all zero.
> I also have no idea why the loss cannot be backpropagated even when I do not set `model.train()`. I tried this simple example on the provided Llama 3.1 Colab, and the error is this:
> Could anyone tell me what I am doing wrong, and why my Llama loss cannot be backpropagated even in this simple case?
Hi, are you still having the issue? :)