`model.eval(); model.train()` makes backward pass undifferentiable.
If pad tokens are used and `model.eval(); model.train()` is called, the Unsloth backward pass becomes undifferentiable, resulting in NaN.
Reproduction script:
```python
import torch
from transformers import AutoTokenizer
from trl.trainer.utils import peft_module_casting_to_bf16
from unsloth import FastLanguageModel

base_model_uri = "HuggingFaceH4/mistral-7b-sft-beta"

tokenizer = AutoTokenizer.from_pretrained(
    base_model_uri,
    padding_side="left",
    trust_remote_code=True,
)
tokenizer.add_special_tokens({"pad_token": "[PAD]"})

model, _ = FastLanguageModel.from_pretrained(
    model_name=base_model_uri,
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=64,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    use_gradient_checkpointing="unsloth",
    max_seq_length=2048,
)
peft_module_casting_to_bf16(model)

query_responses = torch.tensor(
[[ 523, 28766, 1838, 28766, 28767, 13, 5238, 264, 752, 28710,
28733, 7971, 2838, 684, 264, 2071, 302, 15311, 266, 1228,
28713, 693, 1580, 4602, 8599, 477, 396, 17054, 18793, 28723,
2, 28705, 13, 28789, 28766, 489, 11143, 28766, 28767, 13,
1313, 553, 750, 28705, 28787, 3370, 1854, 272, 389, 12306,
16787, 356, 813, 9873, 304, 378, 3593, 737, 264, 21229,
10301, 28723, 415, 11296, 477, 12859, 2764, 8658, 28742, 28707,
1368, 2032, 28725, 304, 736, 8658, 28742, 28707, 1287, 302,
],
[32000, 32000, 32000, 32000, 32000, 32000, 523, 28766, 1838, 28766,
28767, 13, 3998, 264, 910, 28733, 532, 8327, 356, 3667,
264, 3588, 25009, 607, 1253, 1413, 1010, 670, 2164, 28723,
2, 28705, 13, 28789, 28766, 489, 11143, 28766, 28767, 13,
2198, 5514, 10352, 298, 8670, 1012, 1370, 28725, 624, 302,
272, 1080, 14714, 8670, 1339, 478, 460, 6252, 349, 272,
4099, 302, 25009, 607, 8300, 28723, 4023, 590, 993, 459,
347, 5894, 14573, 2783, 28725, 590, 460, 2719, 7887, 28725,
]],
device='cuda:0')
context_length = 40


def get_loss_and_backwards(model, calc_vpred=False, train_eval=False):
    # simple loss function to demonstrate the issue
    if train_eval:
        model.eval()
        model.train()
    with torch.cuda.amp.autocast(dtype=torch.bfloat16):
        attention_mask = query_responses != tokenizer.pad_token_id
        input_ids = torch.masked_fill(query_responses, ~attention_mask, 0)
        output = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            return_dict=True,
            output_hidden_states=True,
            use_cache=False,
        )
        loss = output.logits.mean()
        loss.backward()


if __name__ == "__main__":
    with torch.autograd.detect_anomaly():
        get_loss_and_backwards(model, train_eval=True)
```
Overview:

- `get_loss_and_backwards(model, train_eval=False)` works
- `get_loss_and_backwards(model, train_eval=True)` fails with:
File "/root/repro.py", line 86, in <module>
get_loss_and_backwards(model, train_eval=True)
File "/root/repro.py", line 81, in get_loss_and_backwards
loss.backward()
File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
torch.autograd.backward(
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
_engine_run_backward(
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply
return user_fn(self, *args)
File "/opt/conda/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 142, in decorate_bwd
return bwd(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/unsloth/models/_utils.py", line 399, in backward
torch.autograd.backward(output, dY)
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
_engine_run_backward(
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'LoRA_MLPBackward' returned nan values in its 0th output.
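As a stopgap while this is being investigated, here is a minimal workaround sketch. It reuses `output` and `attention_mask` from the reproduction script above and assumes the NaNs are tied to the padded positions; it only changes the toy loss, not the underlying backward:

```python
# Hedged workaround sketch: reduce the toy loss over non-pad positions only,
# so the padded tokens never enter the mean. This does not fix the backward
# itself; whether it avoids the NaN depends on where the NaN originates.
loss = output.logits[attention_mask].mean()  # selects [num_non_pad_tokens, vocab_size]
loss.backward()
```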
I use the following decorator to speed up generation when training with Hugging Face AutoModels. It'd be great if this decorator worked with Unsloth as well! (I noticed a 3x speedup in Unsloth generation with `model.eval()` set.)
```python
from contextlib import ContextDecorator


class fast_eval_mode(ContextDecorator):
    """
    Convert to model.eval(), then revert to the previous state.

    Behavior:
    - DOESN'T disable grad
    - Disables dropout layers
    - Freezes BatchNorm
    """

    def __init__(self, model):
        self.model = model

    def __enter__(self):
        self.was_training = self.model.training
        if self.was_training:
            self.model.eval()

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self.was_training:
            self.model.train()
```
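For illustration, this is how I wrap generation with it; `prompt_ids` below is a placeholder for a batch of prompt token IDs rather than anything from the script above:

```python
# Illustrative usage: dropout is off inside the block, gradients are untouched,
# and the model's previous train/eval state is restored on exit.
with fast_eval_mode(model), torch.no_grad():
    generated = model.generate(input_ids=prompt_ids, max_new_tokens=64)
```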
Interesting and thanks for the investigation!
Hi, I'm working on a project which requires fine-tuning llama-3-8b-Instruct-bnb-4bit on a custom dataset. However, when I increase `per_device_batch_size` from 1 to any larger value, I hit the same error: `RuntimeError: Function 'LoRA_MLPBackward' returned nan values in its 0th output.` My notebook is very similar to the notebooks provided by Unsloth. I read through the issues, but my knowledge was not enough to understand the error. Is there anything I can do to fix this problem?
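For context, here is a minimal sketch of what I think happens once the batch size goes above 1: examples of different lengths get padded to a common length, so pad tokens reach the model just like in the reproduction script above. The texts below are made up for illustration, and `tokenizer` is the one from that script:

```python
# Illustration only: with per_device_batch_size > 1, variable-length examples
# are padded, so pad tokens enter the forward pass.
texts = ["a short prompt", "a much longer prompt that needs many more tokens"]
batch = tokenizer(texts, padding=True, return_tensors="pt")
print(batch["input_ids"])       # the shorter row is filled with the pad token id
print(batch["attention_mask"])  # zeros mark the padded positions
```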
Oh apologies, I have not solved this issue yet - I'll try to take a look again, but I can't guarantee anything, sorry :(
Thank you for your attention and time.
I met the same problem: whenever I set `model.train()` and use left padding, the loss is not differentiable and the logits at the padded positions are all zero.
I also have no idea why the loss cannot be backpropagated even when I do not set `model.train()`. I tried this simple example on the provided Llama 3.1 Colab, and the error is this:
Could anyone tell me what I am doing wrong, and why my Llama loss cannot be backpropagated even in this simple case?
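For reference, here is the kind of minimal check I mean; it reuses `model`, `tokenizer`, and the left-padded `query_responses` batch from the reproduction script at the top of this issue, and is an illustration rather than my actual Colab code:

```python
# Illustration: look at the logits produced at the left-padded positions.
model.train()
attention_mask = query_responses != tokenizer.pad_token_id
with torch.no_grad():
    out = model(input_ids=query_responses, attention_mask=attention_mask)
pad_logits = out.logits[~attention_mask]  # logits at pad positions only
print(pad_logits.abs().max())             # these come out as all zeros here
```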
I'll investigate your new issue!
> I met the same problem: whenever I set `model.train()` and use left padding, the loss is not differentiable and the logits at the padded positions are all zero.
> I also have no idea why the loss cannot be backpropagated even when I do not set `model.train()`. I tried this simple example on the provided Llama 3.1 Colab, and the error is this:
> Could anyone tell me what I am doing wrong, and why my Llama loss cannot be backpropagated even in this simple case?
Hi, are you still having the issue? :)