
Context length for DPO on a 7b model

Open • gmonair opened this issue 10 months ago • 3 comments

I'm working on using DPO based on the DPO Zephyr Unsloth Example.ipynb notebook.

I'm loading the model like so:

from unsloth import FastLanguageModel

max_seq_length = 32768
dtype = None          # auto-detect (bfloat16 on this GPU); notebook default
load_in_4bit = True   # 4-bit QLoRA loading; notebook default
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "models/DeepSeek-R1-Distill-Qwen-7B", # Choose ANY! eg mistralai/Mistral-7B-Instruct-v0.2
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
Unsloth 2025.1.8: Fast Qwen2 patching. Transformers: 4.48.2.
GPU: NVIDIA RTX A6000. Max memory: 47.529 GB. Platform: Linux.
Torch: 2.4.0+cu121. CUDA: 8.6. CUDA Toolkit: 12.1. Triton: 3.0.0
Bfloat16 = TRUE. FA [Xformers = 0.0.27.post2. FA2 = True]

LoRA config:

model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Currently only supports dropout = 0
    bias = "none",    # Currently only supports bias = "none"
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

And the trainer:

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    args = DPOConfig(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 8,
        warmup_ratio = 0.1,
        num_train_epochs = 3,
        learning_rate = 5e-6,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.0,
        lr_scheduler_type = "linear",
        seed = 42,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
    beta = 0.1,
    train_dataset = trl_dataset,
    # eval_dataset = raw_datasets["test"],
    tokenizer = tokenizer,
    # max_length = 11000,
    max_length = 16000,
    max_prompt_length = 512,
)

With max_length = 11000 everything works fine, but trying to use 16k results in OOM. I saw in the docs that Unsloth supports very long context lengths for 7B models. Do I need to set anything else besides what I'm doing, or is this due to DPO being more resource-intensive?

Also, is there a way to run this on 2 GPUs to speed up training? (NVLink is enabled, if it matters.)

Any feedback is appreciated. Thanks!

gmonair • Feb 04, 2025

Try downgrading trl to 0.13.0 after installing unsloth:

!pip uninstall -y trl
!pip install trl==0.13.0
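
To confirm which trl version is actually active after the reinstall (a quick sanity check; in Colab you may need to restart the runtime first):

import trl
print(trl.__version__)  # expect 0.13.0 after the downgrade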

trl 0.14.0 was released last week and caused some issues for me with the Unsloth DPO trainer. On prior versions I could fine-tune an 8B model at a 24k context length on a Colab A100 40GB, but I got OOM after the latest trl release.

Dillion • Feb 05, 2025

Try downgrading trl to 0.13.0 after installing unsloth

Thanks, I tried it but I still get OOM even after downgrading trl.

gmonair • Feb 05, 2025

Sorry for the delay - sadly, DPO does in fact use more memory than normal fine-tuning. I'm working on reducing VRAM usage, which should definitely help.
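
For context, here's a minimal sketch of the DPO objective (not Unsloth's or TRL's actual implementation) showing where the extra memory goes: every step needs policy log-probs for both the chosen and rejected completions, plus the matching reference log-probs (from a frozen reference, or from the LoRA model with adapters disabled when ref_model = None), so activation memory is roughly double that of plain SFT at the same sequence length.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta = 0.1):
    # Reward margins are measured relative to the reference model.
    chosen_rewards   = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid of the margin pushes chosen above rejected.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with per-sequence summed log-probs; at training time each of
# these four numbers comes from a separate full-length forward pass.
loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
                torch.tensor([-11.0]), torch.tensor([-11.5]))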

danielhanchen • Feb 10, 2025