GRPOTrainer example works with trl but generates "noise" with unsloth

nottrz opened this issue 8 months ago • 6 comments

Hi, I'm running a simple example of GRPOTrainer in plain trl and it runs fine (using the very same conda env I use for unsloth):

grpo_example2.txt

After MANY iterations the text becomes garbage, but I think that's reasonable given the reward function used.
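For reference, a bare-bones plain-trl setup along those lines looks roughly like the sketch below; the model name, dataset, and reward function are placeholders, since the real ones are in grpo_example2.txt:

from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Toy prompt-only dataset; the real data lives in grpo_example2.txt
dataset = Dataset.from_dict({"prompt": ["Tell me a short story about school."] * 64})

# Placeholder reward: prints the first completion (as in the logs below) and
# rewards longer text; the actual reward function is in the attachment
def reward_function(completions, **kwargs):
    print("reward_function completions:", completions[0])
    return [float(len(c)) for c in completions]

training_args = GRPOConfig(
    output_dir = "outputs_trl",
    logging_steps = 1,
    per_device_train_batch_size = 8,
    num_generations = 8,        # effective batch size must be divisible by this
    max_completion_length = 64,
)

trainer = GRPOTrainer(
    model = "Qwen/Qwen2.5-0.5B-Instruct",  # placeholder; any small causal LM works
    reward_funcs = reward_function,
    args = training_args,
    train_dataset = dataset,
)
trainer.train()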

I ported this to unsloth and it runs, but the model generates "noise" after the very first fine-tuning iteration:

prova_grpo.txt

First completion is fine:

reward_function completions: I got blamed, and the girl is in the same classes, for what i didn't do.

The following ones are "noise":

reward_function completions: back.Peeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee reward_function completions: .Pee.Pee est.Peeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee

Environment details and full log:

log.txt

This might be related to https://github.com/unslothai/unsloth/issues/1836, but I'm already using Python 3.11.11.

Also, per https://github.com/unslothai/unsloth/issues/1672, I tried 2025.2.12, but the result is still the same.

I also tried unsloth/llama-3-8b-bnb-4bit, with the same results.

What am I doing wrong?

Thanks

nottrz avatar Feb 27 '25 11:02 nottrz

I encountered the same issue as you did. I checked all the installation versions on the official Colab and ensured that they were consistent, but the problem still persisted. Eventually, I set vllm_cache=True and found that the model could run normally and generate proper sequences. To be more specific, the settings are as follows:

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_path,
    max_seq_length=max_seq_length,
    load_in_4bit=True,
    fast_inference=True,      # set True if you want vLLM fast inference
    max_lora_rank=lora_rank,
    gpu_memory_utilization=0.7,
)

training_args_lyrics = GRPOConfig(
    use_vllm = True, # use vLLM for fast inference!
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "paged_adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 4,
    gradient_accumulation_steps = 2, # Increase to 4 for smoother training
    num_generations = 8, # Decrease if out of memory
    max_prompt_length = 768,
    max_completion_length = 768,
    num_train_epochs = 2, # Set to 1 for a full training run
    # max_steps = 50,
    save_steps = 250,
    max_grad_norm = 0.1,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs_lyrics_phase",
)

With these settings, the program runs smoothly. It seems that the current models only support vllm-based gradient backpropagation. Without enabling vllm_cache, the first batch of data might be normal, but subsequent batches often encounter repetitive issues. However, once vllm_cache is turned on, the aforementioned problems are resolved!
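For context, the rest of the wiring (not shown above) follows the official unsloth GRPO notebook; dataset and reward_function in the sketch below are placeholders for the actual data and reward:

from trl import GRPOTrainer

# Attach LoRA adapters to the 4-bit base model loaded above
model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth",  # reduces VRAM for long contexts
    random_state = 3407,
)

trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [reward_function],   # placeholder reward
    args = training_args_lyrics,
    train_dataset = dataset,            # placeholder dataset with a "prompt" column
)
trainer.train()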

StarLight1212 avatar Feb 27 '25 14:02 StarLight1212

I encountered the same issue as you did. I checked all the installation versions on the official Colab and ensured that they were consistent, but the problem still persisted. Eventually, I set vllm_cache=True and found that the model could run normally and generate proper sequences. To be more specific, the settings are as follows: . . . With these settings, the program runs smoothly. It seems that the current models only support vllm-based gradient backpropagation. Without enabling vllm_cache, the first batch of data might be normal, but subsequent batches often encounter repetitive issues. However, once vllm_cache is turned on, the aforementioned problems are resolved!

@StarLight1212, I cannot see vllm_cache=True anywhere in your snippet.

DiTo97 avatar Feb 27 '25 14:02 DiTo97

@StarLight1212 Thank you very much for your detailed answer :)

I added fast_inference, enforce_eager, and gpu_memory_utilization here:

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    #model_name = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
    fast_inference = True,
    enforce_eager = True,
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
    gpu_memory_utilization = 0.7,
)

and use_vllm = True in the GRPOConfig
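i.e. roughly the following; the rest of my GRPOConfig is in prova_grpo.txt, so the other values here are only indicative:

from trl import GRPOConfig

training_args = GRPOConfig(
    use_vllm = True,                    # hand generation off to vLLM
    output_dir = "outputs",
    logging_steps = 1,
    per_device_train_batch_size = 4,
    num_generations = 4,                # batch size must be divisible by this
    max_prompt_length = 512,
    max_completion_length = 512,
)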

Now it works with the Llama 3B and 8B!

Note:

pip install vllm

downgraded several core packages, but everything still works fine.
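To double-check what actually ended up installed after that, a quick sanity check like the following works; the package list is just the usual suspects:

# Print the versions of the core packages that `pip install vllm` may have touched
from importlib.metadata import PackageNotFoundError, version

for pkg in ("unsloth", "unsloth_zoo", "trl", "transformers", "torch", "vllm"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")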

nottrz avatar Feb 27 '25 16:02 nottrz

I have the same problem; it is not resolved yet.

Summer142857 avatar Mar 01 '25 19:03 Summer142857

I have the same problem; it is not resolved yet.

I found a way to resolve it: downgrade unsloth to 2025.2.12 and do not use vLLM. You can also refer to #1810.
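Concretely, something like this should do the downgrade (assuming 2025.2.12 is the exact version string published on PyPI):

pip install "unsloth==2025.2.12"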

Summer142857 avatar Mar 01 '25 20:03 Summer142857

@nottrz @DiTo97 @Summer142857 Apologies, just fixed! For Colab / Kaggle, please restart and run all. For local machines, please do:

pip install --force-reinstall --upgrade --no-cache-dir --no-deps unsloth unsloth_zoo

danielhanchen avatar Mar 05 '25 13:03 danielhanchen

If this issue still has not been resolved, feel free to open a new issue, but I'll be closing this one for now!

shimmyshimmer avatar Jun 06 '25 19:06 shimmyshimmer

I encountered the same issue as you did. I checked all the installation versions on the official Colab and ensured that they were consistent, but the problem still persisted. Eventually, I set vllm_cache=True and found that the model could run normally and generate proper sequences. To be more specific, the settings are as follows: . . . With these settings, the program runs smoothly. It seems that the current models only support vllm-based gradient backpropagation. Without enabling vllm_cache, the first batch of data might be normal, but subsequent batches often encounter repetitive issues. However, once vllm_cache is turned on, the aforementioned problems are resolved!

How do you turn on vllm_cache? It's not in your code snippet.

mikeknapp avatar Aug 19 '25 11:08 mikeknapp