unsloth GRPO training error

I'm training Llama-3.2-1B-Instruct at commit https://github.com/unslothai/unsloth/commit/2c0f50160e227936e0011d67e3bc2472c2089629. My code is from https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb; I only changed the model to Llama-3.2-1B-Instruct because I don't have many resources.

I'm running in a Docker environment with CUDA=12.1.

Versions: torch 2.5.1, unsloth 2025.2.15, unsloth_zoo 2025.2.7
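
For anyone reproducing this, the package versions (including trl, which is not listed above) can be collected in one go. This is only a minimal sketch: `collect_versions` is a hypothetical helper name, and it assumes the packages are installed under the distribution names shown.

```python
from importlib.metadata import PackageNotFoundError, version


def collect_versions(packages):
    """Return {distribution name: installed version or 'not installed'}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is missing from the current environment.
            found[name] = "not installed"
    return found


# Distribution names are assumptions; adjust them to your environment.
report = collect_versions(["torch", "unsloth", "unsloth_zoo", "trl", "transformers"])
for name, ver in report.items():
    print(f"{name}: {ver}")
```

Pasting this output into the report saves a round trip when a maintainer asks which versions are installed.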

Commit https://github.com/unslothai/unsloth/commit/512fec6a7b77a930b85a5b5685bf056fbb29ff5e works for me. Commit https://github.com/unslothai/unsloth/commit/179840d3a7b49188c372b56c67c4290d53c29ed6 still has the same error.

Here is my code:

from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)


from unsloth import is_bfloat16_supported
import torch
max_seq_length = 512 # Can increase for longer reasoning traces
lora_rank = 8 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "../llm_test/many_test/models/Llama-3.2-1B-Instruct/",
    # model_name = "../llm_test/many_test/models/Qwen2.5-0.5B-Instruct/",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = False, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.6, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)


import re
from datasets import load_dataset, Dataset
from modelscope.msdatasets import MsDataset

# Load and prep dataset
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# uncomment middle messages for 1-shot prompting
def get_gsm8k_questions(split = "train") -> Dataset:
    # data = load_dataset('openai/gsm8k', 'main')[split] # type: ignore
    data =  MsDataset.load('modelscope/gsm8k', subset_name='main', split=split)
    data = data.map(lambda x: { # type: ignore
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    }) # type: ignore
    return data # type: ignore

dataset = get_gsm8k_questions()

# Reward functions
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

def int_reward_func(completions, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]


from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    use_vllm = False, # use vLLM for fast inference!
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "paged_adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 6, # Decrease if out of memory
    max_prompt_length = 256,
    max_completion_length = 200,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps =  250,
    save_steps = 250,
    max_grad_norm = 0.1,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",
)


trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = dataset,
)
trainer.train()
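
A quick sanity check of the reward helpers from the script above, run on hand-written strings instead of model output. It is a sketch only: the two sample completions are made up for illustration, and `SOFT_PATTERN` is just the pattern string from `soft_format_reward_func` copied into a constant.

```python
import re


def extract_xml_answer(text: str) -> str:
    # Same logic as the helper in the script above.
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()


# Pattern copied from soft_format_reward_func.
SOFT_PATTERN = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"

# Hypothetical completions, not taken from any real run.
tagged = "<reasoning>\n6 * 7 = 42\n</reasoning>\n<answer>\n42\n</answer>\n"
untagged = "The answer is 42."

# With tags, only the answer body is returned.
print(repr(extract_xml_answer(tagged)))    # '42'
# Without tags, the whole completion comes back unchanged.
print(repr(extract_xml_answer(untagged)))  # 'The answer is 42.'

# Without re.DOTALL, "." does not match "\n", so a multi-line
# completion in the requested format does not match the soft pattern.
print(bool(re.match(SOFT_PATTERN, tagged)))             # False
print(bool(re.match(SOFT_PATTERN, tagged, re.DOTALL)))  # True
```

Two things follow from this. When a completion has no `<answer>` tags, the printed "Extracted" text is identical to the response. And if the soft format check is meant to accept multi-line reasoning, passing `re.DOTALL` is one option.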

Full log:

root@c0410db6a918:/code/unsloth_20250226# python tmp.py
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 02-26 09:35:07 init.py:190] Automatically detected platform cuda.
==((====))== Unsloth 2025.2.15: Fast Llama patching. Transformers: 4.49.0.
\ /| GPU: NVIDIA GeForce RTX 3090. Max memory: 23.691 GB. Platform: Linux.
O^O/ _/ \ Torch: 2.5.1+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.1.0
\ / Bfloat16 = TRUE. FA [Xformers = 0.0.29.post1. FA2 = True]
"--" Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
../llm_test/many_test/models/Llama-3.2-1B-Instruct/ does not have a padding token! Will use pad_token = <|finetune_right_pad_id|>.
Unsloth 2025.2.15 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.
2025-02-26 09:36:15,980 - modelscope - WARNING - Use trust_remote_code=True. Will invoke codes from gsm8k. Please make sure that you can trust the external codes.
2025-02-26 09:36:16,418 - modelscope - WARNING - Use trust_remote_code=True. Will invoke codes from modelscope/gsm8k. Please make sure that you can trust the external codes.
2025-02-26 09:36:16,418 - modelscope - WARNING - Use trust_remote_code=True. Will invoke codes from modelscope/gsm8k. Please make sure that you can trust the external codes.
2025-02-26 09:36:16,419 - modelscope - WARNING - Use trust_remote_code=True. Will invoke codes from modelscope/gsm8k. Please make sure that you can trust the external codes.
Detected kernel version 4.15.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
==((====))== Unsloth - 2x faster free finetuning | Num GPUs = 1
\ /| Num examples = 7,473 | Num Epochs = 1
O^O/ _/ \ Batch size per device = 1 | Gradient Accumulation steps = 1
\ / Total batch size = 1 | Total steps = 250
"--" Number of trainable parameters = 5,636,096
0%| | 0/250 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/code/unsloth_20250226/tmp.py", line 168, in <module>
    trainer.train()
  File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2241, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "", line 329, in _fast_inner_training_loop
  File "", line 31, in _unsloth_training_step
  File "/code/unsloth_20250226/unsloth_compiled_cache/UnslothGRPOTrainer.py", line 766, in compute_loss
    prompt_ids, prompt_mask = inputs["prompt_ids"], inputs["prompt_mask"]
                              ~~~~~~^^^^^^^^^^^^^^
TypeError: list indices must be integers or slices, not str
0%| | 0/250 [00:00<?, ?it/s]
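
The last frame of the traceback shows that `inputs` is a list at the point where `compute_loss` indexes it with a string key. The sketch below reproduces only that Python-level failure. The two batch objects are hypothetical stand-ins, not real trainer data, and whether a version mismatch between the patched trainer and the installed trl is what hands it a list is an assumption, not something the log proves.

```python
# Hypothetical stand-ins for what compute_loss might receive.
batch_as_dict = {"prompt_ids": [[1, 2, 3]], "prompt_mask": [[1, 1, 1]]}
batch_as_list = [{"prompt": "question 1"}, {"prompt": "question 2"}]


def read_prompt_fields(inputs):
    # Same access pattern as the failing line in the traceback.
    return inputs["prompt_ids"], inputs["prompt_mask"]


# A dict of prepared fields works.
print(read_prompt_fields(batch_as_dict))

# A list of raw examples raises the TypeError seen in the log.
try:
    read_prompt_fields(batch_as_list)
except TypeError as err:
    error_message = str(err)
    print(error_message)
```

Printing `type(inputs)` just before the failing line would confirm which shape the trainer actually receives.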

How can I fix this?

Feb 26 '25 09:02 xudou3

Do you know what version of TRL you are using?

Feb 26 '25 12:02 danielhanchen

Do you know what version of TRL you are using?

trl 0.14.0

After updating trl to 0.15.2, training runs, but the model output seems incorrect. It looks like this:

root@c0410db6a918:/code/unsloth_20250226# python tmp.py
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 02-26 12:12:28 init.py:190] Automatically detected platform cuda.
==((====))== Unsloth 2025.2.15: Fast Llama patching. Transformers: 4.49.0.
\ /| GPU: NVIDIA GeForce RTX 3090. Max memory: 23.691 GB. Platform: Linux.
O^O/ _/ \ Torch: 2.5.1+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.1.0
\ / Bfloat16 = TRUE. FA [Xformers = 0.0.29.post1. FA2 = True]
"--" Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
../llm_test/many_test/models/Llama-3.2-1B-Instruct/ does not have a padding token! Will use pad_token = <|finetune_right_pad_id|>.
Unsloth 2025.2.15 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.
2025-02-26 12:13:37,616 - modelscope - WARNING - Use trust_remote_code=True. Will invoke codes from gsm8k. Please make sure that you can trust the external codes.
2025-02-26 12:13:38,005 - modelscope - WARNING - Use trust_remote_code=True. Will invoke codes from modelscope/gsm8k. Please make sure that you can trust the external codes.
2025-02-26 12:13:38,006 - modelscope - WARNING - Use trust_remote_code=True. Will invoke codes from modelscope/gsm8k. Please make sure that you can trust the external codes.
2025-02-26 12:13:38,006 - modelscope - WARNING - Use trust_remote_code=True. Will invoke codes from modelscope/gsm8k. Please make sure that you can trust the external codes.
Unsloth: We now expect per_device_train_batch_size to be a multiple of num_generations.
We will change the batch size of 1 to the num_generations of 6
Detected kernel version 4.15.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
==((====))== Unsloth - 2x faster free finetuning | Num GPUs = 1
\ /| Num examples = 7,473 | Num Epochs = 1
O^O/ _/ \ Batch size per device = 6 | Gradient Accumulation steps = 1
\ / Total batch size = 6 | Total steps = 250
"--" Number of trainable parameters = 5,636,096
0%| | 0/250 [00:00<?, ?it/s]
-------------------- Question:
A concert ticket costs $40. Mr. Benson bought 12 tickets and received a 5% discount for every ticket bought that exceeds 10. How much did Mr. Benson pay in all?
Answer:
476
Response:
Mr. Benson bought 12 tickets. The first 10 tickets are normal price, and the remaining 2 tickets are bought with a 5% discount.

Discount on 10 tickets = 5% of (10*12) = 0.05 * 120 = $6

Discount on 2 tickets = 5% of (2*40) = 0.05 * 80 = $4

Total discount = $6 + $4 = $10

Now, we calculate the total price:

Discounted price for 10 tickets = 10 * $40 = $400

Discounted price for remaining 2 tickets = 2 * ($40 - $10) = 2 * $30 = $60

Total price for 12 tickets = $400 + $60 = $460

Mr. Benson paid: $460.
Extracted:
Mr. Benson bought 12 tickets. The first 10 tickets are normal price, and the remaining 2 tickets are bought with a 5% discount.

Discount on 10 tickets = 5% of (10*12) = 0.05 * 120 = $6

Discount on 2 tickets = 5% of (2*40) = 0.05 * 80 = $4

Total discount = $6 + $4 = $10

Now, we calculate the total price:

Discounted price for 10 tickets = 10 * $40 = $400

Discounted price for remaining 2 tickets = 2 * ($40 - $10) = 2 * $30 = $60

Total price for 12 tickets = $400 + $60 = $460

Mr. Benson paid: $460. {'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 2.0000000000000002e-07, 'rewards/xmlcount_reward_func': 0.0, 'rewards/soft_format_reward_func': 0.0, 'rewards/strict_format_reward_func': 0.0, 'rewards/int_reward_func': 0.0, 'rewards/correctness_reward_func': 0.0, 'reward': 0.0, 'reward_std': 0.0, 'completion_length': 181.1666717529297, 'kl': 0.0, 'epoch': 0.0} 0%|▋ | 1/250 [00:16<1:06:42, 16.08s/it]-------------------- Question: Jane is trying to decide whether to buy a house or a trailer. A house costs $480,000 and a trailer costs $120,000. Each loan will be paid in monthly installments over 20 years. How much more is the monthly payment on the house compared to the trailer? Answer: 1500 Response: oteppureisteristeristeristeristeristeristerureureisterureisterureureureureisterureureureureisterureureureureisterureureureureureureisterureureureureureureureureureutorgureureureureureureureureureureureureureureureureureutoureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureisterureureureureureureureureureureureisteristerureureutoureureureureureisterureureureureureureureureureureureureureureisterureureureureureureureureureisterureureureureureureureureureureureureisterureureureutoureureureutoureureureureutoureutoureutoureureureureutoutoutoutoutoutoutoutoutoutoisterureureure Extracted: 
oteppureisteristeristeristeristeristeristerureureisterureisterureureureureisterureureureureisterureureureureisterureureureureureureisterureureureureureureureureureutorgureureureureureureureureureureureureureureureureureutoureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureisterureureureureureureureureureureureisteristerureureutoureureureureureisterureureureureureureureureureureureureureureisterureureureureureureureureureisterureureureureureureureureureureureureisterureureureutoureureureutoureureureureutoureutoureutoureureureureutoutoutoutoutoutoutoutoutoutoisterureureure {'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 4.0000000000000003e-07, 'rewards/xmlcount_reward_func': 0.0, 'rewards/soft_format_reward_func': 0.0, 'rewards/strict_format_reward_func': 0.0, 'rewards/int_reward_func': 0.0, 'rewards/correctness_reward_func': 0.0, 'reward': 0.0, 'reward_std': 0.0, 'completion_length': 200.0, 'kl': 0.0, 'epoch': 0.0} 1%|█▍ | 2/250 [00:21<40:38, 9.83s/it]-------------------- Question: Janet pays $40/hour for 3 hours per week of clarinet lessons and $28/hour for 5 hours a week of piano lessons. How much more does she spend on piano lessons than clarinet lessons in a year? 
Answer: 1040 Response: resure�ldureisteristeristeristeristerrgrgrgisterrgureisteristerureisterureureureisterureureureureureisterureureureureureureureutoisterureureureureureisterureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureutoureureureureureureureureureureureureisterureureutoureureureureureureureureureureisterureureureureureureureurergureutoutoutoutoutoutoutoureutoureureureureureureureureureureureutoureureutoureutoureureureureureureureureureureureureureureureureureureureureureureutoureureureureureureureureureisterureureureureureureureureureutoureureureureisterureisterureureargureutoureutoutouto Extracted: resure�ldureisteristeristeristeristerrgrgrgisterrgureisteristerureisterureureureisterureureureureureisterureureureureureureureutoisterureureureureureisterureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureutoureureureureureureureureureureureureisterureureutoureureureureureureureureureureisterureureureureureureureurergureutoutoutoutoutoutoutoureutoureureureureureureureureureureureutoureureutoureutoureureureureureureureureureureureureureureureureureureureureureureutoureureureureureureureureureisterureureureureureureureureureutoureureureureisterureisterureureargureutoureutoutouto {'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 6.000000000000001e-07, 'rewards/xmlcount_reward_func': 0.0, 'rewards/soft_format_reward_func': 0.0, 'rewards/strict_format_reward_func': 0.0, 'rewards/int_reward_func': 0.0, 'rewards/correctness_reward_func': 0.0, 'reward': 0.0, 'reward_std': 0.0, 'completion_length': 200.0, 'kl': 0.0, 'epoch': 0.0} 1%|██ | 3/250 [00:27<32:46, 7.96s/it]-------------------- Question: Sabrina is collecting herbs to make a poultice for her grandmother. She needs twice as many basil leaves as sage leaves and 5 fewer sage leaves than verbena leaves. If she needs 12 basil leaves, how many leaves total does she need? 
Answer: 29 Response: roidppisteristeristeristeristeristeristeristeristerureisterureisterureisterureureureisterureisterureureureureureisterureureureureureureureureureureureureureureureureureisterureureureureureisterureureuto

rgureisterureutoisterureisteristerureisterureutoisterureutoisteristerureisterureutoutoutoutoutoutoutoutoutoisterureutoutoutoutoutoutoisterureisteristerutoutoutoutoutoisteristerutoureutoutoutoutoutoutoutoisterureisterureutoisterureisteristerutoutoutoutoutoutoisteristerutoureisterureutoutoutoutoutoutoutoutoutoutoutoisteristerutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutouto Extracted: roidppisteristeristeristeristeristeristeristeristerureisterureisterureisterureureureisterureisterureureureureureisterureureureureureureureureureureureureureureureureureisterureureureureureisterureureuto

rgureisterureutoisterureisteristerureisterureutoisterureutoisteristerureisterureutoutoutoutoutoutoutoutoutoisterureutoutoutoutoutoutoisterureisteristerutoutoutoutoutoisteristerutoureutoutoutoutoutoutoutoisterureisterureutoisterureisteristerutoutoutoutoutoutoisteristerutoureisterureutoutoutoutoutoutoutoutoutoutoutoisteristerutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutoutouto {'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 8.000000000000001e-07, 'rewards/xmlcount_reward_func': 0.0, 'rewards/soft_format_reward_func': 0.0, 'rewards/strict_format_reward_func': 0.0, 'rewards/int_reward_func': 0.0, 'rewards/correctness_reward_func': 0.0, 'reward': 0.0, 'reward_std': 0.0, 'completion_length': 200.0, 'kl': 0.0, 'epoch': 0.0} 2%|██▊ | 4/250 [00:32<28:54, 7.05s/it]-------------------- Question: Over the past five years, on July 4th, the high temperature for Washington, DC has been: 90 degrees in 2020, 90 degrees in 2019, 90 degrees in 2018, 79 degrees in 2017 and 71 degrees in 2016. What is the average temperature for July 4th in Washington, DC over the past 5 years? 
Answer: 84 Response: Keyynppurergurergureisteristeristeristeristeristeristerureureisterureureureisterureureureureisterureisterureureureisterureurergureureureureureutorgureureisterureureureureureureureureureureureisterureureureureisterureureureureureureureisterureureureureureureureureureureureureureureureureureureutoisterureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureutoureutoureureureureureureicesureureureureureureureisterureureureureureureureureureureutoureutoureureutoureureureureureutoutoutoutoutoutoutoutoutoutoisterureureureutoureureureureureisterureutoureureureureureutoureutoureutoureureureutoureisterure Extracted: Keyynppurergurergureisteristeristeristeristeristeristerureureisterureureureisterureureureureisterureisterureureureisterureurergureureureureureutorgureureisterureureureureureureureureureureureisterureureureureisterureureureureureureureisterureureureureureureureureureureureureureureureureureureutoisterureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureutoureutoureureureureureureicesureureureureureureureisterureureureureureureureureureureutoureutoureureutoureureureureureutoutoutoutoutoutoutoutoutoutoisterureureureutoureureureureureisterureutoureureureureureutoureutoureutoureureureutoureisterure {'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 1.0000000000000002e-06, 'rewards/xmlcount_reward_func': 0.0, 'rewards/soft_format_reward_func': 0.0, 'rewards/strict_format_reward_func': 0.0, 'rewards/int_reward_func': 0.0, 'rewards/correctness_reward_func': 0.0, 'reward': 0.0, 'reward_std': 0.0, 'completion_length': 200.0, 'kl': 0.0, 'epoch': 0.0} 2%|███▍ | 5/250 [00:38<26:52, 6.58s/it]-------------------- Question: Rene can finish reading 30 pages in 60 minutes. Lulu can read 27 pages in 60 minutes and Cherry can read 25 pages in 60 minutes. If they have been reading for 240 minutes now, how many pages have they finished reading in total? 
Answer: 328 Response: resureureisteristeristeristeristeristeristeristeristeristeristeristerureisteristerureureureisterureureureisterureureisterureureureureureureureureureureureureureureureureureureisterureisterureureureureureureureutoisterureureureureureureisterureisterureureureureureureureureureureureureureisterureureureureureureisterureureureureureisterureisterureureutoureutoureureureureureisterureureureureureisterureureutoureutoureureureisterureureureisterureureureureisterureisteristerureureureisterister doesisteristeristeristerureureureureutoureureureisterureisteristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristerureisteristeristeristeristeristeristeristerureisterureisterureutoutoutoutoutoutoutouto Extracted: resureureisteristeristeristeristeristeristeristeristeristeristeristerureisteristerureureureisterureureureisterureureisterureureureureureureureureureureureureureureureureureureisterureisterureureureureureureureutoisterureureureureureureisterureisterureureureureureureureureureureureureureisterureureureureureureisterureureureureureisterureisterureureutoureutoureureureureureisterureureureureureisterureureutoureutoureureureisterureureureisterureureureureisterureisteristerureureureisterister doesisteristeristeristerureureureureutoureureureisterureisteristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristerureisteristeristeristeristeristeristeristerureisterureisterureutoutoutoutoutoutoutouto {'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 1.2000000000000002e-06, 'rewards/xmlcount_reward_func': 0.0, 'rewards/soft_format_reward_func': 0.0, 'rewards/strict_format_reward_func': 0.0, 'rewards/int_reward_func': 0.0, 'rewards/correctness_reward_func': 0.0, 'reward': 0.0, 'reward_std': 0.0, 'completion_length': 200.0, 'kl': 0.0, 'epoch': 0.0} 2%|████▏ | 6/250 [00:44<25:39, 6.31s/it]-------------------- Question: Martin rings the small bell 4 times more than 
1/3 as often as the big bell. If he rings both of them a combined total of 52 times, how many times does he ring the big bell? Answer: 36 Response: .Dureppure(nure64ister(nureisteristeristeristeristerrgrgisteristeristerrgisterureisterrgureisterrgureisterureureisterureureisterureisterureisteristerureureureure Vureurergurergureurergureisterrgurergurergureisterureisterureisterurergureureisterureureisterureureureisteristerrgureureisterureisteristerureureureisterureureurergureureureisterureureurergure Vureureureureure(nureureurergureureureureureureureureisterureureureisterureureureisterureureureureurergureureureureureisterureureure_inisterureutoisterureureisterureisterureureisteristeristeristeristeristeristerureisterureisteristerureisterureisteristerureisterureisteristerureisteristeristeristeristerureisterureutoisteristerureutoutoisterureutoure Extracted: .Dureppure(nure64ister(nureisteristeristeristeristerrgrgisteristeristerrgisterureisterrgureisterrgureisterureureisterureureisterureisterureisteristerureureureure Vureurergurergureurergureisterrgurergurergureisterureisterureisterurergureureisterureureisterureureureisteristerrgureureisterureisteristerureureureisterureureurergureureureisterureureurergure Vureureureureure(nureureurergureureureureureureureureisterureureureisterureureureisterureureureureurergureureureureureisterureureure_inisterureutoisterureureisterureisterureureisteristeristeristeristeristeristerureisterureisteristerureisterureisteristerureisterureisteristerureisteristeristeristeristerureisterureutoisteristerureutoutoisterureutoure {'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 1.4000000000000001e-06, 'rewards/xmlcount_reward_func': 0.0, 'rewards/soft_format_reward_func': 0.0, 'rewards/strict_format_reward_func': 0.0, 'rewards/int_reward_func': 0.0, 'rewards/correctness_reward_func': 0.0, 'reward': 0.0, 'reward_std': 0.0, 'completion_length': 200.0, 'kl': 0.0, 'epoch': 0.0} 3%|████▊ | 7/250 [00:50<24:47, 6.12s/it]-------------------- Question: Bert fills 
out the daily crossword puzzle in the newspaper every day. He uses up a pencil to fill out the puzzles every two weeks. On average, it takes him 1050 words to use up a pencil. How many words are in each crossword puzzle on average? Answer: 75 Response: { rgureppureisteristeristeristeristeristeristerureureureisterureureisterureureureureureureureureureureureureureisterureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureutorgureureureureureureureureureureureureureureureureureureureureureurergureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureure Resureureureureureureureureureureureureureureureisterureureureureureureureureureureisterureureureureureureureureureureureureureureureurergureureureureureurergureureure Extracted: { rgureppureisteristeristeristeristeristeristerureureureisterureureisterureureureureureureureureureureureureureisterureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureutorgureureureureureureureureureureureureureureureureureureureureureurergureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureure Resureureureureureureureureureureureureureureureisterureureureureureureureureureureisterureureureureureureureureureureureureureureureurergureureureureureurergureureure {'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 1.6000000000000001e-06, 'rewards/xmlcount_reward_func': 0.0, 'rewards/soft_format_reward_func': 0.0, 'rewards/strict_format_reward_func': 0.0, 'rewards/int_reward_func': 0.0, 'rewards/correctness_reward_func': 0.0, 'reward': 0.0, 'reward_std': 0.0, 'completion_length': 200.0, 'kl': 0.0, 'epoch': 0.0} 3%|█████▌ | 8/250 [00:55<24:07, 5.98s/it]-------------------- Question: Matt can make a batch of a dozen cookies using 2 pounds of flour. 
He uses 4 bags of flour each weighing 5 pounds. If Jim eats 15 cookies how many cookies are left? Answer: 105 Response: relureisteristeristeristeristeristeristeristeristerureureureisteristerrgureisterureureureureisteristeristerureisterureureisterureureureureisterureureureisterureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureutoppureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureisterureureureureureureureureisterureureureureureureureureisterureureureureureureureureureureureureureureureureureureureureureutoureureutorgureureureureureureisterureureureureureutoureureutoureureisterureureureutoureureisterureisterrgureutoisterureureureureureutoureureureureureureutoureure Extracted: relureisteristeristeristeristeristeristeristeristerureureureisteristerrgureisterureureureureisteristeristerureisterureureisterureureureureisterureureureisterureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureutoppureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureureisterureureureureureureureureisterureureureureureureureureisterureureureureureureureureureureureureureureureureureureureureureutoureureutorgureureureureureureisterureureureureureutoureureutoureureisterureureureutoureureisterureisterrgureutoisterureureureureureutoureureureureureureutoureure {'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 1.8000000000000001e-06, 'rewards/xmlcount_reward_func': 0.0, 'rewards/soft_format_reward_func': 0.0, 'rewards/strict_format_reward_func': 0.0, 'rewards/int_reward_func': 0.0, 'rewards/correctness_reward_func': 0.0, 'reward': 0.0, 'reward_std': 0.0, 'completion_length': 200.0, 'kl': 0.0, 'epoch': 0.0} 4%|██████▎ | 9/250 [01:01<23:24, 5.83s/it]-------------------- Question: James decides to build a tin house by collecting 500 tins in a week. On the first day, he collects 50 tins. 
On the second day, he manages to collect 3 times that number. On the third day, he collects 50 tins fewer than the number he collected on the second day. If he collects an equal number of tins on the remaining days of the week, what's the number of tins he collected each day for the rest of the week? Answer: 50 Response: rgureppureisteristeristeristeristeristerrgisteristerureureureisterureureureureisterurergureisterurergureureureureureureutorgureureureureureureureureureisterureureureureureisterureureureureureureureureureureureureureureureureureureureureureureureureureureure Resureureureureureureureureureureurergureureureureureureureureureureureurergureureurergureureureureureurergureureureureureisterureureurergureureureisterureureureureutoisterurergureureureureurergureureureisterure consureurergureurergureisterureureureurergure_inisteristerutoutoisteristeristerureisteristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristerister Extracted: rgureppureisteristeristeristeristeristerrgisteristerureureureisterureureureureisterurergureisterurergureureureureureureutorgureureureureureureureureureisterureureureureureisterureureureureureureureureureureureureureureureureureureureureureureureureureureure Resureureureureureureureureureureurergureureureureureureureureureureureurergureureurergureureureureureurergureureureureureisterureureurergureureureisterureureureureutoisterurergureureureureurergureureureisterure consureurergureurergureisterureureureurergure_inisteristerutoutoisteristeristerureisteristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristeristerister {'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 2.0000000000000003e-06, 'rewards/xmlcount_reward_func': 0.0, 'rewards/soft_format_reward_func': 0.0, 'rewards/strict_format_reward_func': 0.0, 'rewards/int_reward_func': 0.0, 'rewards/correctness_reward_func': 0.0, 'reward': 0.0, 
'reward_std': 0.0, 'completion_length': 200.0, 'kl': 0.0, 'epoch': 0.0} 4%|██████▉ | 10/250 [01:06<22:56, 5.74s/it]-------------------- Question: A jar of jellybeans has 14 blue jellybeans, 26 purple jellybeans and 40 orange jellybeans. If there are 200 jellybeans in the jar, how many are there of the red color? Answer: 120 Response: ADcondureisteristerrgurergureisteristeristeristeristeristeristerureureisterureisteristerureureurergurergureureureureisterureureureureureureureutorgureureureureureisterureisterureureureureureureureutorgureureureureureureisterureisterureureureureisterureureutoureisterureureureureureureureureureureureureisteristerureureureureureureureutoureureutoureureureureisterureureureureureureureureurergureutoureureureureureisterureutoureureisterureureureureureutoureureureisterurergureureisterureureureutoureureureisterureureureisterureureisterureisterurergureutoisterureureureisterureureureureureisterureisterurergure_inisterutoisteristerutoutoutoutoutoutoutoisteristerutoutouto Extracted: ADcondureisteristerrgurergureisteristeristeristeristeristeristerureureisterureisteristerureureurergurergureureureureisterureureureureureureureutorgureureureureureisterureisterureureureureureureureutorgureureureureureureisterureisterureureureureisterureureutoureisterureureureureureureureureureureureureisteristerureureureureureureureutoureureutoureureureureisterureureureureureureureureurergureutoureureureureureisterureutoureureisterureureureureureutoureureureisterurergureureisterureureureutoureureureisterureureureisterureureisterureisterurergureutoisterureureureisterureureureureureisterureisterurergure_inisterutoisteristerutoutoutoutoutoutoutoisteristerutoutouto {'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 2.2e-06, 'rewards/xmlcount_reward_func': 0.0, 'rewards/soft_format_reward_func': 0.0, 'rewards/strict_format_reward_func': 0.0, 'rewards/int_reward_func': 0.0, 'rewards/correctness_reward_func': 0.0, 'reward': 0.0, 'reward_std': 0.0, 'completion_length': 200.0, 
'kl': 0.0, 'epoch': 0.0} 4%|███████▌ | 11/250 [01:12<22:30, 5.65s/it]-------------------- Question: Five adults and two children go to see a movie and buy $12 worth of concessions. The total cost of their trip is $76. If each child's ticket is $7, how much, in dollars, are the adult tickets? Answer: 10 Response: lineERisterrgrgrgisterrgureisterrgrgisterrgisterrgurergureisterrgureisterrgureisteristerrgurergureister consureureisterrgureutorgurergureureureurergureureister checkrgurergureureisterrgureurergureurergureureisterrgureureureureureurergureisterrgureureisterrgureisteristerrgureureureisterLOrgureureureisterrgureureureisteristerrgureureisterrgureisteristerrgureureureisterrgureisterrgureisteristerrgure Resureisterureureureureureisterrgureureisterrgureureurergureisterrgisterrgurergureisterrgurergureisterrgureisterrgureisteristerrgureisterrgurergureisterrgureisterrgureisteristeristerrgrgureisteristeristeristerrgrgureisteristerrgureisteristeristeristeristerrgure Extracted: lineERisterrgrgrgisterrgureisterrgrgisterrgisterrgurergureisterrgureisterrgureisteristerrgurergureister consureureisterrgureutorgurergureureureurergureureister checkrgurergureureisterrgureurergureurergureureisterrgureureureureureurergureisterrgureureisterrgureisteristerrgureureureisterLOrgureureureisterrgureureureisteristerrgureureisterrgureisteristerrgureureureisterrgureisterrgureisteristerrgure Resureisterureureureureureisterrgureureisterrgureureurergureisterrgisterrgurergureisterrgurergureisterrgureisterrgureisteristerrgureisterrgurergureisterrgureisterrgureisteristeristerrgrgureisteristeristeristerrgrgureisteristerrgureisteristeristeristeristerrgure {'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 2.4000000000000003e-06, 'rewards/xmlcount_reward_func': 0.0, 'rewards/soft_format_reward_func': 0.0, 'rewards/strict_format_reward_func': 0.0, 'rewards/int_reward_func': 0.0, 'rewards/correctness_reward_func': 0.0, 'reward': 0.0, 'reward_std': 0.0, 'completion_length': 200.0, 'kl': 0.0, 'epoch': 0.0} 
5%|████████▎ | 12/250 [01:17<22:05, 5.57s/it]

Feb 26 '25 12:02 xudou3

The issue appears to come from the Python version you are using. If you use Python 3.11, it should work. You may then run into an error importing "_lzma"; if so, follow this answer with the right Python version, which should fix it: https://askubuntu.com/a/1501705
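
A quick way to check both points (the interpreter version and whether lzma can be imported) is sketched below. The helper name check_python_env is hypothetical and only uses the standard library; it does not fix anything, it only reports the state of the current environment.

```python
import importlib
import sys


def check_python_env(expected=(3, 11)):
    """Return (version_matches, lzma_available) for the running interpreter."""
    # Compare only major.minor, e.g. (3, 11).
    version_matches = sys.version_info[:2] == tuple(expected)
    try:
        # The import fails if CPython was built without the _lzma extension.
        importlib.import_module("lzma")
        lzma_available = True
    except ImportError:
        lzma_available = False
    return version_matches, lzma_available


if __name__ == "__main__":
    ok_version, ok_lzma = check_python_env()
    print(
        f"Python {sys.version_info.major}.{sys.version_info.minor}, "
        f"matches 3.11: {ok_version}, lzma importable: {ok_lzma}"
    )
```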

Feb 26 '25 18:02 kings-crown

Not using vLLM, correct? Hmm, I'll verify batched inference - maybe something broke.

Feb 26 '25 21:02 danielhanchen

Thank you, I tried it, but it did not work for me. Is there anything related to _lzma?

Feb 27 '25 02:02 xudou3

Oh no, I don't think that's correct - better wait for my fix!

Feb 27 '25 03:02 danielhanchen

I encountered the same issue as you did. I checked all the installation versions on the official Colab and ensured that they were consistent, but the problem still persisted. Eventually, I set vllm_cache=True and found that the model could run normally and generate proper sequences. To be more specific, the settings are as follows:

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_path,
    max_seq_length=max_seq_length,
    load_in_4bit=True,
    fast_inference=True,      # set True if you want vLLM fast inference
    max_lora_rank=lora_rank,
    gpu_memory_utilization=0.7,
)

training_args_lyrics = GRPOConfig(
    use_vllm = True, # use vLLM for fast inference!
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "paged_adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 4,
    gradient_accumulation_steps = 2, # Increase to 4 for smoother training
    num_generations = 8, # Decrease if out of memory
    max_prompt_length = 768,
    max_completion_length = 768,
    num_train_epochs = 2, # Set to 1 for a full training run
    # max_steps = 50,
    save_steps = 250,
    max_grad_norm = 0.1,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs_lyrics_phase",
)

With these settings, the program runs smoothly. It seems that the current models only support vllm-based gradient backpropagation. Without enabling vllm_cache, the first batch of data might be normal, but subsequent batches often encounter repetitive issues. However, once vllm_cache is turned on, the aforementioned problems are resolved!

Feb 27 '25 14:02 StarLight1212

After updating to trl==0.15.2 and setting use_vllm=True, everything looks good so far. Thank you!

Feb 28 '25 06:02 xudou3

But I found a new problem: the batch size setting does not seem to take effect. Whatever value I set for per_device_train_batch_size, it is always 1.

Feb 28 '25 07:02 xudou3

With an H800, I can set batch_size=8. What is your machine configuration?

Mar 03 '25 05:03 StarLight1212

An A800. I can run with batch_size=8 at commit https://github.com/unslothai/unsloth/commit/512fec6a7b77a930b85a5b5685bf056fbb29ff5e, but batch_size is always 1 after I made these updates.
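
One thing that may be worth ruling out, though nothing in this thread confirms it is the cause: to my understanding, the GRPO trainer in trl 0.15.x expects the global batch size (number of processes times per_device_train_batch_size) to be divisible by num_generations. The sketch below only does that arithmetic; the function name grpo_batch_check is hypothetical and the rule itself is an assumption to verify against the installed trl version.

```python
def grpo_batch_check(per_device_train_batch_size, num_generations, num_processes=1):
    """Check the ASSUMED rule: global batch size is a multiple of num_generations."""
    global_batch = per_device_train_batch_size * num_processes
    return {
        "global_batch": global_batch,
        # True when completions can be grouped evenly per prompt.
        "divisible": global_batch % num_generations == 0,
        # Number of distinct prompts that fit into one global batch.
        "prompts_per_step": global_batch // num_generations,
    }


if __name__ == "__main__":
    # Values taken from the GRPOConfig posted earlier in the thread, on one GPU.
    print(grpo_batch_check(per_device_train_batch_size=4, num_generations=8))
    # A combination that satisfies the assumed rule.
    print(grpo_batch_check(per_device_train_batch_size=8, num_generations=8))
```

If the first call reports "divisible": False for a given configuration, trying a per_device_train_batch_size that is a multiple of num_generations would be a cheap experiment.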

Mar 03 '25 06:03 xudou3

@xudou3 @kings-crown @StarLight1212 Apologies, I just fixed the gibberish output! For Colab / Kaggle, please restart and run all. For local machines, please do:

pip install --force-reinstall --upgrade --no-cache-dir --no-deps unsloth unsloth_zoo
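
After reinstalling, it may help to confirm which versions actually ended up in the environment. A minimal sketch using only the standard library follows; the helper name installed_versions is hypothetical, and the package list is an assumption based on the packages mentioned in this thread.

```python
from importlib import metadata


def installed_versions(packages=("unsloth", "unsloth_zoo", "trl", "torch", "vllm")):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Posting this output alongside a bug report makes it easier to compare a local setup with the Colab environment.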

Mar 05 '25 13:03 danielhanchen

Thank you. I still cannot set the batch size after updating the code and reinstalling these packages. Any suggestions?

Mar 07 '25 07:03 xudou3

Why do I not see the vllm_cache parameter in the script?

Mar 25 '25 11:03 chuangzhidan

Is this suggested on both Linux and Windows? Just to be sure.

Apr 12 '25 05:04 CAISAMPS