format_reward always zero

Open lin-rany opened this issue 1 week ago • 2 comments

When I run the official Llama3_1_(8B)_GRPO.ipynb notebook without modifying any code, the soft_format_reward_func and strict_format_reward_func rewards are always zero. However, the total reward is increasing.

As shown in the figure below

Image

The trainer config is:

trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = dataset,
)
trainer.train()
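
One possible cause worth checking, sketched below under assumptions: if the format reward functions test completions with `re.match` and a pattern where `.` has to span the reasoning text, any completion with a newline inside the tags will fail to match unless `re.DOTALL` is passed. The pattern, the 0.5 reward value, and the helper name `soft_format_reward` are hypothetical stand-ins and are not copied from the notebook.

```python
import re

# Hypothetical pattern, assumed to resemble a "soft" format check.
PATTERN = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"

def soft_format_reward(text: str, flags: int = 0) -> float:
    """Return 0.5 if the text matches the format pattern, else 0.0."""
    return 0.5 if re.match(PATTERN, text, flags) else 0.0

# Content on one line inside each tag: "." never has to cross a newline.
single_line = "<reasoning>2 + 2 = 4</reasoning>\n<answer>4</answer>"

# Content on its own lines inside each tag: "." must cross newlines.
multi_line = "<reasoning>\n2 + 2 = 4\n</reasoning>\n<answer>\n4\n</answer>"

print(soft_format_reward(single_line))            # matches without flags
print(soft_format_reward(multi_line))             # no match: "." stops at "\n"
print(soft_format_reward(multi_line, re.DOTALL))  # matches once "." spans "\n"
```

If the second call returns zero for real completions printed from a training step, that would explain format rewards staying flat while the other reward functions still push the total up.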

lin-rany, Feb 19 '25 13:02