unsloth
unsloth copied to clipboard
format_reward always zero
when I run the official Llama3_1_(8B)_GRPO.ipynb script, I find that soft_format_reward_func and soft_format_reward_func rewards are always zero, and I haven't modified any code. However, the total rewards are increasing.
As shown in the figure below
config is
trainer = GRPOTrainer(
model = model,
processing_class = tokenizer,
reward_funcs = [
xmlcount_reward_func,
soft_format_reward_func,
strict_format_reward_func,
int_reward_func,
correctness_reward_func,
],
args = training_args,
train_dataset = dataset,
)
trainer.train()