Is the Self-Critique Rubric Reward applied to verifiable tasks as well?
Is it used in addition to the verifiable reward, or is it applied only to non-verifiable tasks?
Thanks!
No, only non-verifiable tasks are evaluated with the self-critique reward.
Thank you for the response!
I also noticed that the RL algorithm does not include an Importance Sampling term, which is commonly used in methods like PPO and GRPO, especially in semi off-policy setups. Since Importance Sampling is a standard approach for leveraging samples from older checkpoints to improve the latest policy, could you elaborate on why this term was omitted in your method? Was this an intentional design choice?
I guess it is inherited from Kimi 1.5 (https://arxiv.org/pdf/2501.12599), where the surrogate loss didn't include an importance sampling coefficient either.
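
For anyone following along, here is a rough sketch of what the importance sampling term discussed above looks like in a generic PPO-style clipped surrogate, next to a plain score-function surrogate that has no ratio. This is only an illustration and not the loss used in this model or in Kimi 1.5. The function names, the `eps=0.2` default, and the toy inputs are all my own assumptions.

```python
import math

def clipped_is_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Generic PPO-style clipped surrogate (to be maximized), averaged over samples.

    Hypothetical helper for illustration; not taken from any specific codebase.
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        # Importance sampling ratio pi_new(y|x) / pi_old(y|x), computed from log-probs.
        ratio = math.exp(ln - lo)
        # Clip the ratio to [1 - eps, 1 + eps] to limit how far the update can move.
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        # Take the pessimistic (smaller) of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

def no_is_surrogate(logp_new, advantages):
    """Score-function surrogate with no importance ratio: mean(log pi_new * A).

    Also a hypothetical helper; samples are treated as if they were on-policy.
    """
    return sum(ln * adv for ln, adv in zip(logp_new, advantages)) / len(advantages)

if __name__ == "__main__":
    logp_old = [-1.0, -2.0]        # log-probs under the older (sampling) checkpoint
    logp_new = [-0.5, -2.5]        # log-probs under the current policy
    advantages = [1.0, -1.0]       # toy advantages
    print(clipped_is_surrogate(logp_new, logp_old, advantages))
    print(no_is_surrogate(logp_new, advantages))
```

When the sampling policy and the current policy are identical, the ratio is 1 and the clipped surrogate reduces to the mean advantage, which is why the term only matters once the samples become stale.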