TSPO
Am I right that this runs RL training on the 0.4B model?
GPU memory usage is approximately 40 GB, but during reproduction I noticed that the loss is always 0, the computed inter-group advantages are all 0, and the rewards within each group are identical. Is this behavior normal?
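For context on the all-zero advantages: under a GRPO-style group-normalized advantage (an assumption here; TSPO's exact formula may differ), identical rewards within a group necessarily produce all-zero advantages, since each reward equals the group mean. A minimal sketch:

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    # GRPO-style normalization (assumed; TSPO's exact formula may differ):
    # advantage_i = (r_i - mean(r)) / (std(r) + eps)
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Identical rewards in a group -> every advantage is exactly 0.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0, 0.0]

# Reward variance within the group -> nonzero advantages.
print(group_advantages([1.0, 0.0, 1.0, 0.0]))
```

So zero advantages are a symptom rather than the bug itself: the question is why every rollout in a group is earning the same reward.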
Thank you for your interest in our work!
- A loss value of 0 is normal: the loss is constructed only to pass gradients, so its computed value should indeed be 0. For details, see lines 595–606 in src/open_tspo/trainer/tspo_trainer.py.
- Regarding "the inter-group advantages are all 0, and the rewards within each group are identical": that is unexpected. Could you please share the detailed training logs?
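The zero-valued loss mentioned above can be illustrated with a common autograd pattern (a hypothetical sketch, not the actual TSPO code): subtracting a detached copy of a tensor from itself makes the loss evaluate to exactly 0, while its gradient with respect to that tensor is unchanged.

```python
import torch

# Hypothetical illustration of a "value-zero, gradient-nonzero" surrogate loss.
logp = torch.tensor([0.2, -1.3, 0.5], requires_grad=True)
advantages = torch.tensor([1.0, -0.5, 0.25])

# (logp - logp.detach()) is numerically zero everywhere, so the printed
# loss is 0.0, yet d(loss)/d(logp) = -advantages still flows in backward().
loss = -((logp - logp.detach()) * advantages).sum()
print(loss.item())  # 0.0
loss.backward()
print(logp.grad)    # tensor([-1.0000,  0.5000, -0.2500])
```

This is why a logged loss of 0 says nothing about whether training is progressing; the gradients are what matter.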