
This conducts RL training on the 0.4B model. Is my understanding correct?

Open whu125 opened this issue 2 months ago • 1 comment

GPU memory usage is approximately 40GB. While reproducing the training, I noticed that the loss is always 0, the inter-group advantages are all 0, and the reward is the same for every group. Is this normal?

whu125 Sep 30 '25 01:09

Thank you for your interest in our work!

  1. A loss value of 0 is normal because we only use the loss to pass gradients, and its computed value should indeed be 0. For details, please refer to lines 595–606 in src/open_tspo/trainer/tspo_trainer.py.
  2. "The inter-group advantages calculated are all 0, and the reward for each group is the same": that is unexpected. Could you please provide the detailed training logs?
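
To illustrate the two points above, here is a minimal sketch in plain Python. It is not the code from tspo_trainer.py. It assumes a common GRPO-style surrogate, `-mean(exp(logp - stop_grad(logp)) * A)`, with advantages normalized within a group; the function names, the `eps` value, and the sample numbers are all hypothetical. Under these assumptions the loss value is 0 whenever the advantages are zero-mean, while the per-sample gradients stay non-zero. If every reward in a group is identical, the advantages and the gradients both vanish, so no learning signal is produced.

```python
import math

def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

def surrogate_loss_and_grads(logps, advantages):
    """Assumed surrogate: loss = -mean(exp(logp - stop_grad(logp)) * A).

    The ratio always evaluates to 1, so the loss value equals -mean(A),
    while the analytic gradient is d(loss)/d(logp_i) = -A_i / n.
    """
    n = len(logps)
    ratios = [math.exp(lp - lp) for lp in logps]  # always 1.0
    loss = -sum(r * a for r, a in zip(ratios, advantages)) / n
    grads = [-r * a / n for r, a in zip(ratios, advantages)]
    return loss, grads

# Hypothetical log-probabilities for four sampled completions.
logps = [-1.2, -0.7, -2.3, -0.9]

# Case 1: rewards differ within the group.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])
loss_mixed, grads_mixed = surrogate_loss_and_grads(logps, mixed)

# Case 2: every completion receives the same reward.
same = group_advantages([1.0, 1.0, 1.0, 1.0])
loss_same, grads_same = surrogate_loss_and_grads(logps, same)

print("mixed rewards:", loss_mixed, grads_mixed)
print("same rewards: ", loss_same, grads_same)
```

In the first case a logged loss of 0 is consistent with healthy training. In the second case the zero loss coincides with zero gradients, which is why identical rewards across a group are worth checking in the training logs.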

hanzifan Oct 10 '25 07:10