TSPO
Am I right that this runs RL training on the 0.4B model?
GPU memory usage is approximately 40 GB, but during reproduction I noticed that the loss is always 0, the computed inter-group advantages are all 0, and the rewards within each group are identical. Is this behavior normal?
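For context on the all-zero advantages: under a GRPO-style group-normalized advantage (an assumption here; TSPO's exact formula may differ), identical rewards within a group necessarily produce all-zero advantages, since each reward equals the group mean. A minimal sketch:

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    # GRPO-style normalization (assumed; TSPO's exact formula may differ):
    # advantage_i = (r_i - mean(r)) / (std(r) + eps)
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Identical rewards in a group -> every advantage is exactly 0.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0, 0.0]

# Reward variance within the group -> nonzero advantages.
print(group_advantages([1.0, 0.0, 1.0, 0.0]))
```

So zero advantages are a symptom rather than the bug itself: the question is why every rollout in a group is earning the same reward.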
Thank you for your interest in our work!
- A loss value of 0 is normal: the loss is constructed only to pass gradients, so its computed value should indeed be 0. For details, see lines 595–606 in src/open_tspo/trainer/tspo_trainer.py.
- Regarding "the inter-group advantages are all 0, and the rewards within each group are identical": that is unexpected. Could you please share the detailed training logs?
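The zero-valued loss mentioned above can be illustrated with a common autograd pattern (a hypothetical sketch, not the actual TSPO code): subtracting a detached copy of a tensor from itself makes the loss evaluate to exactly 0, while its gradient with respect to that tensor is unchanged.

```python
import torch

# Hypothetical illustration of a "value-zero, gradient-nonzero" surrogate loss.
logp = torch.tensor([0.2, -1.3, 0.5], requires_grad=True)
advantages = torch.tensor([1.0, -0.5, 0.25])

# (logp - logp.detach()) is numerically zero everywhere, so the printed
# loss is 0.0, yet d(loss)/d(logp) = -advantages still flows in backward().
loss = -((logp - logp.detach()) * advantages).sum()
print(loss.item())  # 0.0
loss.backward()
print(logp.grad)    # tensor([-1.0000,  0.5000, -0.2500])
```

This is why a logged loss of 0 says nothing about whether training is progressing; the gradients are what matter.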