HEJIAN SANG
The current issue with training the GPT-OSS model: the grad_norm in GRPO grows too fast, which prevents the model from achieving reasonably good performance (a sketch of how grad_norm is typically computed follows the list below).
* Train on gsm8k [PR](https://github.com/volcengine/verl/pull/3836), reasoning effort: medium reasoning...
* Train on retool with tool agent: [PR](https://github.com/volcengine/verl/pull/3837); grad_norm can grow as large as 1500
* Agent loop training using the math-expression example: https://github.com/volcengine/verl/blob/main/recipe/langgraph_agent/example/run_gpt_oss_20b_bf16.sh
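For reference, the grad_norm most trainers report is the global L2 norm over all parameter gradients. Below is a minimal sketch of that computation; the helper name is hypothetical and this is not verl's actual implementation.

```python
import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    """Global L2 norm over all parameter gradients.

    This mirrors what trainers typically log as `grad_norm`;
    a hypothetical helper for illustration, not verl's code.
    """
    norms = [
        p.grad.detach().norm(2)
        for p in model.parameters()
        if p.grad is not None
    ]
    if not norms:
        return 0.0
    return torch.norm(torch.stack(norms), 2).item()
```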
We ruled out MoE instability by setting batch_size = mini_batch_size to enforce on-policy training (see the sketch below).
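To spell out why that setting enforces on-policy updates, here is a small illustrative helper (names are hypothetical, not verl config keys): with one mini-batch per rollout batch, each sample gets exactly one gradient step under the same policy that generated it, so the importance ratio pi_theta / pi_old stays at 1.

```python
def num_updates_per_rollout(train_batch_size: int, mini_batch_size: int) -> int:
    """Gradient steps taken on one rollout batch (hypothetical helper).

    With train_batch_size == mini_batch_size there is exactly one
    update per rollout, so every sample is scored by the policy that
    generated it (strictly on-policy) and any instability cannot come
    from stale off-policy samples interacting with MoE routing drift.
    """
    assert train_batch_size % mini_batch_size == 0
    return train_batch_size // mini_batch_size

assert num_updates_per_rollout(64, 64) == 1   # on-policy
assert num_updates_per_rollout(64, 16) == 4   # off-policy after the first step
```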
My only hypothesis is that there is an issue in the current gpt-oss model implementation in transformers that causes the gradient instability. Your investigation would be really appreciated.
Tried importance sampling for gpt-oss training. The good news is that the grad norm is no longer exploding, but the reward and val metrics are not looking good. My hypothesis is that:...
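For context, below is a minimal sketch of the standard clipped importance-sampling objective (the PPO-style form GRPO builds on). The function name and `clip_eps` default are illustrative, not verl's exact code; clipping bounds the ratio, which also bounds per-token gradient magnitude and is consistent with the grad norm no longer exploding.

```python
import torch

def clipped_is_loss(logprobs: torch.Tensor,
                    old_logprobs: torch.Tensor,
                    advantages: torch.Tensor,
                    clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped importance-sampling objective (sketch).

    ratio = pi_theta(a|s) / pi_old(a|s), computed from log-probs.
    Clamping the ratio to [1 - clip_eps, 1 + clip_eps] caps how far
    a single update can push the policy, bounding the gradient.
    """
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```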
I can work on supporting this.