HEJIAN SANG

7 comments of HEJIAN SANG

The current issue with training the GPT-OSS model: the grad_norm in GRPO grows too fast, which prevents the model from achieving reasonably good performance (see the sketch after the experiment list below).

* Train on gsm8k [PR](https://github.com/volcengine/verl/pull/3836), reasoning effort: medium reasoning...

Train on retool with a tool agent: [PR](https://github.com/volcengine/verl/pull/3837). grad_norm can grow as large as 1500.

* agent loop training using math-expression example: https://github.com/volcengine/verl/blob/main/recipe/langgraph_agent/example/run_gpt_oss_20b_bf16.sh
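For reference, a minimal sketch (not verl's actual trainer code) of how the reported grad_norm metric is typically obtained in PyTorch: it is the total L2 norm of all parameter gradients, which `clip_grad_norm_` returns *before* clipping, so values around 1500 mean the raw gradients are exploding even when clipping is enabled.

```python
import torch

def clip_and_log_grad_norm(model: torch.nn.Module, max_norm: float = 1.0) -> float:
    """Clip gradients in place and return the pre-clip total norm.

    Hypothetical helper for illustration only; the returned value is what a
    trainer would log as grad_norm.
    """
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    return float(total_norm)
```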

We ruled out MoE instability by setting batch_size equal to the mini-batch size to enforce on-policy updates.
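To spell out the on-policy argument (a sketch under my assumptions, not verl internals): with the rollout batch size equal to the mini-batch size there is exactly one optimizer step per rollout batch, so the policy that generated the samples is the same as the policy being updated and the importance ratio is exactly 1, which removes off-policy drift as a possible source of the blow-up.

```python
import torch

# Log-probs are recorded at rollout time; with a single update per batch,
# no optimizer step has happened yet, so the "new" policy equals the "old" one.
old_log_probs = torch.tensor([-1.2, -0.7, -2.3])   # illustrative values
new_log_probs = old_log_probs.clone()              # identical policy
ratio = torch.exp(new_log_probs - old_log_probs)   # importance weights, all 1.0
assert torch.allclose(ratio, torch.ones_like(ratio))
```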

My only hypothesis is that there is some issue in the current GPT-OSS model implementation in transformers that causes the gradient instability. Your investigation would be really appreciated.

Trying importance sampling for GPT-OSS training. The good news is that the grad norm is no longer exploding, but the reward and val metrics are still not looking good. My hypothesis is that:...
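For context, a hedged sketch of the kind of token-level clipped importance-sampling loss I have in mind (PPO/GRPO style; the names and shapes are illustrative, not the exact verl implementation). Clamping the ratio bounds the per-token gradient, which is consistent with grad_norm no longer exploding:

```python
import torch

def clipped_is_policy_loss(new_log_probs: torch.Tensor,
                           old_log_probs: torch.Tensor,
                           advantages: torch.Tensor,
                           mask: torch.Tensor,
                           clip_eps: float = 0.2) -> torch.Tensor:
    """PPO/GRPO-style clipped importance-sampling objective (illustrative)."""
    ratio = torch.exp(new_log_probs - old_log_probs)           # importance weights
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    per_token = -torch.minimum(unclipped, clipped)             # pessimistic bound
    return (per_token * mask).sum() / mask.sum().clamp(min=1)  # mean over valid tokens
```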

I can work on supporting this.