[sglang] Feat: Search Tool Invocation in Multi-Turn RL Training
[Feat]: Search Tool Invocation in Multi-Turn RL Training
Checklist Before
- [x] Search for similar PR(s).
What does this PR do?
-
As veRL users, we want the model to invoke designated tools during the Actor rollout phase and seamlessly integrate their outputs into the training pipeline.
-
We have added search-tool invocation capability to veRL-sglang MultiTurnRL, enabling the model to issue retrieval requests during Actor rollout and directly leverage the returned results for training.
-
providing the community with a reimplementation similar to searchR1.
-
Training curves on Wandb: search_async_rl
-
[x] Third-party training reproducibility verification has been successfully completed.
-
Thanks to the SGlang team and the author of searchR1 for their efficient support!
Project Member:
- Ling Chang (Author)
- Bowen Jin (Advisor on Training)
- Xiaocheng Wang (Advisor on Implementation)
- Nan Jiang (Reproduce)
- Chenyang Zhao (PM)
- Xiang Long (Reviewer, PM)
Checklist Before Submitting
- [x] Read the Contribute Guide.
- [x] Apply pre-commit checks.
- [x] Add [BREAKING] to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the docs.
- [x] Add CI test(s) if necessary.
How to Use
Refer to verl-multiturn-searchR1-like.md or verl-multiturn-searchR1-like_ZH.md in the Awesome-ML-SYS-Tutorial repository.
https://wandb.ai/lingchang-ustc/search_async_rl/workspace?nw=nwuserlingchang
@eric-haibin-lin Here is the training curve
LGTM
great job
@Lins-01 Thanks for your contributions for veRL! I noticed some of the code appears to be referenced from the following projects:
fufankeji/fufan-chat-api encoder.py RUC-NLPIR/FlashRAG utils.py
Could you please to check the licenses of the referenced code to avoid any potential legal issues?
Some code may have been duplicated with an existing PR.
https://github.com/volcengine/verl/pull/1525/files#
Also, a unit test for the search tooling functionality is welcome.
@Lins-01 Thanks for your contributions for veRL! I noticed some of the code appears to be referenced from the following projects:
fufankeji/fufan-chat-api encoder.py RUC-NLPIR/FlashRAG utils.py
Could you please to check the licenses of the referenced code to avoid any potential legal issues?
Some code may have been duplicated with an existing PR.
https://github.com/volcengine/verl/pull/1525/files#
Also, a unit test for the search tooling functionality is welcome.
Appreciate the reminder and encouragement! License attributions have been added — will follow up with the unit test soon.
LGTM. But I strongly suggest the author add necessary unit tests for the search tool using.
will add it with mock search api.
@feifeibear
LGTM. But I strongly suggest the author add necessary unit tests for the search tool using.
We've added the unit tests for the search tool as recommended.
great!
Using the merged patch in this PR, I reran training on the original Search-R1 Wikipedia corpus (GRPO schedule, no additional data) and evaluated the resulting model.
| Dataset | Search-R1 paper (Qwen2.5-3B) | This run |
|---|---|---|
| NQ | 0.397 | 0.406 |
| TriviaQA | 0.565 | 0.582 |
| PopQA | 0.391 | 0.420 |
| HotpotQA | 0.331 | 0.338 |
| 2Wiki | 0.310 | 0.332 |
| Musique | 0.124 | 0.111 |
| Bamboogle | 0.232 | 0.296 |
💾 Weights & full inference script are available on the Hub:
https://huggingface.co/Seungyoun/qwen2.5-3b-it_searchR1-like-multiturn
Everything matches the expected behaviour—tool calls, multi-turn rollout and scores. Thanks again for the thorough work! @Lins-01
Using the merged patch in this PR, I reran training on the original Search-R1 Wikipedia corpus (GRPO schedule, no additional data) and evaluated the resulting model.
Dataset Search-R1 paper (Qwen2.5-3B) This run NQ 0.397 0.406 TriviaQA 0.565 0.582 PopQA 0.391 0.420 HotpotQA 0.331 0.338 2Wiki 0.310 0.332 Musique 0.124 0.111 Bamboogle 0.232 0.296 💾 Weights & full inference script are available on the Hub: https://huggingface.co/Seungyoun/qwen2.5-3b-it_searchR1-like-multiturn
Everything matches the expected behaviour—tool calls, multi-turn rollout and scores. Thanks again for the thorough work! @Lins-01
Wow, thank you for the kind words! Really appreciate your recognition—it’s truly encouraging for our team. If possible, could you share the training hyperparameters you used? I believe it would be helpful for the community (mine were slightly lower—haha).@SeungyounShin