verl icon indicating copy to clipboard operation
verl copied to clipboard

[sglang] Feat: Search Tool Invocation in Multi-Turn RL Training

Open Lins-01 opened this issue 7 months ago • 8 comments

[Feat]: Search Tool Invocation in Multi-Turn RL Training

Checklist Before

  • [x] Search for similar PR(s).

What does this PR do?

  • As veRL users, we want the model to invoke designated tools during the Actor rollout phase and seamlessly integrate their outputs into the training pipeline.

  • We have added search-tool invocation capability to veRL-sglang MultiTurnRL, enabling the model to issue retrieval requests during Actor rollout and directly leverage the returned results for training.

  • providing the community with a reimplementation similar to searchR1.

  • Training curves on Wandb: search_async_rl

  • [x] Third-party training reproducibility verification has been successfully completed.

  • Thanks to the SGlang team and the author of searchR1 for their efficient support!

Project Member:

  • Ling Chang (Author)
  • Bowen Jin (Advisor on Training)
  • Xiaocheng Wang (Advisor on Implementation)
  • Nan Jiang (Reproduce)
  • Chenyang Zhao (PM)
  • Xiang Long (Reviewer, PM)

Checklist Before Submitting

  • [x] Read the Contribute Guide.
  • [x] Apply pre-commit checks.
  • [x] Add [BREAKING] to the PR title if it breaks any API.
  • [x] Update the documentation about your changes in the docs.
  • [x] Add CI test(s) if necessary.

How to Use

Refer to verl-multiturn-searchR1-like.md or verl-multiturn-searchR1-like_ZH.md in the Awesome-ML-SYS-Tutorial repository.

Lins-01 avatar May 25 '25 10:05 Lins-01

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar May 25 '25 10:05 CLAassistant

https://wandb.ai/lingchang-ustc/search_async_rl/workspace?nw=nwuserlingchang

@eric-haibin-lin Here is the training curve

zhaochenyang20 avatar May 27 '25 00:05 zhaochenyang20

LGTM

SwordFaith avatar May 28 '25 12:05 SwordFaith

great job

zhaochenyang20 avatar May 28 '25 17:05 zhaochenyang20

@Lins-01 Thanks for your contributions for veRL! I noticed some of the code appears to be referenced from the following projects:

fufankeji/fufan-chat-api encoder.py RUC-NLPIR/FlashRAG utils.py

Could you please to check the licenses of the referenced code to avoid any potential legal issues?

Some code may have been duplicated with an existing PR.

https://github.com/volcengine/verl/pull/1525/files#

Also, a unit test for the search tooling functionality is welcome.

feifeibear avatar May 29 '25 04:05 feifeibear

@Lins-01 Thanks for your contributions for veRL! I noticed some of the code appears to be referenced from the following projects:

fufankeji/fufan-chat-api encoder.py RUC-NLPIR/FlashRAG utils.py

Could you please to check the licenses of the referenced code to avoid any potential legal issues?

Some code may have been duplicated with an existing PR.

https://github.com/volcengine/verl/pull/1525/files#

Also, a unit test for the search tooling functionality is welcome.

Appreciate the reminder and encouragement! License attributions have been added — will follow up with the unit test soon.

Lins-01 avatar May 29 '25 10:05 Lins-01

LGTM. But I strongly suggest the author add necessary unit tests for the search tool using.

will add it with mock search api.

zhaochenyang20 avatar May 29 '25 17:05 zhaochenyang20

@feifeibear

LGTM. But I strongly suggest the author add necessary unit tests for the search tool using.

We've added the unit tests for the search tool as recommended.

Lins-01 avatar May 30 '25 07:05 Lins-01

great!

zhaochenyang20 avatar May 31 '25 05:05 zhaochenyang20

Using the merged patch in this PR, I reran training on the original Search-R1 Wikipedia corpus (GRPO schedule, no additional data) and evaluated the resulting model.

Dataset Search-R1 paper (Qwen2.5-3B) This run
NQ 0.397 0.406
TriviaQA 0.565 0.582
PopQA 0.391 0.420
HotpotQA 0.331 0.338
2Wiki 0.310 0.332
Musique 0.124 0.111
Bamboogle 0.232 0.296

💾 Weights & full inference script are available on the Hub:
https://huggingface.co/Seungyoun/qwen2.5-3b-it_searchR1-like-multiturn

Everything matches the expected behaviour—tool calls, multi-turn rollout and scores. Thanks again for the thorough work! @Lins-01

SeungyounShin avatar Jun 03 '25 05:06 SeungyounShin

Using the merged patch in this PR, I reran training on the original Search-R1 Wikipedia corpus (GRPO schedule, no additional data) and evaluated the resulting model.

Dataset Search-R1 paper (Qwen2.5-3B) This run NQ 0.397 0.406 TriviaQA 0.565 0.582 PopQA 0.391 0.420 HotpotQA 0.331 0.338 2Wiki 0.310 0.332 Musique 0.124 0.111 Bamboogle 0.232 0.296 💾 Weights & full inference script are available on the Hub: https://huggingface.co/Seungyoun/qwen2.5-3b-it_searchR1-like-multiturn

Everything matches the expected behaviour—tool calls, multi-turn rollout and scores. Thanks again for the thorough work! @Lins-01

Wow, thank you for the kind words! Really appreciate your recognition—it’s truly encouraging for our team. If possible, could you share the training hyperparameters you used? I believe it would be helpful for the community (mine were slightly lower—haha).@SeungyounShin

Lins-01 avatar Jun 04 '25 15:06 Lins-01