When LLM Agents Meet Reinforcement Learning
AgentsMeetRL is an awesome list that summarizes open-source repositories for training LLM Agents using reinforcement learning:
- A project counts as an agent project if it involves at least one of the following: multi-turn interaction or tool use. Tool-Integrated Reasoning (TIR) projects are therefore included in this repo.
- ⚠️ The entries are based on code analysis of open-source repositories performed with GitHub Copilot Agent, so some details may be inaccurate. The results were manually reviewed, but errors and omissions may remain. If you find any, please let us know through issues or PRs; corrections are warmly welcome!
- We focus on the reinforcement learning frameworks, RL algorithms, rewards, and environments that each project depends on, as a reference for how these open-source projects make their technical choices. See [Click to view technical details] under each table.
- Feel free to submit your own projects at any time; contributions are welcome!
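The inclusion rule above can be written as a small predicate. This is only an illustrative sketch: `ProjectTraits` and `is_agent_project` are hypothetical names introduced here, not part of this repository or of any listed project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectTraits:
    """Hypothetical summary of a repository; not a schema used by this list."""
    name: str
    multi_turn: bool  # the agent interacts over several turns
    tool_use: bool    # the agent calls tools (search, code execution, ...)


def is_agent_project(traits: ProjectTraits) -> bool:
    """Inclusion rule: multi-turn interaction OR tool use (at least one)."""
    return traits.multi_turn or traits.tool_use


# A single-turn Tool-Integrated Reasoning (TIR) project qualifies via tool use.
tir_project = ProjectTraits(name="example-tir", multi_turn=False, tool_use=True)
# A single-turn project without tools does not qualify.
single_turn = ProjectTraits(name="example-single-turn", multi_turn=False, tool_use=False)

print(is_agent_project(tir_project))  # True
print(is_agent_project(single_turn))  # False
```

Because the rule is a plain "or", a project that is both multi-turn and tool-using also qualifies.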
Enumerations used in the tables:
- Reward Type:
  - External Verifier: e.g., a compiler or math solver
  - Rule-Based: e.g., a LaTeX parser with exact match scoring
  - Model-Based: e.g., a trained verifier LLM or reward LLM
  - Custom
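To make the reward categories concrete, here is a minimal sketch of what each kind of reward function might look like. Everything in it is an assumption for illustration: the function names are hypothetical, normalized string matching stands in for a LaTeX parser, Python's built-in `compile()` stands in for a real compiler, and the model-based scorer is an arbitrary callable rather than an actual verifier or reward LLM.

```python
from enum import Enum
from typing import Callable


class RewardType(Enum):
    """The reward categories enumerated above."""
    EXTERNAL_VERIFIER = "External Verifier"
    RULE_BASED = "Rule-Based"
    MODEL_BASED = "Model-Based"
    CUSTOM = "Custom"


def _normalize(text: str) -> str:
    # Collapse whitespace and ignore case before comparing answers.
    return " ".join(text.split()).lower()


def rule_based_reward(prediction: str, reference: str) -> float:
    """Rule-based: exact match after simple normalization (no LaTeX parsing)."""
    return 1.0 if _normalize(prediction) == _normalize(reference) else 0.0


def external_verifier_reward(source_code: str) -> float:
    """External verifier: reward 1.0 if the candidate code compiles.

    Python's built-in compile() is a stand-in for an external compiler.
    """
    try:
        compile(source_code, "<candidate>", "exec")
    except SyntaxError:
        return 0.0
    return 1.0


def model_based_reward(response: str, scorer: Callable[[str], float]) -> float:
    """Model-based: delegate to a scorer and clamp the result to [0, 1].

    In practice the scorer would wrap a verifier LLM or reward LLM.
    """
    return max(0.0, min(1.0, scorer(response)))


print(rule_based_reward(" 42 ", "42"))           # 1.0
print(external_verifier_reward("x = 1 +"))       # 0.0
print(model_based_reward("hi", lambda _: 1.7))   # 1.0
```

Projects marked "All" in the Reward Type column presumably combine several of these categories; the exact combination differs per repository.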
Base Framework
Click to view technical details

| GitHub Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool Usage |
| --- | --- | --- | --- | --- | --- | --- | --- |
| siiRL | PPO/GRPO/CPGD/MARFT | Multi | Both | Multi | LLM/VLM/LLM-MAS PostTraining | Model/Rule | Planned |
| slime | GRPO/GSPO/REINFORCE++ | Single | Both | Both | Math/Code | External Verifier | Yes |
| agent-lightning | PPO/Custom/Automatic Prompt Optimization | Multi | Outcome | Multi | Calculator/SQL | Model/External/Rule | Yes |
| AReaL | PPO | Both | Outcome | Both | Math/Code | External | Yes |
| ROLL | PPO/GRPO/REINFORCE++/TOPR/RAFT++ | Multi | Both | Multi | Math/QA/Code/Alignment | All | Yes |
| MARTI | PPO/GRPO/REINFORCE++/TTRL | Multi | Both | Multi | Math | All | Yes |
| RL2 | Dr. GRPO/PPO/DPO | Single | Both | Both | QA/Dialogue | Rule/Model/External | Yes |
| verifiers | GRPO | Multi | Outcome | Both | Reasoning/Math/Code | All | Code |
| oat | PPO/GRPO | Single | Outcome | Multi | Math/Alignment | External | No |
| veRL | PPO/GRPO | Single | Outcome | Both | Math/QA/Reasoning/Search | All | Yes |
| OpenRLHF | PPO/REINFORCE++/GRPO/DPO/IPO/KTO/RLOO | Multi | Both | Both | Dialogue/Chat/Completion | Rule/Model/External | Yes |
| trl | PPO/GRPO/DPO | Single | Both | Single | QA | Custom | No |
General/MultiTask
Click to view technical details

| GitHub Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool Usage |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DEPO | KTO + Efficiency Loss | Single | Both | Multi | Agent (BabyAI/WebShop) | Rule | Yes |
| SPEAR | GRPO/GiGPO + SIL | Single | Both | Multi | Math/Agent | Rule/External | Yes (Search, Sandbox, Browser) |
| AgentRL | GRPO/REINFORCE++/RLOO/ReMax/GAE | Single | Outcome | Multi | Agent Tasks | External | Yes |
| AgentGym-RL | PPO/GRPO/RLOO/REINFORCE++ | Single | Outcome | Multi | Web/Search/Game/Embodied/Science | Rule/Model/External | Yes (Web, Search, Env APIs) |
| Agent_Foundation_Models | DAPO/PPO | Single | Outcome | Single | QA/Code/Math | Rule/External | Yes |
| SPA-RL-Agent | PPO | Single | Process | Multi | Navigation/Web/TextGame | Model | No |
| verl-agent | PPO/GRPO/GiGPO/DAPO/RLOO/REINFORCE++ | Multi | Both | Multi | Phone Use/Math/Code/Web/TextGame | All | Yes |
Search/Research/Web
Click to view technical details

| GitHub Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool Usage |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ReSeek | GRPO/PPO | Single | Both | Multi | QA/Search | Rule | Search/JUDGE |
| Tree-GRPO | GRPO/Tree-GRPO | Single | Outcome | Multi | Search | Rule | Search |
| ASearcher | PPO/GRPO + Decoupled PPO | Single | Outcome | Multi | Math/Code/SearchQA | External/Rule | Yes |
| Kimi-Researcher | REINFORCE | Single | Outcome | Multi | Research | Outcome | Search, Browse, Coding |
| TTI | REINFORCE/BC | Single | Outcome | Multi | Web | External | Web Browsing |
| R-Search | PPO/GRPO | Single | Both | Multi | QA/Search | All | Yes |
| R1-Searcher-plus | Custom | Single | Outcome | Multi | Search | Model | Search |
| StepSearch | PPO | Single | Process | Multi | QA | Model | Search |
| AutoRefine | PPO/GRPO | Multi | Both | Multi | RAG QA | Rule | Search |
| ZeroSearch | PPO/GRPO/REINFORCE | Single | Outcome | Multi | QA/Search | Rule | Yes |
| WebThinker | DPO | Single | Outcome | Multi | Reasoning/QA/Research | Model/External | Web Browsing |
| DeepResearcher | PPO/GRPO | Multi | Outcome | Multi | Research | All | Yes |
| Search-R1 | PPO/GRPO | Single | Outcome | Multi | Search | All | Search |
| R1-Searcher | PPO/DPO | Single | Both | Multi | Search | All | Yes |
| C-3PO | PPO | Multi | Outcome | Multi | Search | Model | Yes |
| Search-o1 | N/A | Single | N/A | Multi | Math/Science QA/Code/Open QA | N/A | Web Search |
| WebAgent | DAPO | Multi | Process | Multi | Web | Model | Yes |
GUI
Click to view technical details

| GitHub Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool Usage |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MobileAgent | semi-online RL | Single | Both | Multi | MobileGUI/Automation | Rule | Yes |
| InfiGUI-G1 | AEPO | Single | Outcome | Single | GUI/Grounding | Rule | No |
| Grounding-R1 | GRPO | Single | Outcome | Multi | GUI Grounding | Model | Yes |
| AgentCPM-GUI | GRPO | Single | Outcome | Multi | Mobile GUI | Model | Yes |
| SE-GUI | GRPO | Single | Both | Single | GUI Grounding | Rule | Yes |
| ARPO | GRPO | Single | Outcome | Multi | GUI | External | Computer Use |
| GUI-G1 | GRPO | Single | Outcome | Single | GUI | Rule/External | No |
| GUI-R1 | GRPO | Single | Outcome | Multi | GUI | Rule | No |
| UI-R1 | GRPO | Single | Process | Both | GUI | Rule | Computer/Phone Use |
Tool
Click to view technical details

| GitHub Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool Usage |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MiroRL | GRPO | Single | Both | Multi | Reasoning/Planning/ToolUse | Rule-based | MCP |
| verl-tool | PPO/GRPO | Single | Both | Both | Math/Code | Rule/External | Yes |
| Multi-Turn-RL-Agent | GRPO | Single | Both | Multi | Tool-use/Math | Rule/External | Yes |
| Tool-N1 | PPO | Single | Outcome | Multi | Math/Dialogue | All | Yes |
| Tool-Star | PPO/DPO/ORPO/SimPO/KTO | Single | Outcome | Multi | Multi-modal/Tool Use/Dialogue | Model/External | Yes |
| RL-Factory | GRPO | Multi | Both | Multi | Tool-use/NL2SQL | All | MCP |
| ReTool | PPO | Single | Outcome | Multi | Math | External | Code |
| AWorld | GRPO | Both | Outcome | Multi | Search/Web/Code | External/Rule | Yes |
| Agent-R1 | PPO/GRPO | Single | Both | Multi | Tool-use/QA | Model | Yes |
| ReCall | PPO/GRPO/RLOO/REINFORCE++/ReMax | Single | Outcome | Multi | Tool-use/Math/QA | All | Yes |
TextGame
Click to view technical details

| GitHub Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool Usage |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ARIA | REINFORCE | Both | Process | Multi | Negotiation/Bargaining | Other | No |
| AMPO | BC/AMPO (GRPO improvement) | Multi | Outcome | Multi | Social Interaction | Model-based | No |
| Trinity-RFT | PPO/GRPO | Single | Outcome | Both | Math/TextGame/Web | All | Yes |
| VAGEN | PPO/GRPO | Single | Both | Multi | TextGame/Navigation | All | Yes |
| ART | GRPO | Multi | Both | Multi | TextGame | All | Yes |
| OpenManus-RL | PPO/DPO/GRPO | Multi | Outcome | Multi | TextGame | All | Yes |
| RAGEN | PPO/GRPO | Single | Both | Multi | TextGame | All | Yes |
Code
Click to view technical details

| GitHub Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool Usage |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PPP-Agent | PPP-RL | Single | Both | Multi | SWE/Research | Rule+Model | Search, Ask, Browse |
| RepoDeepSearch | GRPO | Single | Both | Multi | Search/Repair | Rule/External | Yes |
| MedAgentGym | SFT/DPO/PPO/GRPO | Single | Outcome | Multi | Medical/Code | External | Yes |
| CURE | PPO | Single | Outcome | Single | Code | External | No |
| MASLab | NO RL | Multi | Outcome | Multi | Code/Math/Reasoning | External | Yes |
| Time-R1 | PPO/GRPO/DPO | Multi | Outcome | Multi | Temporal | All | Code |
| ML-Agent | Custom | Single | Process | Multi | Code | All | Yes |
| SkyRL | PPO/GRPO | Single | Outcome | Multi | Math/Code | All | Code |
| digitalhuman | PPO/GRPO/ReMax/RLOO | Multi | Outcome | Multi | Empathy/Math/Code/MultimodalQA | Rule/Model/External | Yes |
| sweet_rl | DPO | Multi | Process | Multi | Design/Code | Model | Web Browsing |
| rllm | PPO/GRPO | Single | Outcome | Multi | Code Edit | External | Yes |
| open-r1 | GRPO | Single | Outcome | Single | Math/Code | All | Yes |
QA (Reasoning/Math)
Click to view technical details

| GitHub Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool Usage |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SafeSearch | PPO (GAE/GRPO) | Single | Both | Multi | QA/Search | Rule + Model | Search |
| Agent0 | ADPO | Multi | Process | Multi | Math/Visual | Model/Verifier | Yes |
| KG-R1 | GRPO/PPO | Single | Both | Multi | KGQA | Rule/Model | KG Retrieval |
| AgentFlow | Flow-GRPO | Single | Outcome | Multi | Search/Math/QA | Model/External | Yes |
| ARPO | GRPO | Single | Outcome | Multi | Math/Coding | Model/Rule | Yes |
| terminal-bench-rl | GRPO | Single | Outcome | Multi | Coding/Terminal | Model+External Verifier | Yes |
| MOTIF | GRPO | Single | Outcome | Multi | QA | Rule | No |
| cmriat/l0 | PPO | Multi | Process | Multi | QA | All | Yes |
| agent-distillation | PPO | Single | Process | Multi | QA/Math | External | Yes |
| VDeepEyes | PPO/GRPO | Multi | Process | Multi | VQA | All | Yes |
| EasyR1 | GRPO | Single | Process | Multi | Vision-Language | Model | Yes |
| AutoCoA | GRPO | Multi | Outcome | Multi | Reasoning/Math/QA | All | Yes |
| ToRL | GRPO | Single | Outcome | Single | Math | Rule/External | Yes |
| ReMA | PPO | Multi | Outcome | Multi | Math | Rule | No |
| Agentic-Reasoning | Custom | Single | Process | Multi | QA/Math | External | Web Browsing |
| SimpleTIR | PPO/GRPO (with extensions) | Single | Outcome | Multi | Math, Coding | All | Yes |
| openrlhf_async_pipline | PPO/REINFORCE++/DPO/RLOO | Single | Outcome | Multi | Dialogue/Reasoning/QA | All | No |
Memory

| GitHub Repo | Stars | Date | Org | Paper Link | RL Framework |
| --- | --- | --- | --- | --- | --- |
| MEM1 |  | 2025.7 | MIT | Paper | veRL (based on Search-R1) |
| Memento |  | 2025.6 | UCL, Huawei | Paper | Custom |
| MemAgent |  | 2025.6 | ByteDance, Tsinghua-SIA | Paper | veRL |

Click to view technical details

| GitHub Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool Usage |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MEM1 | PPO/GRPO | Single | Outcome | Multi | WebShop/GSM8K/QA | Rule/Model | Yes |
| Memento | soft Q-Learning | Single | Outcome | Multi | Research/QA/Code/Web | External/Rule | Yes |
| MemAgent | PPO/GRPO/DPO | Multi | Outcome | Multi | Long-context QA | Rule/Model/External | Yes |
Embodied

| GitHub Repo | Stars | Date | Org | Paper Link | RL Framework |
| --- | --- | --- | --- | --- | --- |
| Embodied-R1 |  | 2025.6 | Tianjin University | Paper | veRL |
| STeCa |  | 2025.2 | The Hong Kong Polytechnic University | Paper | FastChat/TRL |

Click to view technical details

| GitHub Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool Usage |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Embodied-R1 | GRPO | Single | Outcome | Single | Grounding/Waypoint | Rule | No |
| STeCa | DPO (RFT) | Single | Both | Multi | Embodied/Household | Rule/MC | Environment Actions |
Biomedical
Click to view technical details

| GitHub Repo | RL Algorithm | Single/Multi Agent | Outcome/Process Reward | Single/Multi Turn | Task | Reward Type | Tool Usage |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MMedAgent-RL | Unknown | Multi | Unknown | Unknown | Unknown | Unknown | Unknown |
| DoctorAgent-RL | GRPO | Multi | Both | Multi | Consultation/Diagnosis | Model/Rule | No |
| Biomni | TBD | Single | TBD | Single | scRNAseq/CRISPR/ADMET/Knowledge | TBD | Yes |
⛰️ Environment
Under Review/Waiting for Open Source
Citation
If you find this repository useful, please consider citing it:
@misc{agentsMeetRL,
  title={When LLM Agents Meet Reinforcement Learning: A Comprehensive Survey},
  author={AgentsMeetRL Contributors},
  year={2025},
  url={https://github.com/thinkwee/agentsMeetRL}
}
Made with ❤️ by the AgentsMeetRL community