AAPO
Implementation of the AAPO paper (arXiv: 2505.14264v2)
AAPO: Enhancing the Reasoning Capabilities of LLMs with Advantage Momentum
Paper link: https://arxiv.org/abs/2505.14264v2
Authors: Jian Xiong, Jingbo Zhou, Jingyong Ye, Qiang Huang, Dejing Dou
Introduction
In this paper, we propose Advantage-Augmented Policy Optimization (AAPO), a novel RL algorithm that optimizes the cross-entropy (CE) loss using advantages enhanced through a momentum-based estimation scheme. AAPO effectively mitigates the inefficiencies associated with group relative advantage estimation. Experimental results on multiple mathematical reasoning benchmarks and model series demonstrate the superior performance of AAPO.
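As a rough illustration of the idea described above, here is a minimal NumPy sketch of the three ingredients: a GRPO-style group-relative advantage, a momentum-smoothed baseline used to augment it, and an advantage-weighted CE objective. The function names, the exact blend rule, and the beta value are illustrative assumptions, not the paper's equations; see the paper for the precise formulation.

```python
import numpy as np

def group_relative_advantage(rewards, eps=1e-8):
    """GRPO-style group-relative advantage: normalize each sampled
    response's reward within its group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def momentum_augmented_advantage(rewards, baseline, beta=0.9):
    """Illustrative momentum-based enhancement (an assumption, not the
    paper's exact scheme): keep an exponential moving average of the
    group mean reward and add each reward's gap to that baseline on top
    of the group-relative advantage."""
    r = np.asarray(rewards, dtype=np.float64)
    baseline = beta * baseline + (1.0 - beta) * r.mean()
    adv = group_relative_advantage(r) + (r - baseline)
    return adv, baseline

def advantage_weighted_ce_loss(logprobs, advantages):
    """CE loss on the sampled tokens, weighted by the (detached)
    augmented advantages, i.e. advantage-weighted log-likelihood."""
    lp = np.asarray(logprobs, dtype=np.float64)
    adv = np.asarray(advantages, dtype=np.float64)
    return float(-(adv * lp).mean())
```

In practice the advantages would be treated as constants (no gradient flows through them), so the policy gradient reduces to weighting the usual CE loss on each sampled response.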
Here are the main experimental results from our paper. More results, including the ablation study, can be found in the paper.


Environment setup
Training environment
conda create -n train python=3.11
conda activate train
pip install -r requirements.txt
Training
Train DeepSeek-R1-Distill-Qwen-1.5B model
cd open-rs
bash train.sh
Train Qwen2.5-Math-7B model
cd open-r1
bash train.sh
Train Llama series models
Set max_completion_length=3072 and max_prompt_length=1024 in train.sh, set dataset_name to SimpleRL-Zoo-Data/simplelr_abel_level3to5 in the config file, and clear the system prompt.
cd open-r1
bash train.sh
Evaluation
Evaluate Qwen series models
Evaluation on a single benchmark
bash single_eval.sh
Evaluation on all benchmarks
bash auto_eval.sh
Evaluate Llama series models
Please refer to the SimpleRL-Reason repo.
Contact
If you have any questions, please contact jianxiong_ AT outlook DOT com.
Citation
If you find anything useful in this work, please cite:
@misc{xiong2025aapoenhancingreasoningcapabilities,
  title={AAPO: Enhancing the Reasoning Capabilities of LLMs with Advantage Momentum},
  author={Jian Xiong and Jingbo Zhou and Jingyong Ye and Qiang Huang and Dejing Dou},
  year={2025},
  eprint={2505.14264},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2505.14264},
}
Acknowledgments
We are grateful for the foundational code provided by SimpleRL-Reason, open-rs, and GPG. Use of their resources implies agreement to their respective licenses. Our project benefits greatly from these contributions, and we acknowledge their significant impact on our work.