multiturn_eval

Open albertimff opened this issue 1 month ago • 0 comments

Summary

Purpose: reuse the training AgentLoop rollout for single-run multi-turn eval without PPO, decoupling eval from training.
How to run: an example script for async vLLM on GSM8K, covering the key Hydra knobs, outputs, and checkpoint handling.
Checkpoints: supports loading FSDP/FSDP2 training checkpoints via built-in DeviceMesh/process-group compatibility patches.
Validated scenarios: GSM8K, Geo3K, and multimodal; async rollout with vLLM at tensor-parallel size 1 and 2 (TP1/TP2) passes in all cases on FSDP checkpoints.
Extensibility: add custom metrics via aggregate_summary / collect_sample_records or AgentLoop agent_metrics.
Note: SGLang tensor-parallel (TP) compatibility is still missing (tp=1 fails, multi-TP is untested); a fix is planned for later.
No source code changes; this is a documentation-only addition.
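
To make the extensibility point concrete, here is a minimal sketch of what a custom metric might look like. The issue names aggregate_summary and collect_sample_records as hooks but does not show their signatures, so the function names, the record fields (sample_id, score, num_turns), and the return shape below are all assumptions made for illustration, not the project's actual API.

```python
from statistics import mean
from typing import Any


def example_collect_sample_records(samples: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Hypothetical stand-in for a collect_sample_records hook.

    Keeps only the per-sample fields the summary needs. The field names
    are assumed for this sketch and may not match the real rollout output.
    """
    records = []
    for sample in samples:
        records.append(
            {
                "sample_id": sample["sample_id"],
                "score": float(sample["score"]),
                "num_turns": int(sample["num_turns"]),
            }
        )
    return records


def example_aggregate_summary(records: list[dict[str, Any]]) -> dict[str, Any]:
    """Hypothetical stand-in for an aggregate_summary hook.

    Reduces per-sample records to run-level metrics, including a custom
    multi-turn metric (mean and max number of turns per sample).
    """
    if not records:
        return {"num_samples": 0}
    return {
        "num_samples": len(records),
        "mean_score": mean(r["score"] for r in records),
        "mean_num_turns": mean(r["num_turns"] for r in records),
        "max_num_turns": max(r["num_turns"] for r in records),
    }


if __name__ == "__main__":
    # Made-up rollout results, used only to exercise the sketch.
    fake_samples = [
        {"sample_id": "a", "score": 1, "num_turns": 2},
        {"sample_id": "b", "score": 0, "num_turns": 4},
        {"sample_id": "c", "score": 1, "num_turns": 3},
    ]
    print(example_aggregate_summary(example_collect_sample_records(fake_samples)))
```

Splitting collection from aggregation keeps per-sample records available for inspection while the summary stays a pure function of those records; whether the real hooks are organized this way should be checked against the added documentation.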

Dec 01 '25 05:12 albertimff