Gym
Docs + Environment pattern: Modeling the User with an LLM During Multi-Turn Training
Background
Multi-turn conversational training often requires simulating realistic user responses during rollout collection.
Problem
Users need guidance on:
- When to use LLM-based user simulation
- How to architect and deploy the user simulator LLM
- Compute management: the user simulator LLM requires GPUs while resource servers run on CPU; how do the two communicate?
- How to call the user LLM from within the resource server
- How to prompt the user LLM for realistic behavior
- How to ensure user simulation quality and diversity
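To make the questions above concrete, here is a minimal sketch of how a resource server might call a user-simulator LLM between agent turns. The endpoint payload follows the OpenAI-compatible /chat/completions schema; the system prompt, model name, and helper names are illustrative assumptions, not Gym APIs. Note the role flip: from the simulator's perspective, the policy agent's messages are the "user" turns.

```python
from typing import Dict, List

# Illustrative persona prompt; in practice this would be templated per task.
USER_SIM_SYSTEM_PROMPT = (
    "You are simulating a customer. Stay in character, answer briefly, "
    "and only reveal details the agent explicitly asks for."
)

def flip_roles(transcript: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Re-label the agent transcript from the simulator's point of view:
    the agent's 'assistant' messages become 'user' input to the simulator,
    and prior simulated-user messages become its own 'assistant' output."""
    flipped = []
    for msg in transcript:
        if msg["role"] == "assistant":
            flipped.append({"role": "user", "content": msg["content"]})
        elif msg["role"] == "user":
            flipped.append({"role": "assistant", "content": msg["content"]})
    return flipped

def build_user_sim_request(transcript: List[Dict[str, str]],
                           model: str = "user-sim-8b") -> dict:
    """Build an OpenAI-compatible /chat/completions payload for the
    user simulator; the resource server would POST this to the endpoint."""
    return {
        "model": model,
        "messages": [{"role": "system", "content": USER_SIM_SYSTEM_PROMPT}]
        + flip_roles(transcript),
        "temperature": 0.9,  # higher temperature encourages user diversity
    }
```

The returned completion becomes the next "user" message appended to the agent's transcript before the policy model's next turn.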
Acceptance Criteria
Conceptual Guidance:
- [ ] When to use LLM user simulation
- [ ] Architecture overview: How resource servers call user simulator LLM
- [ ] Deployment topology options:
- Co-located: user simulator on same cluster as policy model and Gym
- Separate: dedicated cluster or external API
- [ ] Compute resource planning: sizing user simulator for rollout throughput
- [ ] Quality considerations: how to validate user LLM quality
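For the compute-planning item, a back-of-envelope estimate like the following could anchor the docs. All numbers and the helper name are illustrative assumptions; real sizing depends on batching, sequence lengths, and hardware.

```python
import math

def simulator_replicas_needed(
    concurrent_rollouts: int,
    user_turns_per_rollout: int,
    rollout_seconds: float,
    tokens_per_user_turn: int,
    replica_tokens_per_sec: float,
) -> int:
    """Estimate how many user-simulator serving replicas are needed so
    that user-turn generation does not bottleneck rollout collection."""
    # Aggregate simulator token demand per second across all live rollouts.
    demand = (
        concurrent_rollouts * user_turns_per_rollout * tokens_per_user_turn
    ) / rollout_seconds
    return max(1, math.ceil(demand / replica_tokens_per_sec))
```

For example, 256 concurrent rollouts with 4 user turns of ~150 tokens each over a 60-second rollout, against a replica sustaining 2,000 tokens/s, needs about 2 replicas.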
Implementation Examples:
- [ ] Configuring user simulator endpoint in resource server
- [ ] Calling user simulator LLM between agent turns
- [ ] Configuration examples for different deployment patterns (vLLM, external API)
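As a starting point for the configuration examples, the two deployment patterns might be captured as endpoint settings like these. The keys, URLs, and model names are hypothetical placeholders, not Gym's actual configuration schema.

```python
# Hypothetical user-simulator endpoint settings for the two topologies.
USER_SIM_CONFIGS = {
    # Co-located: vLLM serving the simulator on the same cluster as the
    # policy model and Gym, reached over an in-cluster service address.
    "vllm_colocated": {
        "base_url": "http://user-sim.svc.cluster.local:8000/v1",
        "model": "user-sim-8b",
        "api_key": None,  # no auth needed inside the cluster
    },
    # Separate: a dedicated cluster or external OpenAI-compatible API.
    "external_api": {
        "base_url": "https://api.example.com/v1",
        "model": "large-user-sim",
        "api_key": "USER_SIM_API_KEY",  # env var name, read at startup
    },
}

def user_sim_config(pattern: str) -> dict:
    """Look up endpoint settings for a named deployment pattern."""
    return USER_SIM_CONFIGS[pattern]
```

Keeping both patterns behind one OpenAI-compatible interface lets the resource server switch topologies via configuration alone.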
Priority
High - needed for realistic multi-turn conversational training
Related
- Should build on multi-turn training environments tutorial
- Could share deployment patterns with judge model tutorial