
Docs + Environment pattern: Modeling the User with an LLM During Multi-Turn Training

cwing-nvidia opened this issue 1 month ago · 0 comments

Background

During multi-turn conversational training, it is often necessary to simulate realistic user responses while collecting rollouts.

Problem

Users need guidance on:

  • When to use LLM-based user simulation
  • How to architect and deploy the user simulator LLM
  • Compute management: the user simulator LLM requires GPUs while resource servers run on CPUs - how do they communicate?
  • How to call the user LLM from within the resource server
  • How to prompt the user LLM for realistic behavior
  • How to ensure user simulation quality and diversity
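To make the "calling the user LLM" and "prompting for realistic behavior" points concrete, here is a minimal sketch assuming an OpenAI-compatible chat endpoint. All names here (`USER_SIM_SYSTEM_PROMPT`, the persona text, the role-flipping convention) are illustrative assumptions, not an existing Gym API. The key idea is that the simulator plays the *assistant* side of its own conversation, so the agent's turns must be presented to it as *user* turns:

```python
# Hypothetical sketch of preparing a request for a user-simulator LLM
# served behind an OpenAI-compatible chat endpoint. The prompt text and
# parameter choices are placeholders, not part of Gym.

USER_SIM_SYSTEM_PROMPT = (
    "You are simulating a customer talking to a support agent. "
    "Stay in character, answer in one or two sentences, and end the "
    "conversation with <END> once your issue is resolved."
)

def flip_roles(agent_history):
    """Re-label the conversation from the simulator's point of view:
    the agent's 'assistant' turns become 'user' turns and vice versa."""
    flipped = []
    for msg in agent_history:
        if msg["role"] == "assistant":
            flipped.append({"role": "user", "content": msg["content"]})
        elif msg["role"] == "user":
            flipped.append({"role": "assistant", "content": msg["content"]})
    return flipped

def build_user_sim_request(agent_history, model="user-sim", temperature=0.9):
    """Assemble the chat-completions payload for the user simulator.
    A higher temperature encourages diversity across simulated users."""
    return {
        "model": model,
        "temperature": temperature,
        "messages": [{"role": "system", "content": USER_SIM_SYSTEM_PROMPT}]
        + flip_roles(agent_history),
    }
```

The resource server would POST this payload to the simulator endpoint between agent turns; the role flip is what keeps the simulator generating user-side text rather than continuing as an assistant.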

Acceptance Criteria

Conceptual Guidance:

  • [ ] When to use LLM user simulation
  • [ ] Architecture overview: How resource servers call user simulator LLM
  • [ ] Deployment topology options:
    • Co-located: user simulator on same cluster as policy model and Gym
    • Separate: dedicated cluster or external API
  • [ ] Compute resource planning: sizing user simulator for rollout throughput
  • [ ] Quality considerations: how to validate user LLM quality
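The two deployment topologies above could be captured in a small endpoint-config sketch. Everything here is assumed for illustration: the URLs, model names, and the `USER_SIM_API_KEY` variable are placeholders, and the co-located case assumes a vLLM server exposing its usual OpenAI-compatible `/v1` route:

```python
# Hypothetical endpoint configs for the two deployment topologies.
# All URLs, model names, and env-var names are placeholders.
DEPLOYMENTS = {
    "colocated": {
        # vLLM serving the simulator on the same cluster as policy + Gym
        "base_url": "http://localhost:8001/v1",
        "api_key": "EMPTY",  # vLLM accepts a dummy key by default
        "model": "meta-llama/Llama-3.1-8B-Instruct",
    },
    "external": {
        # dedicated cluster or hosted API; key comes from the environment
        "base_url": "https://api.example.com/v1",
        "api_key_env": "USER_SIM_API_KEY",
        "model": "user-sim-large",
    },
}

def resolve_endpoint(pattern):
    """Look up the endpoint config for a deployment pattern."""
    if pattern not in DEPLOYMENTS:
        raise KeyError(f"unknown deployment pattern: {pattern}")
    return DEPLOYMENTS[pattern]
```

Keeping the choice behind a single lookup like this lets the resource server switch topologies via config without touching rollout code.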

Implementation Examples:

  • [ ] Configuring user simulator endpoint in resource server
  • [ ] Calling user simulator LLM between agent turns
  • [ ] Configuration examples for different deployment patterns (vLLM, external API)
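"Calling the user simulator LLM between agent turns" might look like the loop below. This is a sketch, not Gym code: `policy_fn` and `user_fn` stand in for the actual policy-model and user-simulator calls, and the `<END>` stop token is an assumed convention for letting the simulator terminate the episode:

```python
# Hypothetical rollout loop alternating policy and user-simulator turns.
# policy_fn and user_fn are injected callables standing in for real
# model calls; "<END>" is an assumed stop token emitted by the simulator.

def collect_rollout(policy_fn, user_fn, first_user_msg,
                    max_turns=4, stop_token="<END>"):
    history = [{"role": "user", "content": first_user_msg}]
    for _ in range(max_turns):
        # Agent (policy model) responds to the conversation so far.
        agent_reply = policy_fn(history)
        history.append({"role": "assistant", "content": agent_reply})

        # User simulator LLM produces the next user turn.
        user_reply = user_fn(history)
        if stop_token in user_reply:
            break  # simulator ended the conversation
        history.append({"role": "user", "content": user_reply})
    return history
```

Injecting the two callables keeps the loop testable with stubs and independent of the deployment topology: co-located vLLM, external API, or a cached simulator all plug in the same way.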

Priority

High - needed for realistic multi-turn conversational training

Related

  • Should build on multi-turn training environments tutorial
  • Could share deployment patterns with judge model tutorial

cwing-nvidia · Nov 13 '25 06:11