
Docs + Environment pattern: Modeling the User with an LLM During Multi-Turn Training

cwing-nvidia opened this issue 1 month ago · 0 comments

Background

During multi-turn conversational training, it is often necessary to simulate realistic user responses while collecting rollouts.

Problem

Users need guidance on:

  • When to use LLM-based user simulation
  • How to architect and deploy the user simulator LLM
  • Compute management: the user simulator LLM requires GPUs while resource servers run on CPUs - how do they communicate?
  • How to call the user LLM from within the resource server
  • How to prompt the user LLM for realistic behavior
  • How to ensure user simulation quality and diversity
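To make the "calling the user LLM" and "prompting for realistic behavior" points concrete, here is a minimal sketch assuming an OpenAI-compatible chat endpoint. All names here (`USER_SIM_SYSTEM_PROMPT`, the persona text, the role-flipping convention) are illustrative assumptions, not an existing Gym API. The key idea is that the simulator plays the *assistant* side of its own conversation, so the agent's turns must be presented to it as *user* turns:

```python
# Hypothetical sketch of preparing a request for a user-simulator LLM
# served behind an OpenAI-compatible chat endpoint. The prompt text and
# parameter choices are placeholders, not part of Gym.

USER_SIM_SYSTEM_PROMPT = (
    "You are simulating a customer talking to a support agent. "
    "Stay in character, answer in one or two sentences, and end the "
    "conversation with <END> once your issue is resolved."
)

def flip_roles(agent_history):
    """Re-label the conversation from the simulator's point of view:
    the agent's 'assistant' turns become 'user' turns and vice versa."""
    flipped = []
    for msg in agent_history:
        if msg["role"] == "assistant":
            flipped.append({"role": "user", "content": msg["content"]})
        elif msg["role"] == "user":
            flipped.append({"role": "assistant", "content": msg["content"]})
    return flipped

def build_user_sim_request(agent_history, model="user-sim", temperature=0.9):
    """Assemble the chat-completions payload for the user simulator.
    A higher temperature encourages diversity across simulated users."""
    return {
        "model": model,
        "temperature": temperature,
        "messages": [{"role": "system", "content": USER_SIM_SYSTEM_PROMPT}]
        + flip_roles(agent_history),
    }
```

The resource server would POST this payload to the simulator endpoint between agent turns; the role flip is what keeps the simulator generating user-side text rather than continuing as an assistant.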

Acceptance Criteria

Conceptual Guidance:

  • [ ] When to use LLM user simulation
  • [ ] Architecture overview: How resource servers call user simulator LLM
  • [ ] Deployment topology options:
    • Co-located: user simulator on same cluster as policy model and Gym
    • Separate: dedicated cluster or external API
  • [ ] Compute resource planning: sizing user simulator for rollout throughput
  • [ ] Quality considerations: how to validate user LLM quality
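The two deployment topologies above could be captured in a small endpoint-config sketch. Everything here is assumed for illustration: the URLs, model names, and the `USER_SIM_API_KEY` variable are placeholders, and the co-located case assumes a vLLM server exposing its usual OpenAI-compatible `/v1` route:

```python
# Hypothetical endpoint configs for the two deployment topologies.
# All URLs, model names, and env-var names are placeholders.
DEPLOYMENTS = {
    "colocated": {
        # vLLM serving the simulator on the same cluster as policy + Gym
        "base_url": "http://localhost:8001/v1",
        "api_key": "EMPTY",  # vLLM accepts a dummy key by default
        "model": "meta-llama/Llama-3.1-8B-Instruct",
    },
    "external": {
        # dedicated cluster or hosted API; key comes from the environment
        "base_url": "https://api.example.com/v1",
        "api_key_env": "USER_SIM_API_KEY",
        "model": "user-sim-large",
    },
}

def resolve_endpoint(pattern):
    """Look up the endpoint config for a deployment pattern."""
    if pattern not in DEPLOYMENTS:
        raise KeyError(f"unknown deployment pattern: {pattern}")
    return DEPLOYMENTS[pattern]
```

Keeping the choice behind a single lookup like this lets the resource server switch topologies via config without touching rollout code.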

Implementation Examples:

  • [ ] Configuring user simulator endpoint in resource server
  • [ ] Calling user simulator LLM between agent turns
  • [ ] Configuration examples for different deployment patterns (vLLM, external API)
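"Calling the user simulator LLM between agent turns" might look like the loop below. This is a sketch, not Gym code: `policy_fn` and `user_fn` stand in for the actual policy-model and user-simulator calls, and the `<END>` stop token is an assumed convention for letting the simulator terminate the episode:

```python
# Hypothetical rollout loop alternating policy and user-simulator turns.
# policy_fn and user_fn are injected callables standing in for real
# model calls; "<END>" is an assumed stop token emitted by the simulator.

def collect_rollout(policy_fn, user_fn, first_user_msg,
                    max_turns=4, stop_token="<END>"):
    history = [{"role": "user", "content": first_user_msg}]
    for _ in range(max_turns):
        # Agent (policy model) responds to the conversation so far.
        agent_reply = policy_fn(history)
        history.append({"role": "assistant", "content": agent_reply})

        # User simulator LLM produces the next user turn.
        user_reply = user_fn(history)
        if stop_token in user_reply:
            break  # simulator ended the conversation
        history.append({"role": "user", "content": user_reply})
    return history
```

Injecting the two callables keeps the loop testable with stubs and independent of the deployment topology: co-located vLLM, external API, or a cached simulator all plug in the same way.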

Priority

High - needed for realistic multi-turn conversational training

Related

  • Should build on multi-turn training environments tutorial
  • Could share deployment patterns with judge model tutorial

cwing-nvidia · Nov 13 '25 06:11