Add Deployment Topology Documentation
Background
Users have asked whether NeMo Gym runs in the same cluster as NeMo RL, which suggests the documentation leaves both the physical deployment model and the required compute resources unclear.
Problem
Our documentation doesn't explicitly cover:
- Whether Gym and RL should be co-located or run on separate clusters
- What compute resources each component needs (CPU vs GPU)
- How the orchestration works between the two systems
Acceptance Criteria
- [ ] Add a new "Deployment" section to documentation
- [ ] Explicitly cover cluster co-location strategy (default: same cluster)
- [ ] Document resource requirements:
  - NeMo Gym: CPU-only
  - Model serving (via NeMo RL): GPU-based
- [ ] Explain the orchestration model (NeMo RL manages training and exposes an HTTP endpoint that Gym consumes)
- [ ] Include guidance for when to use co-located vs. separate clusters (future: hybrid clusters)
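
For illustration only, the orchestration model in the criteria above could be sketched roughly as follows. This is a minimal stand-in that uses only the Python standard library, not the NeMo RL or NeMo Gym API: the stub server, the `/generate` path, the JSON payload shape, and the names `StubModelHandler`, `query_model`, and `run_demo` are all assumptions made up for this sketch. It shows a CPU-only client process (playing the role of Gym) consuming an HTTP endpoint exposed by a separate serving process (playing the role of the GPU-backed model endpoint).

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubModelHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the GPU-backed model endpoint.

    A real deployment would run inference here; this stub just echoes the prompt.
    """

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length))
        body = json.dumps({"completion": "echo: " + request["prompt"]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Silence per-request logging to keep the demo output clean.
        pass


def query_model(base_url, prompt):
    """CPU-side client (the role Gym plays): POST a prompt, return the completion.

    The "/generate" path and payload fields are assumptions for this sketch.
    """
    request = urllib.request.Request(
        base_url + "/generate",
        data=json.dumps({"prompt": prompt}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Bypass any proxy configured in the environment so localhost is reached directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        return json.loads(response.read())["completion"]


def run_demo():
    """Start the stub endpoint on a free local port and query it once."""
    server = HTTPServer(("127.0.0.1", 0), StubModelHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        return query_model(f"http://127.0.0.1:{port}", "2+2?")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

In a co-located deployment the base URL would point at a host inside the same cluster; with separate clusters only the URL (and the network path behind it) would change, which is one reason the endpoint boundary is worth documenting explicitly.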
Priority
High - needed for training