Add Deployment Topology Documentation

Open cwing-nvidia opened this issue 1 month ago • 0 comments


Background

Users have asked whether NeMo Gym runs in the same cluster as NeMo RL, which suggests that the physical deployment model and the compute resources each component needs are unclear.

Problem

Our documentation doesn't explicitly cover:

- Whether Gym and RL should be co-located or run on separate clusters
- What compute resources each component needs (CPU vs. GPU)
- How the orchestration works between the two systems

Acceptance Criteria

[ ] Add a new "Deployment" section to documentation
[ ] Explicitly cover cluster co-location strategy (default: same cluster)
[ ] Document resource requirements:
    - NeMo Gym: CPU-only
    - Model serving (via NeMo RL): GPU-based
[ ] Explain the orchestration model (NeMo RL manages training and exposes an HTTP endpoint that Gym consumes)
[ ] Include guidance for when to use co-located vs. separate clusters (future: hybrid clusters)
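
To make the orchestration item above concrete for whoever writes the docs, here is a minimal sketch of the shape being described: a serving process exposes an HTTP endpoint, and a CPU-only client consumes it. Everything in it is an assumption for illustration. The `/generate` path, the JSON fields `prompt` and `completion`, and the names `StubModelHandler`, `start_stub_model_server`, and `gym_request_completion` are hypothetical and are not the NeMo RL or NeMo Gym API. The stub server only echoes the prompt; in a real deployment that role would be the GPU-backed model serving side.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubModelHandler(BaseHTTPRequestHandler):
    """Stand-in for the GPU-backed model endpoint (hypothetical, not NeMo RL's API)."""

    def do_POST(self):
        # Read the JSON request body sent by the client.
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length))
        # A real server would run model inference here; the stub just echoes.
        body = json.dumps({"completion": "echo: " + request["prompt"]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress per-request logging to keep the example quiet


def start_stub_model_server():
    """Start the stub on a free local port and return (server, endpoint_url)."""
    server = HTTPServer(("127.0.0.1", 0), StubModelHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    return server, f"http://{host}:{port}/generate"


def gym_request_completion(endpoint_url, prompt):
    """CPU-only client side: POST a prompt to the endpoint and return the completion."""
    payload = json.dumps({"prompt": prompt}).encode()
    request = urllib.request.Request(
        endpoint_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    # Bypass any system proxy settings so local or in-cluster URLs resolve directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        return json.loads(response.read())["completion"]


if __name__ == "__main__":
    server, url = start_stub_model_server()
    try:
        print(gym_request_completion(url, "hello"))
    finally:
        server.shutdown()
        server.server_close()
```

In this sketch the only thing the client needs is the endpoint URL, so whether the two sides share a cluster or run on separate ones reduces to whether that URL is reachable from the CPU nodes. That network reachability requirement is probably worth stating explicitly in the co-location guidance.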

Priority

High - needed for training

Nov 11 '25 01:11 cwing-nvidia