[trainer] feat: make max_colocate_count configurable in ResourcePoolManager

Open JobQiu opened this issue 1 week ago • 1 comments

Summary

Verified on Google Colab

  • Add max_colocate_count field to ResourcePoolManager dataclass
  • Add Ray version check using the packaging library (Ray >= 2.39.0 is required when max_colocate_count > 1)
  • Update documentation to clarify parameter usage

Background

The max_colocate_count parameter was hardcoded to 1 due to a Ray limitation (issue ray-project/ray#29811: "GPU placement group doesn't honor the bundle index").

This limitation was fixed in Ray PR ray-project/ray#48088, merged on 2024-11-07 and included in Ray >= 2.39.0.

Users can now configure this parameter to let multiple processes share the same GPU, which is useful for the Megatron backend with colocated Actor/Critic models.
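For illustration, the version gate described above could look roughly like the following. This is a minimal sketch, not the code from this PR: the function name `ray_supports_colocation` and the constant name are hypothetical, and the Ray version is passed in as a string (the real check would presumably read `ray.__version__`). It relies only on `packaging.version.Version`, which orders dev and rc pre-releases before the final release.

```python
from packaging.version import Version

# Minimum Ray release named in this PR; the constant name is illustrative.
MIN_RAY_VERSION_FOR_COLOCATION = Version("2.39.0")


def ray_supports_colocation(ray_version: str) -> bool:
    """Return True if the given Ray version string is >= 2.39.0.

    Hypothetical helper: pre-releases such as "2.39.0.dev0" or "2.39.0rc1"
    sort before "2.39.0" under PEP 440, so they are rejected.
    """
    return Version(ray_version) >= MIN_RAY_VERSION_FOR_COLOCATION
```

Passing the version as an argument keeps the check testable without installing a specific Ray release.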

Changes

  • verl/trainer/ppo/ray_trainer.py: Add configurable max_colocate_count field with Ray version check
  • test_colocate_colab.md: Test instructions for Colab

Usage

# Default (backward compatible)
resource_pool_manager = ResourcePoolManager(
    resource_pool_spec=spec,
    mapping=mapping
)  # max_colocate_count=1

# For Megatron with GPU sharing
resource_pool_manager = ResourcePoolManager(
    resource_pool_spec=spec,
    mapping=mapping,
    max_colocate_count=2  # Requires Ray >= 2.39.0
)
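As a rough sketch of how a dataclass field with a default of 1 stays backward compatible while still validating larger values, consider the simplified stand-in below. It is not verl's `ResourcePoolManager`: the class name `ResourcePoolManagerSketch`, the `ray_version` field, and the choice of `ValueError` are assumptions made so the example runs without Ray installed.

```python
from dataclasses import dataclass

from packaging.version import Version


@dataclass
class ResourcePoolManagerSketch:
    """Simplified, hypothetical stand-in for the real ResourcePoolManager."""

    resource_pool_spec: dict
    mapping: dict
    # Default of 1 preserves the previously hardcoded behavior.
    max_colocate_count: int = 1
    # Assumption: injected here for testability; real code would use ray.__version__.
    ray_version: str = "2.39.0"

    def __post_init__(self) -> None:
        if self.max_colocate_count < 1:
            raise ValueError("max_colocate_count must be >= 1")
        # Only values above 1 depend on the Ray fix shipped in 2.39.0.
        if self.max_colocate_count > 1 and Version(self.ray_version) < Version("2.39.0"):
            raise ValueError(
                f"max_colocate_count={self.max_colocate_count} requires Ray >= 2.39.0, "
                f"got {self.ray_version}"
            )
```

Existing callers that omit `max_colocate_count` never reach the version check, so they behave the same on older Ray releases.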

Test Results

Colab Notebook: https://colab.research.google.com/drive/16gIaB_lNTjaMYq46RdjrHUvloQ2fjMHn

Version Check Test

=== Testing Ray Version Check ===

Testing version comparisons:
  ✅ 2.38.0: False (expected False) - Old stable version
  ✅ 2.39.0: True (expected True) - Minimum required version
  ✅ 2.40.0: True (expected True) - Newer version
  ✅ 2.39.0.dev0: False (expected False) - Dev version (before release)
  ✅ 2.39.0rc1: False (expected False) - Release candidate
  ✅ 2.39.1: True (expected True) - Patch version
  ✅ 2.46.0: True (expected True) - Current version

✅ All tests passed!

GPU Sharing Test (Google Colab)

=== Test: ResourcePoolManager with max_colocate_count=2 ===

✅ ResourcePoolManager created with max_colocate_count=2
   World size: 1
   Max colocate count: 2

✅ Created RayWorkerGroup with 1 workers

Worker information:
   Rank 0: GPU=0, PID=1709

=== Results ===
All workers on same GPU: ✅
Different processes: ✅

🎉 PR TEST PASSED!
   ResourcePoolManager with max_colocate_count=2 works correctly
   1 workers sharing GPU 0

Test plan

  • [x] Verify backward compatibility (default max_colocate_count=1)
  • [x] Verify Ray version check raises error for Ray < 2.39.0
  • [x] Test with max_colocate_count > 1 on Ray >= 2.39.0

Closes #4058

JobQiu, Nov 19 '25 23:11

CLA assistant check
All committers have signed the CLA.

CLAassistant, Nov 19 '25 23:11