[RFC] split resource pool
Motivation
In certain scenarios, we need to use a single resource pool to initialize multiple instances.
https://github.com/volcengine/verl/pull/4226 and https://github.com/volcengine/verl/pull/4233 may introduce some issues. SubRayResourcePool may be a safer implementation because it does not modify RayResourcePool itself.
Proposed API change
class RayResourcePool:
    ...

    def split(self, split_size: int | list[int]) -> list["SubRayResourcePool"]:
        ...


class SubRayResourcePool(RayResourcePool):
    def __init__(
        self,
        start_bundle_idx: int,
        subgroup_world_size: int,
        **kwargs,
    ) -> None:
        super().__init__(**kwargs)
        self.start_bundle_idx = start_bundle_idx
        self.subgroup_world_size = subgroup_world_size

    @property
    def world_size(self):
        return self.subgroup_world_size
Expected Behavior:
The resource pool maintains the placement group and bundle information. Conceptually, it looks like:
resource_pool
    pgs = [pgs_1, pgs_2, ..., pgs_n]
        pgs_1 = [bundle_1, bundle_2, ..., bundle_8]
        pgs_2 = [bundle_1, bundle_2, ..., bundle_8]
        ...
        pgs_n = [bundle_1, bundle_2, ..., bundle_8]
To support logical subdivision of the resource pool, I will introduce two additional fields on each sub resource pool to track the slice of bundles it uses: subresource_pool.start_bundle_idx and subresource_pool.subgroup_world_size, which together act as an offset and a length.
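As a rough illustration of the offset/length idea, here is a minimal sketch that only computes the slices and does not touch Ray at all. `BundleSlice` and `plan_split` are hypothetical names used for this example, not part of verl. It also assumes that an integer `split_size` means "equal sub-pools of that many bundles each" and that a list of sizes must cover the pool exactly; the actual `split` may choose different rules.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BundleSlice:
    """Hypothetical stand-in for a sub-pool: an (offset, length) view over the bundles."""

    start_bundle_idx: int
    subgroup_world_size: int


def plan_split(world_size: int, split_size: int | list[int]) -> list[BundleSlice]:
    """Compute contiguous, non-overlapping bundle slices for a pool of `world_size` bundles."""
    if isinstance(split_size, int):
        # Assumption: an int means equal sub-pools of `split_size` bundles each.
        if split_size <= 0 or world_size % split_size != 0:
            raise ValueError("split_size must be positive and evenly divide world_size")
        sizes = [split_size] * (world_size // split_size)
    else:
        sizes = list(split_size)
        # Assumption: the requested sizes must cover the pool exactly.
        if any(size <= 0 for size in sizes) or sum(sizes) != world_size:
            raise ValueError("split sizes must be positive and sum to world_size")

    slices = []
    offset = 0
    for size in sizes:
        # Each slice starts where the previous one ended.
        slices.append(BundleSlice(start_bundle_idx=offset, subgroup_world_size=size))
        offset += size
    return slices


if __name__ == "__main__":
    # A pool of 16 bundles split unevenly, then evenly.
    print(plan_split(16, [4, 12]))
    print(plan_split(16, 8))
```

Because every slice is just an offset and a length over the same bundles, the sub-pools can share the parent's placement groups without copying or modifying them.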
@zw0610 @vermouth1992
Makes sense to me.